    Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds. (arXiv:2301.12485v2 [q-bio.BM] UPDATED)
    Proteins power a vast array of functional processes in living cells. The capability to create new proteins with designed structures and functions would thus enable the engineering of cellular behavior and development of protein-based therapeutics and materials. Structure-based protein design aims to find structures that are designable (can be realized by a protein sequence), novel (have dissimilar geometry from natural proteins), and diverse (span a wide range of geometries). While advances in protein structure prediction have made it possible to predict structures of novel protein sequences, the combinatorially large space of sequences and structures limits the practicality of search-based methods. Generative models provide a compelling alternative, by implicitly learning the low-dimensional structure of complex data distributions. Here, we leverage recent advances in denoising diffusion probabilistic models and equivariant neural networks to develop Genie, a generative model of protein structures that performs discrete-time diffusion using a cloud of oriented reference frames in 3D space. Through in silico evaluations, we demonstrate that Genie generates protein backbones that are more designable, novel, and diverse than existing models. This indicates that Genie is capturing key aspects of the distribution of protein structure space and facilitates protein design with high success rates. Code for generating new proteins and training new versions of Genie is available at https://github.com/aqlaboratory/genie.
    Anti-Symmetric DGN: a stable architecture for Deep Graph Networks. (arXiv:2210.09789v2 [cs.LG] UPDATED)
    Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomenon. This reduces their effectiveness, since predictive problems may require capturing interactions at different, and possibly large, radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN yields improved performance and enables effective learning even when dozens of layers are used.
    Knowledge Extraction in Low-Resource Scenarios: Survey and Perspective. (arXiv:2202.08063v3 [cs.CL] CROSS LISTED)
    Knowledge Extraction (KE), aiming to extract structural information from unstructured texts, often suffers from data scarcity and emerging unseen types, i.e., low-resource scenarios. Many neural approaches to low-resource KE have been widely investigated and have achieved impressive performance. In this paper, we present a literature review of KE in low-resource scenarios, and systematically categorize existing works into three paradigms: (1) exploiting higher-resource data, (2) exploiting stronger models, and (3) exploiting data and models together. In addition, we highlight promising applications and outline some potential directions for future research. We hope that our survey can help both the academic and industrial communities to better understand this field, inspire more ideas, and boost broader applications.
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias toward the so-called rich regime and that finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.
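    The finite-difference idea can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a regularized objective of the form L + (gamma/2)·||grad L||^2, a simple forward difference, and a hypothetical diagonal quadratic loss chosen so the result can be checked in closed form. The function names, `gamma`, and `eps` are all illustrative.

```python
def grad_loss(theta, curvatures):
    """Gradient of the toy loss L(theta) = 0.5 * sum(a_i * theta_i**2)."""
    return [a * t for a, t in zip(curvatures, theta)]


def gr_gradient_fd(theta, curvatures, gamma=0.1, eps=1e-3):
    """Approximate the gradient of L + (gamma/2) * ||grad L||^2 without a Hessian."""
    g = grad_loss(theta, curvatures)
    # Ascent step: move the parameters a small distance along the gradient.
    theta_up = [t + eps * gi for t, gi in zip(theta, g)]
    g_up = grad_loss(theta_up, curvatures)
    # (g_up - g) / eps approximates the Hessian-vector product H @ g.
    hvp = [(gu - gi) / eps for gu, gi in zip(g_up, g)]
    return [gi + gamma * h for gi, h in zip(g, hvp)]


def gr_gradient_exact(theta, curvatures, gamma=0.1):
    """Closed form for the toy loss: a_i*theta_i + gamma * a_i**2 * theta_i."""
    return [a * t + gamma * a * a * t for a, t in zip(curvatures, theta)]


theta = [1.0, -2.0]
curvatures = [3.0, 0.5]
fd = gr_gradient_fd(theta, curvatures)
exact = gr_gradient_exact(theta, curvatures)
```

    A descent step would then subtract a learning rate times `fd` from `theta`. Each update costs two gradient evaluations and no second-order backpropagation, which is where the saving in this sketch comes from.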
    A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs. (arXiv:2206.04798v2 [cs.AI] UPDATED)
    Reasoning on large-scale knowledge graphs has long been dominated by embedding methods. While path-based methods possess the inductive capacity that embeddings lack, they suffer from scalability issues due to the exponential number of paths. Here we present A*Net, a scalable path-based method for knowledge graph reasoning. Inspired by the A* algorithm for shortest path problems, our A*Net learns a priority function to select important nodes and edges at each iteration, to reduce time and memory footprint for both training and inference. The ratio of selected nodes and edges can be specified to trade off between performance and efficiency. Experiments on both transductive and inductive knowledge graph reasoning benchmarks show that A*Net achieves competitive performance with existing state-of-the-art path-based methods, while visiting merely 10% of nodes and 10% of edges at each iteration. On a million-scale dataset ogbl-wikikg2, A*Net achieves competitive performance with embedding methods and converges faster. To the best of our knowledge, A*Net is the first path-based method for knowledge graph reasoning at such a scale.
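    The ratio-based selection step can be sketched as follows. This is a minimal illustration rather than A*Net's implementation: the priority scores are hard-coded instead of learned, and `select_top_ratio` and the node names are hypothetical.

```python
import heapq
import math


def select_top_ratio(priorities, ratio):
    """Keep the highest-priority fraction of nodes (always at least one)."""
    k = max(1, math.ceil(ratio * len(priorities)))
    # nlargest returns the k keys with the largest priority values.
    return set(heapq.nlargest(k, priorities, key=priorities.get))


# Hypothetical priority scores for five nodes at one iteration.
priorities = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.2}
selected = select_top_ratio(priorities, ratio=0.4)
```

    Only the nodes in `selected` would be expanded in the next iteration, so lowering `ratio` trades accuracy for time and memory.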
    One-shot domain adaptation in video-based assessment of surgical skills. (arXiv:2301.00812v2 [cs.CV] UPDATED)
    Deep Learning (DL) has achieved automatic and objective assessment of surgical skills. However, DL models are data-hungry and restricted to their training domain. This prevents them from transitioning to new tasks where data is limited. Hence, domain adaptation is crucial to implement DL in real life. Here, we propose a meta-learning model, A-VBANet, that can deliver domain-agnostic surgical skill classification via one-shot learning. We develop the A-VBANet on five laparoscopic and robotic surgical simulators. Additionally, we test it on operating room (OR) videos of laparoscopic cholecystectomy. Our model successfully adapts with accuracies up to 99.5% in one-shot and 99.9% in few-shot settings for simulated tasks and 89.7% for laparoscopic cholecystectomy. For the first time, we provide a domain-agnostic procedure for video-based assessment of surgical skills. A significant implication of this approach is that it allows the use of data from surgical simulators to assess performance in the operating room.
    Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners. (arXiv:2212.04979v2 [cs.CV] UPDATED)
    This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates $N$ token embeddings per frame, for a total of $T$ video frames. We flatten the $(N \times T)$ token embeddings into a long sequence of frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT, ActivityNet Captions and VATEX. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA) and video captioning (MSR-VTT, ActivityNet, VATEX, Youcook2). Our approach establishes a simple video-text baseline for future research.
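    The flatten-then-pool step can be sketched in a few lines. This is an illustration under simplifying assumptions, not CoCa's pooler: it uses a single query and a single head with no learned projections, whereas the actual pooling layers are learned attention modules. The function names and the toy embeddings are hypothetical.

```python
import math


def flatten_frames(frame_tokens):
    """Turn T frames of N token embeddings each into one sequence of N*T tokens."""
    return [token for frame in frame_tokens for token in frame]


def attention_pool(query, tokens):
    """Single-query dot-product attention over a token sequence."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the token embeddings, one output value per dimension.
    return [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]


# Toy video: T = 2 frames, N = 3 tokens per frame, embedding size 2.
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]],
]
sequence = flatten_frames(video)
pooled = attention_pool([0.0, 0.0], sequence)
```

    Because the pooler only sees a sequence of tokens, it is indifferent to whether they came from one image or from several frames, which is why image-trained pooling weights can be reused here without modification.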
    On the Opportunity of Causal Deep Generative Models: A Survey and Future Directions. (arXiv:2301.12351v2 [cs.LG] UPDATED)
    Deep generative models have gained popularity in recent years due to their ability to accurately replicate inherent empirical distributions and yield novel samples. In particular, certain advances are proposed wherein the model engenders data examples following specified attributes. Nevertheless, several challenges still exist and are to be overcome, i.e., difficulty in extrapolating out-of-sample data and insufficient learning of disentangled representations. Structural causal models (SCMs), on the other hand, encapsulate the causal factors that govern a generative process and characterize a generative model based on causal relationships, providing crucial insights for addressing the current obstacles in deep generative models. In this paper, we present a comprehensive survey of Causal deep Generative Models (CGMs), which combine SCMs and deep generative models in a way that boosts several trustworthy properties such as robustness, fairness, and interpretability. We provide an overview of the recent advances in CGMs, categorize them based on generative types, and discuss how causality is introduced into the family of deep generative models. We also explore potential avenues for future research in this field.
    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into training, validation, and evaluation sets, and only a single confidence interval is computed for the prediction performance of the final selected model. We instead propose an algorithm for computing valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparable to bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.
    When Data Geometry Meets Deep Function: Generalizing Offline Reinforcement Learning. (arXiv:2205.11027v2 [cs.LG] UPDATED)
    In offline reinforcement learning (RL), one detrimental issue to policy learning is the error accumulation of deep Q function in out-of-distribution (OOD) areas. Unfortunately, existing offline RL methods are often over-conservative, inevitably hurting generalization performance outside data distribution. In our study, one interesting observation is that deep Q functions approximate well inside the convex hull of training data. Inspired by this, we propose a new method, DOGE (Distance-sensitive Offline RL with better GEneralization). DOGE marries dataset geometry with deep function approximators in offline RL, and enables exploitation in generalizable OOD areas rather than strictly constraining policy within data distribution. Specifically, DOGE trains a state-conditioned distance function that can be readily plugged into standard actor-critic methods as a policy constraint. Simple yet elegant, our algorithm enjoys better generalization compared to state-of-the-art methods on D4RL benchmarks. Theoretical analysis demonstrates the superiority of our approach to existing methods that are solely based on data distribution or support constraints.
    S$^3$NN: Time Step Reduction of Spiking Surrogate Gradients for Training Energy Efficient Single-Step Spiking Neural Networks. (arXiv:2201.10879v2 [cs.LG] UPDATED)
    As the scale of neural networks increases, techniques that enable them to run with low computational cost and high energy efficiency are required. In response to such demands, various efficient neural network paradigms, such as spiking neural networks (SNNs) or binary neural networks (BNNs), have been proposed. However, they have persistent drawbacks, such as degraded inference accuracy and latency. To solve these problems, we propose a single-step spiking neural network (S$^3$NN), an energy-efficient neural network with low computational cost and high precision. The proposed S$^3$NN processes the information between hidden layers using spikes, as SNNs do. Nevertheless, it has no temporal dimension, so, like BNNs, it incurs no latency in the training and inference phases. Thus, the proposed S$^3$NN has a lower computational cost than SNNs that require time-series processing. However, S$^3$NN cannot adopt naïve backpropagation algorithms due to the non-differentiable nature of spikes. We deduce a suitable neuron model by reducing the surrogate gradient for multi-time step SNNs to a single-time step. We experimentally demonstrated that the obtained surrogate gradient allows S$^3$NN to be trained appropriately. We also showed that the proposed S$^3$NN could achieve comparable accuracy to full-precision networks while being highly energy-efficient.
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.
    Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. (arXiv:2206.09889v3 [cs.MA] UPDATED)
    We introduce Nocturne, a new 2D driving simulator for investigating multi-agent coordination under partial observability. The focus of Nocturne is to enable research into inference and theory of mind in real-world multi-agent settings without the computational overhead of computer vision and feature extraction from images. Agents in this simulator only observe an obstructed view of the scene, mimicking human visual sensing constraints. Unlike existing benchmarks that are bottlenecked by rendering human-like observations directly using a camera input, Nocturne uses efficient intersection methods to compute a vectorized set of visible features in a C++ back-end, allowing the simulator to run at over 2000 steps-per-second. Using open-source trajectory and map data, we construct a simulator to load and replay arbitrary trajectories and scenes from real-world driving data. Using this environment, we benchmark reinforcement-learning and imitation-learning agents and demonstrate that the agents are quite far from human-level coordination ability and deviate significantly from the expert trajectories.
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 and 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world datasets.
    A Systematic Survey of Molecular Pre-trained Models. (arXiv:2210.16484v2 [cs.LG] UPDATED)
    Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications, ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where DNNs are pre-trained using large-scale unlabeled molecular databases and then fine-tuned over specific downstream tasks. Despite this progress, a systematic review of this fast-growing field is lacking. In this paper, we present the first survey that summarizes the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate MPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also highlight the challenges and promising avenues for future research, providing a useful resource for both machine learning and scientific communities.
    Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences. (arXiv:2210.15906v3 [cs.AI] UPDATED)
    Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement on AI agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .
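    The label-shift correction amounts to reweighting the source posterior by the ratio of target to source priors, p_t(y, z | x) ∝ p_s(y, z | x) · p_t(y, z) / p_s(y, z). The sketch below assumes the unknown target prior is estimated with a standard EM iteration over unlabeled target samples; this is an illustrative choice and the repository linked above should be consulted for the authors' procedure. Each joint class index stands for one (y, z) pair, and `adapt_posteriors` and the toy numbers are hypothetical.

```python
def adapt_posteriors(source_posteriors, source_prior, n_iters=50):
    """EM-style label-shift correction over joint classes c = (y, z).

    source_posteriors: one row per unlabeled target sample, holding p_s(c | x).
    source_prior: p_s(c), the class prior on the source domain.
    Returns (estimated target prior, adapted posteriors).
    """
    n_classes = len(source_prior)
    target_prior = list(source_prior)
    adapted = [list(row) for row in source_posteriors]
    for _ in range(n_iters):
        # E-step: reweight by target_prior / source_prior, then renormalize.
        adapted = []
        for row in source_posteriors:
            w = [p * tp / sp for p, tp, sp in zip(row, target_prior, source_prior)]
            total = sum(w)
            adapted.append([v / total for v in w])
        # M-step: the new target prior is the mean adapted posterior.
        target_prior = [
            sum(row[c] for row in adapted) / len(adapted) for c in range(n_classes)
        ]
    return target_prior, adapted


# Toy example: two joint classes, balanced on the source domain.
source_prior = [0.5, 0.5]
source_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
target_prior, adapted = adapt_posteriors(source_posteriors, source_prior)
```

    No classifier weights change in this procedure; only the output probabilities are rescaled, so it can run at test time on a batch of unlabeled inputs.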
    IR-MCL: Implicit Representation-Based Online Global Localization. (arXiv:2210.03113v2 [cs.RO] UPDATED)
    Determining the state of a mobile robot is an essential building block of robot navigation systems. In this paper, we address the problem of estimating the robot's pose in an indoor environment using 2D LiDAR data and investigate how modern environment models can improve gold-standard Monte-Carlo localization (MCL) systems. We propose a neural occupancy field to implicitly represent the scene using a neural network. With the pretrained network, we can synthesize 2D LiDAR scans for an arbitrary robot pose through volume rendering. Based on the implicit representation, we can obtain the similarity between a synthesized and actual scan as an observation model and integrate it into an MCL system to perform accurate localization. We evaluate our approach on self-recorded datasets and three publicly available ones. We show that we can accurately and efficiently localize a robot using our approach, surpassing the localization performance of state-of-the-art methods. The experiments suggest that the presented implicit representation is able to predict more accurate 2D LiDAR scans, leading to an improved observation model for our particle filter-based localization. The code of our approach will be available at: https://github.com/PRBonn/ir-mcl.
    Causal Modeling of Policy Interventions From Sequences of Treatments and Outcomes using Gaussian Processes. (arXiv:2209.04142v4 [cs.LG] UPDATED)
    A treatment policy defines when and what treatments are applied to affect some outcome of interest. Data-driven decision-making requires the ability to predict what happens if a policy is changed. Existing methods that predict how the outcome evolves under different scenarios assume that the tentative sequences of future treatments are fixed in advance, while in practice the treatments are determined stochastically by a policy and may depend for example on the efficiency of previous treatments. Therefore, the current methods are not applicable if the treatment policy is unknown or a counterfactual analysis is needed. To handle these limitations, we model the treatments and outcomes jointly in continuous time, by combining Gaussian processes and point processes. Our model enables the estimation of a treatment policy from observational sequences of treatments and outcomes, and it can predict the interventional and counterfactual progression of the outcome after an intervention on the treatment policy (in contrast with the causal effect of a single treatment). We show with real-world and semi-synthetic data on blood glucose progression that our method can answer causal queries more accurately than existing alternatives.
    Vicarious Offense and Noise Audit of Offensive Speech Classifiers. (arXiv:2301.12534v2 [cs.CL] UPDATED)
    This paper examines social web content moderation from two key perspectives: automated methods (machine moderators) and human evaluators (human moderators). We conduct a noise audit at an unprecedented scale using nine machine moderators trained on well-known offensive speech data sets evaluated on a corpus sampled from 92 million YouTube comments discussing a multitude of issues relevant to US politics. We introduce a first-of-its-kind data set of vicarious offense. We ask annotators: (1) if they find a given social media post offensive; and (2) how offensive annotators sharing different political beliefs would find the same content. Our experiments with machine moderators reveal that moderation outcomes vary widely across different machine moderators. Our experiments with human moderators suggest that (1) political leanings considerably affect first-person offense perspective; (2) Republicans are the worst predictors of vicarious offense; (3) predicting vicarious offense for the Republicans is more challenging than predicting vicarious offense for the Independents and the Democrats; and (4) disagreement across political identity groups considerably increases when sensitive issues such as reproductive rights or gun control/rights are discussed. Both experiments suggest that offense is, indeed, highly subjective and raise important questions concerning content moderation practices.
    PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. (arXiv:2210.07182v4 [cs.LG] UPDATED)
    Machine learning-based modeling of physical systems has experienced increased interest in recent years. Despite some impressive progress, there is still a lack of benchmarks for Scientific ML that are easy to use but still challenging and representative of a wide range of problems. We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs). PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines. Our proposed set of benchmark problems contributes the following unique features: (1) a much wider range of PDEs compared to existing benchmarks, ranging from relatively common examples to more realistic and difficult problems; (2) much larger ready-to-use datasets compared to prior work, comprising multiple simulation runs across a larger number of initial and boundary conditions and PDE parameters; (3) more extensible source code with user-friendly APIs for data generation and baseline results with popular machine learning models (FNO, U-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to extend the benchmark freely for their own purposes using a standardized API and to compare the performance of new models to existing baseline methods. We also propose new evaluation metrics with the aim to provide a more holistic understanding of learning methods in the context of Scientific ML. With those metrics we identify tasks which are challenging for recent ML methods and propose these tasks as future challenges for the community. The code is available at https://github.com/pdebench/PDEBench.
    SketchySGD: Reliable Stochastic Optimization via Robust Curvature Estimates. (arXiv:2211.08597v3 [math.OC] UPDATED)
    We introduce SketchySGD, a stochastic quasi-Newton method that uses sketching to approximate the curvature of the loss function. SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems show that SketchySGD can achieve results comparable to or better than those of popular stochastic gradient methods with minimal hyperparameter tuning. The robustness of SketchySGD to hyperparameters is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance.
    RL4ReAl: Reinforcement Learning for Register Allocation. (arXiv:2204.02013v2 [cs.LG] UPDATED)
    We aim to automate decades of research and experience in register allocation, leveraging machine learning. We tackle this problem by embedding a multi-agent reinforcement learning algorithm within LLVM, training it with state-of-the-art techniques. We formalize the constraints that precisely define the problem for a given instruction-set architecture, while ensuring that the generated code preserves semantic correctness. We also develop a gRPC-based framework providing a modular and efficient compiler interface for training and inference. Our approach is architecture independent: we show experimental results targeting Intel x86 and ARM AArch64. Our results match or outperform the heavily tuned, production-grade register allocators of LLVM.
    SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. (arXiv:2210.01108v2 [cs.CL] UPDATED)
    We propose MINT, a new Multilingual INTimacy analysis dataset covering 13,372 tweets in 10 languages including English, French, Spanish, Italian, Portuguese, Korean, Dutch, Chinese, Hindi, and Arabic. We benchmarked a list of popular multilingual pre-trained language models. The dataset is released along with the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis (https://sites.google.com/umich.edu/semeval-2023-tweet-intimacy).
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.
    Conditional Antibody Design as 3D Equivariant Graph Translation. (arXiv:2208.06073v4 [q-bio.BM] UPDATED)
    Antibody design is valuable for therapeutic usage and biological research. Existing deep-learning-based methods encounter several key issues: 1) incomplete context for generating Complementarity-Determining Regions (CDRs); 2) inability to capture the entire 3D geometry of the input structure; 3) inefficient prediction of the CDR sequences in an autoregressive manner. In this paper, we propose Multi-channel Equivariant Attention Network (MEAN) to co-design 1D sequences and 3D structures of CDRs. To be specific, MEAN formulates antibody design as a conditional graph translation problem by importing extra components including the target antigen and the light chain of the antibody. Then, MEAN applies E(3)-equivariant message passing along with a proposed attention mechanism to better capture the geometrical correlation between different components. Finally, it outputs both the 1D sequences and 3D structure via a multi-round progressive full-shot scheme, which is more efficient and precise than previous autoregressive approaches. Our method significantly surpasses state-of-the-art models in sequence and structure modeling, antigen-binding CDR design, and binding affinity optimization. Specifically, the relative improvement over baselines is about 23% in antigen-binding CDR design and 34% for affinity optimization.
    Testing Rare Downstream Safety Violations via Upstream Adaptive Sampling of Perception Error Models. (arXiv:2209.09674v3 [cs.RO] UPDATED)
    Testing black-box perceptual-control systems in simulation faces two difficulties. Firstly, perceptual inputs in simulation lack the fidelity of real-world sensor inputs. Secondly, for a reasonably accurate perception system, encountering a rare failure trajectory may require running infeasibly many simulations. This paper combines perception error models -- surrogates for a sensor-based detection system -- with state-dependent adaptive importance sampling. This allows us to efficiently assess the rare failure probabilities for real-world perceptual control systems within simulation. Our experiments with an autonomous braking system equipped with an RGB obstacle-detector show that our method can calculate accurate failure probabilities with an inexpensive number of simulations. Further, we show how choice of safety metric can influence the process of learning proposal distributions capable of reliably sampling high-probability failures.
    Physically Consistent Learning of Conservative Lagrangian Systems with Gaussian Processes. (arXiv:2206.12272v3 [cs.LG] UPDATED)
    This paper proposes a physically consistent Gaussian Process (GP) enabling the identification of uncertain Lagrangian systems. The function space is tailored according to the energy components of the Lagrangian and the differential equation structure, analytically guaranteeing physical and mathematical properties such as energy conservation and quadratic form. The novel formulation of Cholesky decomposed matrix kernels allows the probabilistic preservation of positive definiteness. Only differential input-to-output measurements of the function map are required while Gaussian noise is permitted in torques, velocities, and accelerations. We demonstrate the effectiveness of the approach in numerical simulation.
    Socially Fair Reinforcement Learning. (arXiv:2208.12584v2 [cs.LG] UPDATED)
    We consider the problem of episodic reinforcement learning where there are multiple stakeholders with different reward functions. Our goal is to output a policy that is socially fair with respect to different reward functions. Prior works have proposed different objectives that a fair policy must optimize, including minimum welfare and generalized Gini welfare. We first take an axiomatic view of the problem, and propose four axioms that any such fair objective must satisfy. We show that the Nash social welfare is the unique objective that satisfies all four axioms, whereas prior objectives fail to satisfy all four. We then consider the learning version of the problem where the underlying model, i.e., the Markov decision process, is unknown. We consider the problem of minimizing regret with respect to the fair policies maximizing three different fair objectives -- minimum welfare, generalized Gini welfare, and Nash social welfare. Based on optimistic planning, we propose a generic learning algorithm and derive its regret bound with respect to the three different policies. For the objective of Nash social welfare, we also derive a lower bound in regret that grows exponentially with $n$, the number of agents. Finally, we show that for the objective of minimum welfare, one can improve regret by a factor of $O(H)$ for a weaker notion of regret.
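    The three fair objectives named above can be made concrete with a small sketch. It assumes the textbook definitions (minimum over agents, a weighted sum of sorted values with non-increasing weights, and the product of values); the paper's exact normalizations may differ, and the agent values and weights below are made up for illustration.

```python
import math
from typing import Sequence


def minimum_welfare(values: Sequence[float]) -> float:
    """Welfare of the worst-off agent."""
    return min(values)


def generalized_gini_welfare(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of values sorted in increasing order.

    With non-increasing weights, the worst-off agents receive the largest weight.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values)
    return sum(w * v for w, v in zip(weights, ordered))


def nash_social_welfare(values: Sequence[float]) -> float:
    """Product of the agents' values (some authors use the geometric mean instead)."""
    return math.prod(values)


# Hypothetical expected returns of one policy for three stakeholders.
agent_values = [1.0, 4.0, 2.0]
gini_weights = [0.5, 0.3, 0.2]  # non-increasing, illustrative only

print("minimum welfare:", minimum_welfare(agent_values))
print("generalized Gini welfare:", generalized_gini_welfare(agent_values, gini_weights))
print("Nash social welfare:", nash_social_welfare(agent_values))
```

    A fair policy under each objective would be the one maximizing the corresponding function of the per-agent values.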
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.
    Searching for the Essence of Adversarial Perturbations. (arXiv:2205.15357v3 [cs.LG] UPDATED)
    Neural networks have demonstrated state-of-the-art performance in various machine learning fields. However, the introduction of malicious perturbations in input data, known as adversarial examples, has been shown to deceive neural network predictions. This poses potential risks for real-world applications such as autonomous driving and text identification. In order to mitigate these risks, a comprehensive understanding of the mechanisms underlying adversarial examples is essential. In this study, we demonstrate that adversarial perturbations contain human-recognizable information, which is the key conspirator responsible for a neural network's incorrect prediction, in contrast to the widely held belief that human-unidentifiable characteristics play a critical role in fooling a network. This concept of human-recognizable characteristics enables us to explain key features of adversarial perturbations, including their existence, transferability among different neural networks, and increased interpretability for adversarial training. We also uncover two unique properties of adversarial perturbations that deceive neural networks: masking and generation. Additionally, a special class, the complementary class, is identified when neural networks classify input images. The presence of human-recognizable information in adversarial perturbations allows researchers to gain insight into the working principles of neural networks and may lead to the development of techniques for detecting and defending against adversarial attacks.
    GOOD: Exploring Geometric Cues for Detecting Objects in an Open World. (arXiv:2212.11720v3 [cs.CV] UPDATED)
    We address the task of open-world class-agnostic object detection, i.e., detecting every object in an image by learning from a limited number of base object classes. State-of-the-art RGB-based models suffer from overfitting the training classes and often fail at detecting novel-looking objects. This is because RGB-based models primarily rely on appearance similarity to detect novel objects and are also prone to overfitting short-cut cues such as textures and discriminative parts. To address these shortcomings of RGB-based object detectors, we propose incorporating geometric cues such as depth and normals, predicted by general-purpose monocular estimators. Specifically, we use the geometric cues to train an object proposal network for pseudo-labeling unannotated novel objects in the training set. Our resulting Geometry-guided Open-world Object Detector (GOOD) significantly improves detection recall for novel object categories and already performs well with only a few training classes. Using a single "person" class for training on the COCO dataset, GOOD surpasses SOTA methods by 5.0% AR@100, a relative improvement of 24%.  ( 2 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Overcoming the Long Horizon Barrier for Sample-Efficient Reinforcement Learning with Latent Low-Rank Structure. (arXiv:2206.03569v3 [cs.LG] UPDATED)
    The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an $\epsilon$-optimal policy is $\tilde{\Omega}\left(|S||A|H^3 / \epsilon^2\right)$ over worst case instances of an MDP with state space $S$, action space $A$, and horizon $H$. We consider a class of MDPs for which the associated optimal $Q^*$ function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in $|S|$ and $|A|$ due to the low rank structure, we show that without imposing further assumptions beyond low rank of $Q^*$, if one is constrained to estimate the $Q$ function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon $H$ to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of $\tilde{O}\left((|S|+|A|)\mathrm{poly}(d,H)/\epsilon^2\right)$ for a rank $d$ setting, which is minimax optimal with respect to the scaling of $|S|, |A|$, and $\epsilon$. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.
    Self-Programming Artificial Intelligence Using Code-Generating Language Models. (arXiv:2205.00167v2 [cs.AI] UPDATED)
    Recent progress in large-scale language models has enabled breakthroughs in previously intractable computer programming tasks. Prior work in meta-learning and neural architecture search has led to substantial successes across various task domains, spawning myriad approaches for algorithmically optimizing the design and learning dynamics of deep learning models. At the intersection of these research areas, we implement a code-generating language model with the ability to modify its own source code. Self-programming AI algorithms have been of interest since the dawn of AI itself. Although various theoretical formulations of generalized self-programming AI have been posed, no such system has been successfully implemented to date under real-world computational constraints. Applying AI-based code generation to AI itself, we develop and experimentally validate the first practical implementation of a self-programming AI system. We empirically show that a self-programming AI implemented using a code generation model can successfully modify its own source code to improve performance and program sub-models to perform auxiliary tasks. Our model can self-modify various properties including model architecture, computational capacity, and learning dynamics.  ( 2 min )
    PyGlove: Efficiently Exchanging ML Ideas as Code. (arXiv:2302.01918v1 [cs.LG])
    The increasing complexity and scale of machine learning (ML) has led to the need for more efficient collaboration among multiple teams. For example, when a research team invents a new architecture like "ResNet," it is desirable for multiple engineering teams to adopt it. However, the effort required for each team to study and understand the invention does not scale well with the number of teams or inventions. In this paper, we present an extension of our PyGlove library to easily and scalably share ML ideas. PyGlove represents ideas as symbolic rule-based patches, enabling researchers to write down the rules for models they have not seen. For example, an inventor can write rules that will "add skip-connections." This permits a network effect among teams: at once, any team can issue patches to all other teams. Such a network effect allows users to quickly surmount the cost of adopting PyGlove by writing less code quicker, providing a benefit that scales with time. We describe the new paradigm of organizing ML through symbolic patches and compare it to existing approaches. We also perform a case study of a large codebase where PyGlove led to an 80% reduction in the number of lines of code.  ( 2 min )
    Fast Feature Selection with Fairness Constraints. (arXiv:2202.13718v2 [cs.LG] UPDATED)
    We study the fundamental problem of selecting optimal features for model construction. This problem is computationally challenging on large datasets, even with the use of greedy algorithm variants. To address this challenge, we extend the adaptive query model, recently proposed for the greedy forward selection for submodular functions, to the faster paradigm of Orthogonal Matching Pursuit for non-submodular functions. The proposed algorithm achieves exponentially fast parallel run time in the adaptive query model, scaling much better than prior work. Furthermore, our extension allows the use of downward-closed constraints, which can be used to encode certain fairness criteria into the feature selection process. We prove strong approximation guarantees for the algorithm based on standard assumptions. These guarantees are applicable to many parametric models, including Generalized Linear Models. Finally, we demonstrate empirically that the proposed algorithm competes favorably with state-of-the-art techniques for feature selection, on real-world and synthetic datasets.  ( 2 min )
    Fast Bayesian Optimization of Needle-in-a-Haystack Problems using Zooming Memory-Based Initialization (ZoMBI). (arXiv:2208.13771v2 [cs.LG] UPDATED)
    Needle-in-a-Haystack problems exist across a wide range of applications including rare disease prediction, ecological resource management, fraud detection, and material property optimization. A Needle-in-a-Haystack problem arises when there is an extreme imbalance of optimum conditions relative to the size of the dataset. For example, only $0.82\%$ out of $146$k total materials in the open-access Materials Project database have a negative Poisson's ratio. However, current state-of-the-art optimization algorithms are not designed with the capabilities to find solutions to these challenging multidimensional Needle-in-a-Haystack problems, resulting in slow convergence to a global optimum or pigeonholing into a local minimum. In this paper, we present a Zooming Memory-Based Initialization algorithm, entitled ZoMBI. ZoMBI actively extracts knowledge from the previously best-performing evaluated experiments to iteratively zoom in the sampling search bounds towards the global optimum "needle" and then prunes the memory of low-performing historical experiments to accelerate compute times by reducing the algorithm time complexity from $O(n^3)$ to $O(\phi^3)$ for $\phi$ forward experiments per activation, which trends to a constant $O(1)$ over several activations. Additionally, ZoMBI implements two custom adaptive acquisition functions to further guide the sampling of new experiments toward the global optimum. We validate the algorithm's optimization performance on three real-world datasets exhibiting Needle-in-a-Haystack and further stress-test the algorithm's performance on an additional 174 analytical datasets. The ZoMBI algorithm demonstrates compute time speed-ups of 400x compared to traditional Bayesian optimization as well as efficiently discovering optima in under 100 experiments that are up to 3x more highly optimized than those discovered by similar methods MiP-EGO, TuRBO, and HEBO.
    Accelerometry-based classification of circulatory states during out-of-hospital cardiac arrest. (arXiv:2205.06540v2 [eess.SP] UPDATED)
    Objective: Exploit accelerometry data for automatic, reliable, and prompt detection of spontaneous circulation during cardiac arrest, as this is both vital for patient survival and practically challenging. Methods: We developed a machine learning algorithm to automatically predict the circulatory state during cardiopulmonary resuscitation from 4-second-long snippets of accelerometry and electrocardiogram (ECG) data from pauses of chest compressions of real-world defibrillator records. The algorithm was trained based on 422 cases from the German Resuscitation Registry, for which ground truth labels were created through manual annotation by physicians. It uses a kernelized Support Vector Machine classifier based on 49 features, which partially reflect the correlation between accelerometry and electrocardiogram data. Results: Evaluating 50 different test-training data splits, the proposed algorithm exhibits a balanced accuracy of 81.2%, a sensitivity of 80.6%, and a specificity of 81.8%, whereas using only ECG leads to a balanced accuracy of 76.5%, a sensitivity of 80.2%, and a specificity of 72.8%. Conclusion: The first method employing accelerometry for pulse/no-pulse decision yields a significant increase in performance compared to single ECG-signal usage. Significance: This shows that accelerometry provides relevant information for pulse/no-pulse decisions. In application, such an algorithm may be used to simplify retrospective annotation for quality management and, moreover, to support clinicians in assessing circulatory state during cardiac arrest treatment.
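    The reported figures can be cross-checked with a short calculation, assuming the usual binary definition of balanced accuracy as the mean of sensitivity and specificity. The helper name is hypothetical; the numbers are the percentages quoted in the abstract.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity (true-positive rate) and specificity (true-negative rate)."""
    return (sensitivity + specificity) / 2.0


# Percentages quoted in the abstract.
combined = balanced_accuracy(80.6, 81.8)  # accelerometry plus ECG
ecg_only = balanced_accuracy(80.2, 72.8)  # ECG alone

print(f"accelerometry + ECG: {combined:.1f}%")
print(f"ECG only:            {ecg_only:.1f}%")
print(f"difference:          {combined - ecg_only:.1f} percentage points")
```

    Under that definition the two values come out at 81.2% and 76.5%, matching the abstract, so nearly all of the gain comes from the higher specificity (81.8% versus 72.8%).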
    The Solvability of Interpretability Evaluation Metrics. (arXiv:2205.08696v2 [cs.LG] UPDATED)
    Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, which can be solved by beam search. This observation leads to the obvious yet unaddressed question: why do we use explainers (e.g., LIME) not based on solving the target metric, if the metric value represents explanation quality? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models of text, image and tabular domains.  ( 2 min )
    LEAF: Navigating Concept Drift in Cellular Networks. (arXiv:2109.03011v5 [cs.NI] UPDATED)
    Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval -- thus necessitating practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift, explaining the features and time intervals that contribute the most to it, and mitigating it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.
    Online Verification of Deep Neural Networks under Domain Shift or Network Updates. (arXiv:2106.12732v2 [cs.LG] UPDATED)
    Although neural networks are widely used, it remains challenging to formally verify the safety and robustness of neural networks in real-world applications. Existing methods are designed to verify the network before deployment and are limited to relatively simple specifications and fixed networks. These methods are not ready to be applied to real-world problems with complex and/or dynamically changing specifications and networks. To effectively handle such problems, verification needs to be performed online when these changes take place. However, it is still challenging to run existing verification algorithms online. Our key insight is that we can leverage the temporal dependencies of these changes to accelerate the verification process. This paper establishes a novel framework for scalable online verification to solve real-world verification problems with dynamically changing specifications and/or networks. We propose three types of acceleration algorithms: Branch Management to reduce repetitive computation, Perturbation Tolerance to tolerate changes, and Incremental Computation to reuse previous results. Experimental results show that our algorithms achieve up to $100\times$ acceleration, and thus show a promising way to extend neural network verification to real-world applications.
    Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality. (arXiv:2205.13521v2 [cs.AI] UPDATED)
    Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.  ( 2 min )
    Towards Optimal Branching of Linear and Semidefinite Relaxations for Neural Network Robustness Certification. (arXiv:2101.09306v2 [cs.LG] UPDATED)
    In this paper, we study certifying the robustness of ReLU neural networks against adversarial input perturbations. To diminish the relaxation error suffered by the popular linear programming (LP) and semidefinite programming (SDP) certification methods, we take a branch-and-bound approach to propose partitioning the input uncertainty set and solving the relaxations on each part separately. We show that this approach reduces relaxation error, and that the error is eliminated entirely upon performing an LP relaxation with a partition intelligently designed to exploit the nature of the ReLU activations. To scale this approach to large networks, we consider using a coarser partition whereby the number of parts in the partition is reduced. We prove that computing such a coarse partition that directly minimizes the LP relaxation error is NP-hard. By instead minimizing the worst-case LP relaxation error, we develop a closed-form branching scheme. We extend the analysis to the SDP, where the feasible set geometry is exploited to design a branching scheme that minimizes the worst-case SDP relaxation error. Experiments on MNIST, CIFAR-10, and Wisconsin breast cancer diagnosis classifiers demonstrate significant increases in the percentages of test samples certified. By independently increasing the input size and the number of layers, we empirically illustrate under which regimes the branched LP and branched SDP are best applied.  ( 2 min )
    Aligning Robot and Human Representations. (arXiv:2302.01928v1 [cs.RO])
    To act in the world, robots rely on a representation of salient task aspects: for example, to carry a cup of coffee, a robot must consider movement efficiency and cup orientation in its behaviour. However, if we want robots to act for and with people, their representations must not be just functional but also reflective of what humans care about, i.e. their representations must be aligned with humans'. In this survey, we posit that current reward and imitation learning approaches suffer from representation misalignment, where the robot's learned representation does not capture the human's representation. We suggest that because humans will be the ultimate evaluators of robot performance in the world, it is critical that we explicitly focus our efforts on aligning learned task representations with humans, in addition to learning the downstream task. We advocate that current representation learning approaches in robotics should be studied from the perspective of how well they accomplish the objective of representation alignment. To do so, we mathematically define the problem, identify its key desiderata, and situate current robot learning methods within this formalism. We conclude the survey by suggesting future directions for exploring open challenges.
    Enhancing Once-For-All: A Study on Parallel Blocks, Skip Connections and Early Exits. (arXiv:2302.01888v1 [cs.LG])
    The use of Neural Architecture Search (NAS) techniques to automate the design of neural networks has become increasingly popular in recent years. The proliferation of devices with different hardware characteristics using such neural networks, as well as the need to reduce the power consumption for their search, has led to the realisation of Once-For-All (OFA), an eco-friendly algorithm characterised by the ability to generate easily adaptable models through a single learning process. In order to improve this paradigm and develop high-performance yet eco-friendly NAS techniques, this paper presents OFAv2, the extension of OFA aimed at improving its performance while maintaining the same ecological advantage. The algorithm is improved from an architectural point of view by including early exits, parallel blocks and dense skip connections. The training process is extended by two new phases called Elastic Level and Elastic Height. A new Knowledge Distillation technique is presented to handle multi-output networks, and finally a new strategy for dynamic teacher network selection is proposed. These modifications allow OFAv2 to improve its accuracy performance on the Tiny ImageNet dataset by up to 12.07% compared to the original version of OFA, while maintaining the algorithm flexibility and advantages.  ( 2 min )
    Unsupervised hierarchical clustering using the learning dynamics of RBMs. (arXiv:2302.01851v1 [cs.LG])
    Datasets in the real world are often complex and to some degree hierarchical, with groups and sub-groups of data sharing common characteristics at different levels of abstraction. Understanding and uncovering the hidden structure of these datasets is an important task that has many practical applications. To address this challenge, we present a new and general method for building relational data trees by exploiting the learning dynamics of the Restricted Boltzmann Machine (RBM). Our method is based on the mean-field approach, derived from the Plefka expansion, and developed in the context of disordered systems. It is designed to be easily interpretable. We tested our method on an artificially created hierarchical dataset and on three different real-world datasets (images of digits, mutations in the human genome, and a homologous family of proteins). The method is able to automatically identify the hierarchical structure of the data. This could be useful in the study of homologous protein sequences, where the relationships between proteins are critical for understanding their function and evolution.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    A Case Study for Compliance as Code with Graphs and Language Models: Public release of the Regulatory Knowledge Graph. (arXiv:2302.01842v1 [cs.AI])
    The paper presents a study on using language models to automate the construction of an executable Knowledge Graph (KG) for compliance. The paper focuses on Abu Dhabi Global Market regulations and taxonomy, and involves manually tagging a portion of the regulations and training BERT-based models, which are then applied to the rest of the corpus. Coreference resolution and syntax analysis were used to parse the relationships between the tagged entities and to form a KG stored in a Neo4j database. The paper states that the use of machine learning models released by regulators to automate the interpretation of rules is a vital step towards compliance automation, demonstrates the concept by querying with Cypher, and states that the produced sub-graphs combined with Graph Neural Networks (GNN) will achieve expandability in judgment automation systems. The graph is open sourced on GitHub to provide structured data for future advancements in the field.  ( 2 min )
    Online Ad Allocation with Predictions. (arXiv:2302.01827v1 [cs.LG])
    Display Ads and the generalized assignment problem are two well-studied online packing problems with important applications in ad allocation and other areas. In both problems, ad impressions arrive online and have to be allocated immediately to budget-constrained advertisers. Worst-case algorithms that achieve the ideal competitive ratio are known, but might act overly conservatively given the predictable and usually tame nature of real-world input. Given this discrepancy, we develop an algorithm for both problems that incorporates machine-learned predictions and can thus improve the performance beyond the worst case. Our algorithm is based on the work of Feldman et al. (2009) and is similar in nature to Mahdian et al. (2007), who were the first to develop a learning-augmented algorithm for the related but more structured Ad Words problem. We use a novel analysis to show that our algorithm is able to capitalize on a good prediction, while being robust against poor predictions. We experimentally evaluate our algorithm on synthetic and real-world data on a wide range of predictions. Our algorithm consistently outperforms the worst-case algorithm without predictions.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who observed that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarially corrupting $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners. (arXiv:2302.01877v1 [cs.LG])
    Diffusion models have demonstrated their powerful generative capability in many tasks, with great potential to serve as a paradigm for offline reinforcement learning. However, the quality of the diffusion model is limited by the insufficient diversity of training data, which hinders the performance of planning and the generalizability to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model, and hence become a better planner, not only for seen tasks but also for unseen tasks. AdaptDiffuser enables the generation of rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to finetune the diffusion model, which improves the generalization ability to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the prior art, Diffuser, by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.  ( 2 min )
    PINN Training using Biobjective Optimization: The Trade-off between Data Loss and Residual Loss. (arXiv:2302.01810v1 [cs.LG])
    Physics-informed neural networks (PINNs) have proven to be an efficient tool to represent problems for which measured data are available and for which the dynamics in the data are expected to follow some physical laws. In this paper, we suggest a multiobjective perspective on the training of PINNs by treating the data loss and the residual loss as two individual objective functions in a truly biobjective optimization approach. As a showcase example, we consider COVID-19 predictions in Germany and build an extended susceptibles-infected-recovered (SIR) model with additionally considered leaky-vaccinated and hospitalized populations (SVIHR model) to model the transition rates and to predict future infections. SIR-type models are expressed by systems of ordinary differential equations (ODEs). We investigate the suitability of the generated PINN for COVID-19 predictions and compare the resulting predicted curves with those obtained by applying the method of non-standard finite differences to the system of ODEs and initial data. The approach is applicable to various systems of ODEs that define dynamical regimes. Those regimes do not need to be SIR-type models, and the corresponding underlying data sets do not have to be associated with COVID-19.  ( 2 min )
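The trade-off between a data loss and a residual loss can be illustrated with a minimal sketch. It uses weighted-sum scalarization, which is a simpler stand-in and not necessarily the biobjective method of the paper. The two quadratic losses, the single parameter `theta`, and all function names are hypothetical toy choices made only for illustration.

```python
def data_loss(theta):
    # Toy stand-in for a data-fit loss, minimized at theta = 1.
    return (theta - 1.0) ** 2


def residual_loss(theta):
    # Toy stand-in for a physics-residual loss, minimized at theta = -1.
    return (theta + 1.0) ** 2


def minimize_weighted_sum(w, steps=200, lr=0.1):
    # Gradient descent on w * data_loss + (1 - w) * residual_loss.
    theta = 0.0
    for _ in range(steps):
        grad = w * 2.0 * (theta - 1.0) + (1.0 - w) * 2.0 * (theta + 1.0)
        theta -= lr * grad
    return theta


def trace_front(weights):
    # One (weight, data loss, residual loss) point per scalarization weight.
    front = []
    for w in weights:
        theta = minimize_weighted_sum(w)
        front.append((w, data_loss(theta), residual_loss(theta)))
    return front


front = trace_front([0.0, 0.25, 0.5, 0.75, 1.0])
```

Each entry of `front` is one compromise between the two objectives: raising the weight `w` lowers the toy data loss at the cost of a larger toy residual loss.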
    Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach. (arXiv:2302.01887v1 [cs.LG])
    Natural language processing (NLP) is a promising approach for analyzing large volumes of climate-change and infrastructure-related scientific literature. However, best-in-practice NLP techniques require large collections of relevant documents (corpus). Furthermore, NLP techniques using machine learning and deep learning techniques require labels grouping the articles based on user-defined criteria for a significant subset of a corpus in order to train the supervised model. Even labeling a few hundred documents with human subject-matter experts is a time-consuming process. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. In comparison with a months-long process of subject-matter expert labeling, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and NCF corpus enable targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method is useful for analysts and decision-makers to quickly grasp the relevant topics and most important documents linked to the topic.  ( 2 min )
    Target specific peptide design using latent space approximate trajectory collector. (arXiv:2302.01435v1 [cs.CE])
    Despite the prevalence and many successes of deep learning applications in de novo molecular design, the problem of peptide generation targeting specific proteins remains unsolved. A main barrier for this is the scarcity of the high-quality training data. To tackle the issue, we propose a novel machine learning based peptide design architecture, called Latent Space Approximate Trajectory Collector (LSATC). It consists of a series of samplers on an optimization trajectory on a highly non-convex energy landscape that approximates the distributions of peptides with desired properties in a latent space. The process involves little human intervention and can be implemented in an end-to-end manner. We demonstrate the model by the design of peptide extensions targeting Beta-catenin, a key nuclear effector protein involved in canonical Wnt signalling. When compared with a random sampler, LSATC can sample peptides with $36\%$ lower binding scores in a $16$ times smaller interquartile range (IQR) and $284\%$ less hydrophobicity with a $1.4$ times smaller IQR. LSATC also largely outperforms other common generative models. Finally, we utilized a clustering algorithm to select 4 peptides from the 100 LSATC designed peptides for experimental validation. The result confirms that all the four peptides extended by LSATC show improved Beta-catenin binding by at least $20.0\%$, and two of the peptides show a $3$ fold increase in binding affinity as compared to the base peptide.
    Machine Learning Extreme Acoustic Non-reciprocity in a Linear Waveguide with Multiple Nonlinear Asymmetric Gates. (arXiv:2302.01746v1 [eess.AS])
    This work is a study of acoustic non-reciprocity exhibited by a passive one-dimensional linear waveguide incorporating two local strongly nonlinear, asymmetric gates. Two local nonlinear gates break the symmetry and linearity of the waveguide, yielding strong global non-reciprocal acoustics, in the way that extremely different acoustical responses occur depending on the side of application of harmonic excitation. To the authors' best knowledge, the present two-gated waveguide is capable of extremely high acoustic non-reciprocity, at a much higher level than what is reported for active or passive devices in the current literature; moreover, this extreme performance combines with acceptable levels of transmissibility in the desired direction of wave propagation. Machine learning is utilized for predictive design of this gated waveguide in terms of the measures of transmissibility and non-reciprocity, with the aim of reducing the required computational time for high-dimensional parameter space analysis. The study sheds new light on the physics of these media and considers the advantages and limitations of using neural networks to analyze this type of physical problem. In the predicted desirable parameter space for intense non-reciprocity, the maximum transmissibility reaches as much as 40%, and the transmitted energy from upstream to downstream varies up to nine orders of magnitude, depending on the direction of wave transmission. The machine learning tools along with the numerical methods of this work can inform predictive designs of practical non-reciprocal waveguides and acoustic metamaterials that incorporate local nonlinear gates. The current paper shows that combinations of nonlinear gates can lead to extremely high non-reciprocity while maintaining desired levels of transmissibility.  ( 2 min )
    AIROGS: Artificial Intelligence for RObust Glaucoma Screening Challenge. (arXiv:2302.01738v1 [eess.IV])
    The early detection of glaucoma is essential in preventing visual impairment. Artificial intelligence (AI) can be used to analyze color fundus photographs (CFPs) in a cost-effective manner, making glaucoma screening more accessible. While AI models for glaucoma screening from CFPs have shown promising results in laboratory settings, their performance decreases significantly in real-world scenarios due to the presence of out-of-distribution and low-quality images. To address this issue, we propose the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge. This challenge includes a large dataset of around 113,000 images from about 60,000 patients and 500 different screening centers, and encourages the development of algorithms that are robust to ungradable and unexpected input data. We evaluated solutions from 14 teams in this paper, and found that the best teams performed similarly to a set of 20 expert ophthalmologists and optometrists. The highest-scoring team achieved an area under the receiver operating characteristic curve of 0.99 (95% CI: 0.98-0.99) for detecting ungradable images on-the-fly. Additionally, many of the algorithms showed robust performance when tested on three other publicly available datasets. These results demonstrate the feasibility of robust AI-enabled glaucoma screening.  ( 2 min )
    Transformers in Action Recognition: A Review on Temporal Modeling. (arXiv:2302.01921v1 [cs.CV])
    In vision-based action recognition, spatio-temporal features from different modalities are used for recognizing activities. Temporal modeling is a long-standing challenge of action recognition. However, there are limited methods such as pre-computed motion features, three-dimensional (3D) filters, and recurrent neural networks (RNN) for modeling motion information in deep-based approaches. Recently, transformers' success in modeling long-range dependencies in natural language processing (NLP) tasks has gotten great attention from other domains, including speech, image, and video, which rely entirely on self-attention without using sequence-aligned RNNs or convolutions. Although the application of transformers to action recognition is relatively new, the amount of research proposed on this topic within the last few years is astounding. This paper especially reviews recent progress in deep learning methods for modeling temporal variations. It focuses on action recognition methods that use transformers for temporal modeling, discussing their main features and used modalities, and identifying opportunities and challenges for future research.
    Leveraging weak complementary labels to improve semantic segmentation of hepatocellular carcinoma and cholangiocarcinoma in H&E-stained slides. (arXiv:2302.01813v1 [cs.CV])
    In this paper, we present a deep learning segmentation approach to classify and quantify the two most prevalent primary liver cancers - hepatocellular carcinoma and intrahepatic cholangiocarcinoma - from hematoxylin and eosin (H&E) stained whole slide images. While semantic segmentation of medical images typically requires costly pixel-level annotations by domain experts, there often exists additional information which is routinely obtained in clinical diagnostics but rarely utilized for model training. We propose to leverage such weak information from patient diagnoses by deriving complementary labels that indicate to which class a sample cannot belong. To integrate these labels, we formulate a complementary loss for segmentation. Motivated by the medical application, we demonstrate for general segmentation tasks that including additional patches with solely weak complementary labels during model training can significantly improve the predictive performance and robustness of a model. On the task of diagnostic differentiation between hepatocellular carcinoma and intrahepatic cholangiocarcinoma, we achieve a balanced accuracy of 0.91 (CI 95%: 0.86 - 0.95) at case level for 165 hold-out patients. Furthermore, we also show that leveraging complementary labels improves the robustness of segmentation and increases performance at case level.  ( 2 min )
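A complementary label says which class a sample cannot belong to. The abstract does not give the paper's loss, so the following is only a sketch of one commonly used form, which penalizes probability mass placed on the excluded class. The function names and the `eps` clamp are illustrative assumptions.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of per-class scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def complementary_loss(logits, excluded_class, eps=1e-12):
    # -log(1 - p_excluded): small when the model assigns little
    # probability to the class the sample is known NOT to belong to.
    probs = softmax(logits)
    return -math.log(max(1.0 - probs[excluded_class], eps))
```

For a segmentation model, this per-sample value would be computed per pixel and averaged over a patch that carries only the weak complementary label.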
    Coinductive guide to inductive transformer heads. (arXiv:2302.01834v1 [cs.LG])
    We argue that all building blocks of transformer models can be expressed with a single concept: combinatorial Hopf algebra. Transformer learning emerges as a result of the subtle interplay between the algebraic and coalgebraic operations of the combinatorial Hopf algebra. Viewed through this lens, the transformer model becomes a linear time-invariant system where the attention mechanism computes a generalized convolution transform and the residual stream serves as a unit impulse. Attention-only transformers then learn by enforcing an invariant between these two paths. We call this invariant Hopf coherence. Due to this, with a degree of poetic license, one could call combinatorial Hopf algebras "tensors with a built-in loss function gradient". This loss function gradient occurs within the single layers and no backward pass is needed. This is in contrast to automatic differentiation, which happens across the whole graph and needs an explicit backward pass. This property is the result of the fact that combinatorial Hopf algebras have the surprising property of calculating eigenvalues by repeated squaring.  ( 2 min )
    Learning finite difference methods for reaction-diffusion type equations with FCNN. (arXiv:2201.01854v2 [cs.LG] UPDATED)
    In recent years, physics-informed neural networks (PINNs) have been widely used to solve partial differential equations alongside numerical methods because PINNs can be trained without observations and deal with continuous-time problems directly. In contrast, optimizing the parameters of such models is difficult, and individual training sessions must be performed to predict the evolution of each different initial condition. To alleviate the first problem, observed data can be injected directly into the loss function. To solve the second problem, a network architecture can be built as a framework to learn a finite difference method. In view of the two motivations, we propose Five-point stencil CNNs (FCNNs) containing a five-point stencil kernel and a trainable approximation function for reaction-diffusion type equations including the heat, Fisher's, Allen-Cahn, and other reaction-diffusion equations with trigonometric function terms. We show that FCNNs can learn finite difference schemes using few data and achieve low relative errors on diverse reaction-diffusion evolutions with unseen initial conditions. Furthermore, we demonstrate that FCNNs can still be trained well even when using noisy data.
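For orientation, the classical hand-coded scheme built on a five-point stencil can be sketched as follows. This is the standard explicit finite difference update for the 2D heat equation, not the FCNN itself. The grid layout, the fixed boundary values, and the combined coefficient `r` (diffusivity times time step divided by grid spacing squared) are assumptions made for illustration.

```python
def five_point_laplacian(u, i, j):
    # Discrete Laplacian from the four axis-aligned neighbours and the centre.
    return u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1] - 4.0 * u[i][j]


def heat_step(u, r):
    # One explicit time step; boundary cells are left unchanged.
    n_rows, n_cols = len(u), len(u[0])
    new_u = [row[:] for row in u]
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            new_u[i][j] = u[i][j] + r * five_point_laplacian(u, i, j)
    return new_u


# A single hot cell in the middle of a cold 5 x 5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0
next_grid = heat_step(grid, r=0.1)
```

A reaction-diffusion equation would add a reaction term to the update; in the FCNN setting described above, that term is what the trainable approximation function is meant to capture.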
    From slides (through tiles) to pixels: an explainability framework for weakly supervised models in pre-clinical pathology. (arXiv:2302.01653v1 [cs.CV])
    In pre-clinical pathology, there is a paradox between the abundance of raw data (whole slide images from many organs of many individual animals) and the lack of pixel-level slide annotations done by pathologists. Due to time constraints and requirements from regulatory authorities, diagnoses are instead stored as slide labels. Weakly supervised training is designed to take advantage of those data, and the trained models can be used by pathologists to rank slides by their probability of containing a given lesion of interest. In this work, we propose a novel contextualized eXplainable AI (XAI) framework and its application to deep learning models trained on Whole Slide Images (WSIs) in Digital Pathology. Specifically, we apply our methods to a multi-instance-learning (MIL) model, which is trained solely on slide-level labels, without the need for pixel-level annotations. We quantitatively validate our methods by measuring the agreement of our explanations' heatmaps with pathologists' annotations, as well as with predictions from a segmentation model trained on such annotations. We demonstrate the stability of the explanations with respect to input shifts, and the fidelity with respect to increased model performance. We quantitatively evaluate the correlation between available pixel-wise annotations and explainability heatmaps. We show that the explanations on important tiles of the whole slide correlate with tissue changes between healthy regions and lesions, but do not exactly behave like a human annotator. This result is coherent with the model training strategy.  ( 2 min )
    Fixing by Mixing: A Recipe for Optimal Byzantine ML under Heterogeneity. (arXiv:2302.01772v1 [cs.LG])
    Byzantine machine learning (ML) aims to ensure the resilience of distributed learning algorithms to misbehaving (or Byzantine) machines. Although this problem has received significant attention, prior works often assume the data held by the machines to be homogeneous, which is seldom true in practical settings. Data heterogeneity makes Byzantine ML considerably more challenging, since a Byzantine machine can hardly be distinguished from a non-Byzantine outlier. A few solutions have been proposed to tackle this issue, but these provide suboptimal probabilistic guarantees and fare poorly in practice. This paper closes the theoretical gap, achieving optimality and inducing good empirical results. In fact, we show how to automatically adapt existing solutions for (homogeneous) Byzantine ML to the heterogeneous setting through a powerful mechanism we call nearest neighbor mixing (NNM), which boosts any standard robust distributed gradient descent variant to yield optimal Byzantine resilience under heterogeneity. We obtain similar guarantees (in expectation) by plugging NNM into the distributed stochastic heavy ball method, a practical substitute for distributed gradient descent. We obtain empirical results that significantly outperform state-of-the-art Byzantine ML solutions.  ( 2 min )
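The abstract names nearest neighbor mixing without defining it. The sketch below shows one plausible reading, stated here as an assumption: each submitted gradient is replaced by the average of its n - f nearest neighbours (itself included), where f is the assumed number of Byzantine machines, and a standard robust aggregator is applied afterwards. The coordinate-wise median is used here only as an example of such an aggregator.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_mixing(vectors, f):
    # Replace each vector by the mean of its n - f nearest vectors
    # (assumed reading of NNM; f = assumed number of Byzantine machines).
    k = len(vectors) - f
    mixed = []
    for v in vectors:
        neighbors = sorted(vectors, key=lambda other: squared_distance(v, other))[:k]
        mixed.append([sum(nb[d] for nb in neighbors) / k for d in range(len(v))])
    return mixed


def coordinate_wise_median(vectors):
    # A standard robust aggregator, applied after the mixing step.
    result = []
    for d in range(len(vectors[0])):
        column = sorted(v[d] for v in vectors)
        mid = len(column) // 2
        if len(column) % 2:
            result.append(column[mid])
        else:
            result.append((column[mid - 1] + column[mid]) / 2.0)
    return result


# Four honest gradients near (1, 1) and one Byzantine outlier.
gradients = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.0], [100.0, 100.0]]
aggregate = coordinate_wise_median(nearest_neighbor_mixing(gradients, f=1))
```

In this toy run the outlier is pulled toward the honest cluster by the mixing step, and the aggregate stays close to the honest gradients.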
    Avalanche: A PyTorch Library for Deep Continual Learning. (arXiv:2302.01766v1 [cs.LG])
    Continual learning is the problem of learning from a nonstationary stream of data, a fundamental issue for sustainable and efficient training of deep neural networks over time. Unfortunately, deep learning libraries only provide primitives for offline training, assuming that the model's architecture and data are fixed. Avalanche is an open source library maintained by the ContinualAI non-profit organization that extends PyTorch by providing first-class support for dynamic architectures, streams of datasets, and incremental training and evaluation methods. Avalanche provides a large set of predefined benchmarks and training algorithms, and it is modular and easy to extend while supporting a wide range of continual learning scenarios. Documentation is available at https://avalanche.continualai.org.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
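The prediction side of randomized smoothing with deletions can be sketched as a majority vote over randomly shortened copies of the input. This is only an illustration under stated assumptions: the deletion probability, the sample count, and the toy classifier are hypothetical, the real base model in the paper is MalConv, and the edit-distance certificate itself is not computed here.

```python
import random


def delete_bytes(data, p_del, rng):
    # Independently drop each byte with probability p_del.
    return bytes(b for b in data if rng.random() >= p_del)


def smoothed_predict(classifier, data, p_del=0.3, n_samples=101, seed=0):
    # Majority vote of the base classifier over randomly deleted copies.
    rng = random.Random(seed)
    votes = sum(classifier(delete_bytes(data, p_del, rng)) for _ in range(n_samples))
    return 1 if 2 * votes > n_samples else 0


def toy_classifier(data):
    # Hypothetical stand-in for a detector: flags inputs that are mostly 0xFF.
    if not data:
        return 0
    return 1 if data.count(0xFF) * 2 > len(data) else 0
```

For example, `smoothed_predict(toy_classifier, b"\xff" * 64)` votes over 101 perturbed copies of the same 64-byte input; a small edit to the input changes only a few bytes in each copy, which is the intuition behind certifying robustness in edit distance.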
    Mind the Gap: Offline Policy Optimization for Imperfect Rewards. (arXiv:2302.01667v1 [cs.LG])
    The reward function is essential in reinforcement learning (RL), serving as the guiding signal that incentivizes agents to solve given tasks; however, it is also notoriously difficult to design. In many cases, only imperfect rewards are available, which inflicts substantial performance loss on RL agents. In this study, we propose a unified offline policy optimization approach, RGM (Reward Gap Minimization), which can smartly handle diverse types of imperfect rewards. RGM is formulated as a bi-level optimization problem: the upper layer optimizes a reward correction term that performs visitation distribution matching w.r.t. some expert data; the lower layer solves a pessimistic RL problem with the corrected rewards. By exploiting the duality of the lower layer, we derive a tractable algorithm that enables sample-based learning without any online interactions. Comprehensive experiments demonstrate that RGM achieves superior performance to existing methods under diverse settings of imperfect rewards. Further, RGM can effectively correct wrong or inconsistent rewards against expert preference and retrieve useful information from biased rewards.  ( 2 min )
    BackdoorBox: A Python Toolbox for Backdoor Learning. (arXiv:2302.01762v1 [cs.CR])
    Third-party resources (e.g., samples, backbones, and pre-trained models) are usually involved in the training of deep neural networks (DNNs), which brings backdoor attacks as a new training-phase threat. In general, backdoor attackers intend to implant hidden backdoors in DNNs, so that the attacked DNNs behave normally on benign samples whereas their predictions will be maliciously changed to a pre-defined target label if the hidden backdoors are activated by attacker-specified trigger patterns. To facilitate the research and development of more secure training schemes and defenses, we design an open-sourced Python toolbox that implements representative and advanced backdoor attacks and defenses under a unified and flexible framework. Our toolbox has four important and promising characteristics: consistency, simplicity, flexibility, and co-development. It allows researchers and developers to easily implement and compare different methods on benchmark or their local datasets. This Python toolbox, namely BackdoorBox, is available at https://github.com/THUYimingLi/BackdoorBox.  ( 2 min )
    Creating Probabilistic Forecasts from Arbitrary Deterministic Forecasts using Conditional Invertible Neural Networks. (arXiv:2302.01800v1 [cs.LG])
    In various applications, probabilistic forecasts are required to quantify the inherent uncertainty associated with the forecast. However, numerous modern forecasting methods are still designed to create deterministic forecasts. Transforming these deterministic forecasts into probabilistic forecasts is often challenging and based on numerous assumptions that may not hold in real-world situations. Therefore, the present article proposes a novel approach for creating probabilistic forecasts from arbitrary deterministic forecasts. In order to implement this approach, we use a conditional Invertible Neural Network (cINN). More specifically, we apply a cINN to learn the underlying distribution of the data and then combine the uncertainty from this distribution with an arbitrary deterministic forecast to generate accurate probabilistic forecasts. Our approach enables the simple creation of probabilistic forecasts without complicated statistical loss functions or further assumptions. Besides showing the mathematical validity of our approach, we empirically show that our approach noticeably outperforms traditional methods for including uncertainty in deterministic forecasts and generally outperforms state-of-the-art probabilistic forecasting benchmarks.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.
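The subtraction described in the DEUP abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `deup_epistemic_uncertainty` is hypothetical, the inputs are assumed to be per-example outputs of a separately trained error predictor and aleatoric estimator, and clipping at zero is an added assumption so that the estimate stays non-negative.

```python
import numpy as np

def deup_epistemic_uncertainty(predicted_error, aleatoric_estimate):
    """Excess-risk style estimate: predicted generalization error minus aleatoric part.

    Both arguments are array-likes of per-example values (hypothetical inputs,
    assumed to come from a secondary error predictor and an aleatoric estimator).
    """
    predicted_error = np.asarray(predicted_error, dtype=float)
    aleatoric_estimate = np.asarray(aleatoric_estimate, dtype=float)
    # Clipping at zero is an assumption of this sketch, not stated in the abstract.
    return np.clip(predicted_error - aleatoric_estimate, 0.0, None)

if __name__ == "__main__":
    # Two examples: the second has more estimated noise than predicted error.
    print(deup_epistemic_uncertainty([0.5, 0.2], [0.1, 0.3]))
```

In an interactive setting such as the one the abstract mentions, a score like this could be used to rank candidate examples to acquire next.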
    Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers. (arXiv:2302.01925v1 [cs.LG])
    We propose a new class of linear Transformers called FourierLearner-Transformers (FLTs), which incorporate a wide range of relative positional encoding mechanisms (RPEs). These include regular RPE techniques applied for nongeometric data, as well as novel RPEs operating on the sequences of tokens embedded in higher-dimensional Euclidean spaces (e.g. point clouds). FLTs construct the optimal RPE mechanism implicitly by learning its spectral representation. As opposed to other architectures combining efficient low-rank linear attention with RPEs, FLTs remain practical in terms of their memory usage and do not require additional assumptions about the structure of the RPE-mask. FLTs allow also for applying certain structural inductive bias techniques to specify masking strategies, e.g. they provide a way to learn the so-called local RPEs introduced in this paper and providing accuracy gains as compared with several other linear Transformers for language modeling. We also thoroughly tested FLTs on other data modalities and tasks, such as: image classification and 3D molecular modeling. For 3D-data FLTs are, to the best of our knowledge, the first Transformers architectures providing RPE-enhanced linear attention.
    Show me your NFT and I tell you how it will perform: Multimodal representation learning for NFT selling price prediction. (arXiv:2302.01676v1 [cs.LG])
    Non-Fungible Tokens (NFTs) represent deeds of ownership, based on blockchain technologies and smart contracts, of unique crypto assets on digital art forms (e.g., artworks or collectibles). In the spotlight after skyrocketing in 2021, NFTs have attracted the attention of crypto enthusiasts and investors intent on placing promising investments in this profitable market. However, the NFT financial performance prediction has not been widely explored to date. In this work, we address the above problem based on the hypothesis that NFT images and their textual descriptions are essential proxies to predict the NFT selling prices. To this purpose, we propose MERLIN, a novel multimodal deep learning framework designed to train Transformer-based language and visual models, along with graph neural network models, on collections of NFTs' images and texts. A key aspect in MERLIN is its independence on financial features, as it exploits only the primary data a user interested in NFT trading would like to deal with, i.e., NFT images and textual descriptions. By learning dense representations of such data, a price-category classification task is performed by MERLIN models, which can also be tuned according to user preferences in the inference phase to mimic different risk-return investment profiles. Experimental evaluation on a publicly available dataset has shown that MERLIN models achieve significant performances according to several financial assessment criteria, fostering profitable investments, and also beating baseline machine-learning classifiers based on financial features.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artificial intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnosis and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.
    Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies. (arXiv:2302.01734v1 [cs.LG])
    Recently, the impressive empirical success of policy gradient (PG) methods has catalyzed the development of their theoretical foundations. Despite the huge efforts directed at the design of efficient stochastic PG-type algorithms, the understanding of their convergence to a globally optimal policy is still limited. In this work, we develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies, which allows us to address the case of continuous state-action spaces. First, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method for finding a global $\varepsilon$-optimal policy. Improving over the previously known $\tilde{\mathcal{O}}(\varepsilon^{-3})$ complexity, this algorithm does not require the use of importance sampling or second-order information and samples only one trajectory per iteration. Second, we further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient ((N)-HARPG) algorithm enhanced with a correction based on a Hessian-vector product. Interestingly, both algorithms are $(i)$ simple and easy to implement: they are single-loop, do not require large batches of trajectories, and sample at most two trajectories per iteration; $(ii)$ computationally and memory efficient: they do not require expensive subroutines at each iteration and can be implemented with memory linear in the dimension of parameters.
    Reinforcing User Retention in a Billion Scale Short Video Recommender System. (arXiv:2302.01724v1 [cs.LG])
    Recently, short video platforms have achieved rapid user growth by recommending interesting content to users. The objective of the recommendation is to optimize user retention, thereby driving the growth of DAU (Daily Active Users). Retention is a long-term feedback after multiple interactions of users and the system, and it is hard to decompose retention reward to each item or a list of items. Thus traditional point-wise and list-wise models are not able to optimize retention. In this paper, we choose reinforcement learning methods to optimize the retention as they are designed to maximize the long-term performance. We formulate the problem as an infinite-horizon request-based Markov Decision Process, and our objective is to minimize the accumulated time interval of multiple sessions, which is equal to improving the app open frequency and user retention. However, current reinforcement learning algorithms can not be directly applied in this setting due to uncertainty, bias, and long delay time incurred by the properties of user retention. We propose a novel method, dubbed RLUR, to address the aforementioned challenges. Both offline and live experiments show that RLUR can significantly improve user retention. RLUR has been fully launched in Kuaishou app for a long time, and achieves consistent performance improvement on user retention and DAU.  ( 2 min )
    Distributional constrained reinforcement learning for supply chain optimization. (arXiv:2302.01727v1 [cs.LG])
    This work studies reinforcement learning (RL) in the context of multi-period supply chains subject to constraints, e.g., on production and inventory. We introduce Distributional Constrained Policy Optimization (DCPO), a novel approach for reliable constraint satisfaction in RL. Our approach is based on Constrained Policy Optimization (CPO), which is subject to approximation errors that in practice lead it to converge to infeasible policies. We address this issue by incorporating aspects of distributional RL into DCPO. Specifically, we represent the return and cost value functions using neural networks that output discrete distributions, and we reshape costs based on the associated confidence. Using a supply chain case study, we show that DCPO improves the rate at which the RL policy converges and ensures reliable constraint satisfaction by the end of training. The proposed method also improves predictability, greatly reducing the variance of returns between runs; this result is significant in the context of policy gradient methods, which intrinsically introduce significant variance during training.  ( 2 min )
    Interpretations of Domain Adaptations via Layer Variational Analysis. (arXiv:2302.01798v1 [cs.LG])
    Transfer learning is known empirically to perform efficiently in many applications, yet little of the literature reports the mechanism behind the scenes. This study establishes both formal derivations and heuristic analysis to formulate the theory of transfer learning in deep learning. Our framework utilizing layer variational analysis proves that the success of transfer learning can be guaranteed with corresponding data conditions. Moreover, our theoretical calculation yields intuitive interpretations of the knowledge transfer process. Subsequently, an alternative method for network-based transfer learning is derived. The method shows an increase in efficiency and accuracy for domain adaptation. It is particularly advantageous when new domain data is sufficiently sparse during adaptation. Numerical experiments over diverse tasks validated our theory and verified that our analytic expression achieved better performance in domain adaptation than the gradient descent method.  ( 2 min )
    Command Line Interface Risk Modeling. (arXiv:2302.01749v1 [cs.CR])
    Protecting sensitive data is an essential part of security in cloud computing. However, only specific privileged individuals have access to view or interact with this data; therefore, it is unscalable to depend on these individuals also to maintain the software. A solution to this is to allow non-privileged individuals access to maintain these systems but mask sensitive information from egressing. To this end, we have created a machine-learning model to predict and redact fields with sensitive data. This work concentrates on Azure PowerShell, showing how it applies to other command-line interfaces and APIs. Using the F5-score as a weighted metric, we demonstrate different transformation techniques to map this problem from an unknown field to the well-researched area of natural language processing.  ( 2 min )
    Motion ID: Human Authentication Approach. (arXiv:2302.01751v1 [cs.CR])
    We introduce a novel approach to user authentication called Motion ID. The method employs motion sensing provided by inertial measurement units (IMUs), using it to verify the person's identity via short time series of IMU data captured by the mobile device. The paper presents two labeled datasets with unlock events: the first features IMU measurements provided by six users who continuously collected data on six different smartphones for a period of 12 weeks. The second one contains 50 hours of IMU data for one specific motion pattern, provided by 101 users. Moreover, we present a two-stage user authentication process that employs motion pattern identification and user verification and is based on data preprocessing and machine learning. The Results section details the assessment of the method proposed, comparing it with existing biometric authentication methods and the Android biometric standard. The method has demonstrated high accuracy, indicating that it could be successfully used in combination with existing methods. Furthermore, the method exhibits significant promise as a standalone solution. We provide the datasets to the scholarly community and share our project code.  ( 2 min )
    Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial Networks. (arXiv:2302.01722v1 [cs.LG])
    Generative adversarial networks (GANs) are known for their strong ability to capture the underlying distribution of training instances. Since the seminal work on GANs, many variants have been proposed. However, almost all existing GANs are built on the assumption that the training dataset is clean. In many real-world applications this may not hold; that is, the training dataset may be contaminated by a proportion of undesired instances. When training on such datasets, existing GANs will learn a mixture distribution of desired and contaminated instances, rather than the distribution of the desired data only (the target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset solely composed of contamination instances. We prove that under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that the proposed PuriGANs are able to generate much better images from the desired distribution than comparable baselines when trained on contaminated datasets. In addition, we demonstrate the usefulness of PuriGAN on downstream applications by applying it to the tasks of semi-supervised anomaly detection on contaminated datasets and PU-learning. Experimental results show that PuriGAN is able to deliver the best performance over comparable baselines on both tasks.  ( 2 min )
    Better Training of GFlowNets with Local Credit and Incomplete Trajectories. (arXiv:2302.01687v1 [cs.LG])
    Generative Flow Networks or GFlowNets are related to Monte-Carlo Markov chain methods (as they sample from a distribution specified by an energy function), reinforcement learning (as they learn a policy to sample composed objects through a sequence of steps), generative models (as they learn to represent and sample from a distribution) and amortized variational methods (as they can be used to learn to approximate and sample from an otherwise intractable posterior, given a prior and a likelihood). They are trained to generate an object $x$ through a sequence of steps with probability proportional to some reward function $R(x)$ (or $\exp(-\mathcal{E}(x))$ with $\mathcal{E}(x)$ denoting the energy function), given at the end of the generative trajectory. Like for other RL settings where the reward is only given at the end, the efficiency of training and credit assignment may suffer when those trajectories are longer. With previous GFlowNet work, no learning was possible from incomplete trajectories (lacking a terminal state and the computation of the associated reward). In this paper, we consider the case where the energy function can be applied not just to terminal states but also to intermediate states. This is for example achieved when the energy function is additive, with terms available along the trajectory. We show how to reparameterize the GFlowNet state flow function to take advantage of the partial reward already accrued at each state. This enables a training objective that can be applied to update parameters even with incomplete trajectories. Even when complete trajectories are available, being able to obtain more localized credit and gradients is found to speed up training convergence, as demonstrated across many simulations.  ( 2 min )
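The relationship between energy, reward, and sampling probability in the GFlowNet abstract above can be illustrated on a small finite set of objects. This is a sketch under stated assumptions, not the paper's training objective: the function names `target_distribution` and `partial_energies` are hypothetical, the object set is assumed to be small enough to enumerate, and the per-step energy terms are made-up inputs standing in for an additive energy function.

```python
import numpy as np

def target_distribution(energies):
    """Probabilities proportional to R(x) = exp(-E(x)) over an enumerable set."""
    energies = np.asarray(energies, dtype=float)
    # Subtracting the minimum energy leaves the normalized result unchanged
    # and avoids overflow in the exponential.
    rewards = np.exp(-(energies - energies.min()))
    return rewards / rewards.sum()

def partial_energies(step_terms):
    """Energy accrued after each step, assuming an additive energy function."""
    return np.cumsum(np.asarray(step_terms, dtype=float))

if __name__ == "__main__":
    print(target_distribution([0.0, 1.0, 2.0]))  # lower energy, higher probability
    print(partial_energies([0.5, 0.25, 0.25]))   # running energy along one trajectory
```

The running totals from `partial_energies` correspond to the partial reward "already accrued at each state" that the abstract says the reparameterized state flow can exploit.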
    Learning End-to-End Channel Coding with Diffusion Models. (arXiv:2302.01714v1 [cs.IT])
    It is a known problem that deep-learning-based end-to-end (E2E) channel coding systems depend on a known and differentiable channel model, because the learning process relies on gradient-descent optimization methods. This poses the challenge of approximating or generating the channel or its derivative from samples generated by pilot signaling in real-world scenarios. Currently, there are two prevalent methods to solve this problem. One is to generate the channel via a generative adversarial network (GAN), and the other is to, in essence, approximate the gradient via reinforcement learning methods. Other methods include using score-based methods, variational autoencoders, or mutual-information-based methods. In this paper, we focus on generative models and, in particular, on a promising new method called diffusion models, which have shown a higher quality of generation in image-based tasks. We will show that diffusion models can be used in wireless E2E scenarios and that they work as well as Wasserstein GANs while having a more stable training procedure and a better generalization ability in testing.  ( 2 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective. (arXiv:2302.01735v1 [cs.CV])
    For medical image segmentation, contrastive learning is the dominant practice to improve the quality of visual representations by contrasting semantically similar and dissimilar pairs of samples. This is enabled by the observation that without accessing ground truth label, negative examples with truly dissimilar anatomical features, if sampled, can significantly improve the performance. In reality, however, these samples may come from similar anatomical features and the models may struggle to distinguish the minority tail-class samples, making the tail classes more prone to misclassification, both of which typically lead to model collapse. In this paper, we propose ARCO, a semi-supervised contrastive learning (CL) framework with stratified group sampling theory in medical image segmentation. In particular, we first propose building ARCO through the concept of variance-reduced estimation, and show that certain variance-reduction techniques are particularly beneficial in medical image segmentation tasks with extremely limited labels. Furthermore, we theoretically prove these sampling techniques are universal in variance reduction. Finally, we experimentally validate our approaches on three benchmark datasets with different label settings, and our methods consistently outperform state-of-the-art semi- and fully-supervised methods. Additionally, we augment the CL frameworks with these sampling techniques and demonstrate significant gains over previous methods. We believe our work is an important step towards semi-supervised medical image segmentation by quantifying the limitation of current self-supervision objectives for accomplishing medical image analysis tasks.  ( 2 min )
    A Systematic Evaluation of Backdoor Trigger Characteristics in Image Classification. (arXiv:2302.01740v1 [cs.CV])
    Deep learning achieves outstanding results in many machine learning tasks. Nevertheless, it is vulnerable to backdoor attacks that modify the training set to embed a secret functionality in the trained model. The modified training samples have a secret property, i.e., a trigger. At inference time, the secret functionality is activated when the input contains the trigger, while the model functions correctly in other cases. While there are many known backdoor attacks (and defenses), deploying a stealthy attack is still far from trivial. Successfully creating backdoor triggers heavily depends on numerous parameters. Unfortunately, research has not yet determined which parameters contribute most to the attack performance. This paper systematically analyzes the most relevant parameters for the backdoor attacks, i.e., trigger size, position, color, and poisoning rate. Using transfer learning, which is very common in computer vision, we evaluate the attack on numerous state-of-the-art models (ResNet, VGG, AlexNet, and GoogLeNet) and datasets (MNIST, CIFAR10, and TinyImageNet). Our attacks cover the majority of backdoor settings in research, providing concrete directions for future works. Our code is publicly available to facilitate the reproducibility of our results.  ( 2 min )
    Improving the Timing Resolution of Positron Emission Tomography Detectors using Boosted Learning -- A Residual Physics Approach. (arXiv:2302.01681v1 [cs.LG])
    Artificial intelligence is finding its way into medical imaging, usually focusing on image reconstruction or enhancing analytically reconstructed images. However, optimizations along the complete processing chain, from detecting signals to computing data, enable significant improvements. Thus, we present an approach toward detector optimization using boosted learning by exploiting the concept of residual physics. In our work, we improve the coincidence time resolution (CTR) of positron emission tomography (PET) detectors. PET enables imaging of metabolic processes by detecting γ-photons with scintillation detectors. Current research exploits light-sharing detectors, where the scintillation light is distributed over and digitized by an array of readout channels. While these detectors demonstrate excellent performance parameters, e.g., regarding spatial resolution, extracting precise timing information for time-of-flight (TOF) becomes more challenging due to deteriorating effects called time skews. Conventional correction methods mainly rely on analytical formulations, theoretically capable of covering all time skew effects, e.g., caused by signal runtimes or physical effects. However, additional effects are involved for light-sharing detectors, so finding suitable analytical formulations can become arbitrarily complicated. The residual physics-based strategy uses gradient tree boosting (GTB) and a physics-informed data generation mimicking an actual imaging process by shifting a radiation source. We used clinically relevant detectors with a height of 19 mm, coupled to digital photosensor arrays. All trained models improved the CTR significantly. Using the best model, we achieved CTRs down to 198 ps (185 ps) for energies ranging from 300 keV to 700 keV (450 keV to 550 keV).  ( 2 min )
    Structure-informed Language Models Are Protein Designers. (arXiv:2302.01649v1 [cs.LG])
    This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows them with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that our approach outperforms the state-of-the-art methods by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65% and 56.63% on CATH 4.2 and 4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).  ( 2 min )
    Two-Stage Constrained Actor-Critic for Short Video Recommendation. (arXiv:2302.01680v1 [cs.LG])
    The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platforms aim at optimizing the users' cumulative watch time (main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also need to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as like, follow, and share. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms do not work well in this setting. We propose a novel two-stage constrained actor-critic method: At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.  ( 2 min )
    GTV: Generating Tabular Data via Vertical Federated Learning. (arXiv:2302.01706v1 [cs.LG])
    Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the presumption of directly accessible training data. Vertical Federated Learning (VFL) is a paradigm that allows machine learning models to be trained in a distributed manner with clients possessing unique features pertaining to the same individuals, where tabular data learning is the primary use case. However, it is unknown whether tabular GANs can be learned in VFL. The demand for secure data transfer among clients and the GAN during training and data synthesis poses an extra challenge. The conditional vector for tabular GANs is a valuable tool to control specific features of generated data, but it contains sensitive information from real data, putting privacy guarantees at risk. In this paper, we propose GTV, a VFL framework for tabular GANs, whose key components are the generator, the discriminator, and the conditional vector. GTV proposes a unique distributed training architecture for the generator and discriminator to access training data in a privacy-preserving manner. To accommodate the conditional vector in training without privacy leakage, GTV designs a mechanism, training-with-shuffling, to ensure that no party can reconstruct training data with the conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality and overall training scalability. Results show that GTV can consistently generate high-fidelity synthetic tabular data of comparable quality to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients and different numbers of clients.  ( 2 min )
    Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks. (arXiv:2302.01677v1 [cs.LG])
    In this work, besides improving prediction accuracy, we study whether personalization could bring robustness benefits to backdoor attacks. We conduct the first study of backdoor attacks in the pFL framework, testing 4 widely used backdoor attacks against 6 pFL methods on benchmark datasets FEMNIST and CIFAR-10, a total of 600 experiments. The study shows that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks. In contrast, pFL methods with full model-sharing do not show robustness. To analyze the reasons for varying robustness performances, we provide comprehensive ablation studies on different pFL methods. Based on our findings, we further propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks. We believe that our work could provide both guidance for pFL application in terms of its robustness and offer valuable insights to design more robust FL methods in the future.  ( 2 min )
    Private, fair and accurate: Training large-scale, privacy-preserving AI models in radiology. (arXiv:2302.01622v1 [eess.IV])
    Artificial intelligence (AI) models are increasingly used in the medical domain. However, as medical data is highly sensitive, special precautions to ensure the protection of said data are required. The gold standard for privacy preservation is the introduction of differential privacy (DP) to model training. However, prior work has shown that DP has negative implications on model accuracy and fairness. Therefore, the purpose of this study is to demonstrate that the privacy-preserving training of AI models for chest radiograph diagnosis is possible with high accuracy and fairness compared to non-private training. N=193,311 high quality clinical chest radiographs were retrospectively collected and manually labeled by experienced radiologists, who assigned one or more of the following diagnoses: cardiomegaly, congestion, pleural effusion, pneumonic infiltration and atelectasis, to each side (where applicable). The non-private AI models were compared with privacy-preserving (DP) models with respect to privacy-utility trade-offs (measured as area under the receiver-operator-characteristic curve (AUROC)), and privacy-fairness trade-offs (measured as Pearson-R or Statistical Parity Difference). The non-private AI model achieved an average AUROC score of 0.90 over all labels, whereas the DP AI model with a privacy budget of epsilon=7.89 resulted in an AUROC of 0.87, i.e., a mere 2.6% performance decrease compared to non-private training. The privacy-preserving training of diagnostic AI models can achieve high performance with a small penalty on model accuracy and does not amplify discrimination against age, sex or co-morbidity. We thus encourage practitioners to integrate state-of-the-art privacy-preserving techniques into medical AI model development.  ( 2 min )
    A Feature Selection Method for Driver Stress Detection Using Heart Rate Variability and Breathing Rate. (arXiv:2302.01602v1 [cs.LG])
    Driver stress is a major cause of car accidents and death worldwide. Furthermore, persistent stress is a health problem, contributing to hypertension and other diseases of the cardiovascular system. Stress has a measurable impact on heart and breathing rates and stress levels can be inferred from such measurements. Galvanic skin response is a common test to measure the perspiration caused by both physiological and psychological stress, as well as extreme emotions. In this paper, galvanic skin response is used to estimate the ground truth stress levels. A feature selection technique based on the minimal redundancy-maximal relevance method is then applied to multiple heart rate variability and breathing rate metrics to identify a novel and optimal combination for use in detecting stress. The support vector machine algorithm with a radial basis function kernel was used along with these features to reliably predict stress. The proposed method has achieved a high level of accuracy on the target dataset.  ( 2 min )
    Convergence Analysis of Split Learning on Non-IID Data. (arXiv:2302.01633v1 [cs.LG])
    Split Learning (SL) is one promising variant of Federated Learning (FL), where the AI model is split and trained at the clients and the server collaboratively. By offloading the computation-intensive portions to the server, SL enables efficient model training on resource-constrained clients. Despite its booming applications, SL still lacks rigorous convergence analysis on non-IID data, which is critical for hyperparameter selection. In this paper, we first prove that SL exhibits an $\mathcal{O}(1/\sqrt{R})$ convergence rate for non-convex objectives on non-IID data, where $R$ is the number of total training rounds. The derived convergence results can facilitate understanding the effect of some crucial factors in SL (e.g., data heterogeneity and synchronization interval). Furthermore, comparing with the convergence result of FL, we show that the guarantee of SL is worse than FL in terms of training rounds on non-IID data. The experimental results verify our theory. More findings on the comparison between FL and SL in cross-device settings are also reported.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: interpolating smoothly the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    SCCAM: Supervised Contrastive Convolutional Attention Mechanism for Ante-hoc Interpretable Fault Diagnosis with Limited Fault Samples. (arXiv:2302.01599v1 [cs.LG])
    In real industrial processes, fault diagnosis methods are required to learn from limited fault samples since the procedures are mainly under normal conditions and the faults rarely occur. Although attention mechanisms have become popular in the field of fault diagnosis, the existing attention-based methods are still unsatisfying for the above practical applications. First, pure attention-based architectures like transformers need a large number of fault samples to offset the lack of inductive biases thus performing poorly under limited fault samples. Moreover, the poor fault classification dilemma further leads to the failure of the existing attention-based methods to identify the root causes. To address the aforementioned issues, we innovatively propose a supervised contrastive convolutional attention mechanism (SCCAM) with ante-hoc interpretability, which solves the root cause analysis problem under limited fault samples for the first time. The proposed SCCAM method is tested on a continuous stirred tank heater and the Tennessee Eastman industrial process benchmark. Three common fault diagnosis scenarios are covered, including a balanced scenario for additional verification and two scenarios with limited fault samples (i.e., imbalanced scenario and long-tail scenario). The comprehensive results demonstrate that the proposed SCCAM method can achieve better performance compared with the state-of-the-art methods on fault classification and root cause analysis.  ( 2 min )
    A Novel Fuzzy Bi-Clustering Algorithm with AFS for Identification of Co-Regulated Genes. (arXiv:2302.01596v1 [cs.LG])
    The identification of co-regulated genes and their transcription-factor binding sites (TFBS) is a key step toward understanding transcription regulation. In addition to effective laboratory assays, various bi-clustering algorithms for detection of the co-expressed genes have been developed. Bi-clustering methods are used to discover subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. By building two fuzzy partition matrices of the gene expression data with the Axiomatic Fuzzy Set (AFS) theory, this paper proposes a novel fuzzy bi-clustering algorithm for identification of co-regulated genes. Specifically, the gene expression data is first transformed into two fuzzy partition matrices via the sub-preference relations theory of AFS. One matrix treats the genes as the universe and the conditions as the concept; the other treats the genes as the concept and the conditions as the universe. The identification of the co-regulated genes (bi-clusters) is carried out on the two partition matrices at the same time. Then, a novel fuzzy-based similarity criterion is defined based on the partition matrices, and a cyclic optimization algorithm is designed to discover the significant bi-clusters at expression level. The above procedures guarantee that the generated bi-clusters have more significant expression values than those extracted by the traditional bi-clustering methods. Finally, the performance of the proposed method is compared with that of three well-known bi-clustering algorithms on publicly available real microarray datasets. The experimental results are in agreement with the theoretical analysis and show that the proposed algorithm can effectively detect the co-regulated genes without any prior knowledge of the gene expression data.  ( 2 min )
    Learning to Decouple Complex Systems. (arXiv:2302.01581v1 [cs.LG])
    A complex system with cluttered observations may be a coupled mixture of multiple simple sub-systems corresponding to latent entities. Such sub-systems may hold distinct dynamics in the continuous-time domain; therein, complicated interactions between sub-systems also evolve over time. This setting is fairly common in the real world but has been less considered. In this paper, we propose a sequential learning approach under this setting by decoupling a complex system for handling irregularly sampled and cluttered sequential observations. Such decoupling brings about not only subsystems describing the dynamics of each latent entity but also a meta-system capturing the interaction between entities over time. Specifically, we argue that the meta-system evolving within a simplex is governed by projected differential equations (ProjDEs). We further analyze and provide neural-friendly projection operators in the context of Bregman divergence. Experimental results on synthetic and real-world datasets show the advantages of our approach when facing complex and cluttered sequential data compared to the state-of-the-art.  ( 2 min )
    Deep Reinforcement Learning for Cyber System Defense under Dynamic Adversarial Uncertainties. (arXiv:2302.01595v1 [cs.LG])
    Development of autonomous cyber system defense strategies and action recommendations in the real-world is challenging, and includes characterizing system state uncertainties and attack-defense dynamics. We propose a data-driven deep reinforcement learning (DRL) framework to learn proactive, context-aware, defense countermeasures that dynamically adapt to evolving adversarial behaviors while minimizing loss of cyber system operations. A dynamic defense optimization problem is formulated with multiple protective postures against different types of adversaries with varying levels of skill and persistence. A custom simulation environment was developed and experiments were devised to systematically evaluate the performance of four model-free DRL algorithms against realistic, multi-stage attack sequences. Our results suggest the efficacy of DRL algorithms for proactive cyber defense under multi-stage attack profiles and system uncertainties.  ( 2 min )
    Blockwise Self-Supervised Learning at Scale. (arXiv:2302.01647v1 [cs.CV])
    Current state-of-the-art deep networks are all powered by backpropagation. In this paper, we explore alternatives to full backpropagation in the form of blockwise learning rules, leveraging the latest developments in self-supervised learning. We show that a blockwise pretraining procedure consisting of training independently the 4 main blocks of layers of a ResNet-50 with Barlow Twins' loss function at each block performs almost as well as end-to-end backpropagation on ImageNet: a linear probe trained on top of our blockwise pretrained model obtains a top-1 classification accuracy of 70.48%, only 1.1% below the accuracy of an end-to-end pretrained network (71.57% accuracy). We perform extensive experiments to understand the impact of different components within our method and explore a variety of adaptations of self-supervised learning to the blockwise paradigm, building an exhaustive understanding of the critical avenues for scaling local learning rules to large networks, with implications ranging from hardware design to neuroscience.  ( 2 min )
    An Operational Perspective to Fairness Interventions: Where and How to Intervene. (arXiv:2302.01574v1 [cs.LG])
    As AI-based decision systems proliferate, their successful operationalization requires balancing multiple desiderata: predictive performance, disparity across groups, safeguarding sensitive group attributes (e.g., race), and engineering cost. We present a holistic framework for evaluating and contextualizing fairness interventions with respect to the above desiderata. The two key points of practical consideration are where (pre-, in-, post-processing) and how (in what way the sensitive group data is used) the intervention is introduced. We demonstrate our framework using a thorough benchmarking study on predictive parity; we study close to 400 methodological variations across two major model types (XGBoost vs. Neural Net) and ten datasets. Methodological insights derived from our empirical study inform the practical design of ML workflow with fairness as a central concern. We find predictive parity is difficult to achieve without using group data, and despite requiring group data during model training (but not inference), distributionally robust methods provide significant Pareto improvement. Moreover, a plain XGBoost model often Pareto-dominates neural networks with fairness interventions, highlighting the importance of model inductive bias.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
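The residual-fitting step described in the ResMem abstract can be illustrated with a small NumPy sketch. Everything concrete here is an assumption made for illustration, not the paper's setup: the base model is a least-squares line, the residual regressor is an unweighted k-nearest-neighbor average, and the names `fit_linear`, `knn_residual_regressor`, and `resmem_predict` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares base model with a bias term (stand-in for any predictor)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

def knn_residual_regressor(X_train, residuals, k):
    """Return a predictor that averages the residuals of the k nearest training points."""
    def predict(Z):
        # Pairwise squared Euclidean distances, shape (len(Z), len(X_train)).
        d2 = ((Z[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        nearest = np.argsort(d2, axis=1)[:, :k]
        return residuals[nearest].mean(axis=1)
    return predict

# Toy regression data (assumed): a nonlinear target that a line cannot fit.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.5 * X[:, 0]

base = fit_linear(X, y)
residuals = y - base(X)                      # what the base model gets wrong
memory = knn_residual_regressor(X, residuals, k=1)

def resmem_predict(Z):
    """Final prediction = base model output + memorized residual."""
    return base(Z) + memory(Z)

train_error_base = np.abs(y - base(X)).max()
train_error_resmem = np.abs(y - resmem_predict(X)).max()
```

With `k=1` and distinct training points, each training point is its own nearest neighbor, so the combined predictor reproduces the training labels, which matches the abstract's remark that the method memorizes them by construction.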
    Uniform tensor clustering by jointly exploring sample affinities of various orders. (arXiv:2302.01569v1 [cs.LG])
    Conventional clustering methods based on pairwise affinity usually suffer from the concentration effect when processing high-dimensional, low-sample-size data, which leads to inaccurate encoding of sample proximity and suboptimal clustering performance. To address this issue, we propose a unified tensor clustering method (UTC) that characterizes sample proximity using multiple samples' affinity, thereby supplementing rich spatial sample distributions to boost clustering. Specifically, we find that the triadic tensor affinity can be constructed via the Khatri-Rao product of two affinity matrices. Furthermore, our early work shows that the fourth-order tensor affinity is defined by the Kronecker product. Therefore, we utilize arithmetical products, Khatri-Rao and Kronecker products, to mathematically integrate different orders of affinity into a unified tensor clustering framework. Thus, the UTC learns a joint low-dimensional embedding to combine various orders. Finally, a numerical scheme is designed to solve the problem. Experiments on synthetic datasets and real-world datasets demonstrate that 1) the usage of high-order tensor affinity could provide a supplementary characterization of sample proximity to the popular affinity matrix; 2) the proposed method of UTC is affirmed to enhance clustering by exploiting different order affinities when processing high-dimensional data.  ( 2 min )
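To make the two products named in the abstract concrete, here is a minimal NumPy sketch of the Khatri-Rao (column-wise Kronecker) and Kronecker products applied to one small affinity matrix. The Gaussian affinity, the bandwidth, and the reshaping into third- and fourth-order arrays are illustrative assumptions; the paper's exact construction may differ.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column k equals kron(A[:, k], B[:, k])."""
    assert A.shape[1] == B.shape[1], "both matrices need the same number of columns"
    return np.einsum("ik,jk->ijk", A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

# Assumed toy data: 4 samples with 10 features, Gaussian pairwise affinity.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 10))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
S = np.exp(-sq_dists / 10.0)              # 4 x 4 pairwise affinity matrix

third = khatri_rao(S, S)                  # shape (16, 4)
fourth = np.kron(S, S)                    # shape (16, 16)

# Viewed as tensors: entry [i, j, k] of the third-order array is S[i, k] * S[j, k].
third_tensor = third.reshape(4, 4, 4)
fourth_tensor = fourth.reshape(4, 4, 4, 4)
```

The Khatri-Rao result is a column subset of the Kronecker product (columns `k * n + k`), which is one way to see how the two orders of affinity relate.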
    DynaMIX: Resource Optimization for DNN-Based Real-Time Applications on a Multi-Tasking System. (arXiv:2302.01568v1 [cs.LG])
    As deep neural networks (DNNs) prove their importance and feasibility, more and more DNN-based apps, such as detection and classification of objects, have been developed and deployed on autonomous vehicles (AVs). To meet their growing expectations and requirements, AVs should "optimize" use of their limited onboard computing resources for multiple concurrent in-vehicle apps while satisfying their timing requirements (especially for safety). That is, real-time AV apps should share the limited on-board resources with other concurrent apps without missing their deadlines dictated by the frame rate of a camera that generates and provides input images to the apps. However, most, if not all, of existing DNN solutions focus on enhancing the concurrency of their specific hardware without dynamically optimizing/modifying the DNN apps' resource requirements, subject to the number of running apps, owing to their high computational cost. To mitigate this limitation, we propose DynaMIX (Dynamic MIXed-precision model construction), which optimizes the resource requirement of concurrent apps and aims to maximize execution accuracy. To realize a real-time resource optimization, we formulate an optimization problem using app performance profiles to consider both the accuracy and worst-case latency of each app. We also propose dynamic model reconfiguration by lazy loading only the selected layers at runtime to reduce the overhead of loading the entire model. DynaMIX is evaluated in terms of constraint satisfaction and inference accuracy for a multi-tasking system and compared against state-of-the-art solutions, demonstrating its effectiveness and feasibility under various environmental/operating conditions.  ( 2 min )
    Multiplier Bootstrap-based Exploration. (arXiv:2302.01543v1 [cs.LG])
    Despite the great interest in the bandit problem, designing efficient algorithms for complex models remains challenging, as there is typically no analytical way to quantify uncertainty. In this paper, we propose Multiplier Bootstrap-based Exploration (MBE), a novel exploration strategy that is applicable to any reward model amenable to weighted loss minimization. We prove both instance-dependent and instance-independent rate-optimal regret bounds for MBE in sub-Gaussian multi-armed bandits. With extensive simulation and real data experiments, we show the generality and adaptivity of MBE.  ( 2 min )
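The general idea of exploring through a multiplier bootstrap can be sketched as follows: each arm's mean estimate is recomputed with random weights on its observed rewards, and the arm with the highest perturbed estimate is pulled. This is a deliberately naive illustration, not the MBE algorithm from the paper. The exponential multipliers, the warm-up schedule, the Gaussian rewards, and the function name `multiplier_bootstrap_bandit` are all assumptions, and this simplified version carries no regret guarantee and may under-explore.

```python
import numpy as np

def multiplier_bootstrap_bandit(true_means, horizon, rng):
    """Run a toy bandit loop that explores via randomly reweighted reward means."""
    n_arms = len(true_means)
    rewards = [[] for _ in range(n_arms)]
    pulls = np.zeros(n_arms, dtype=int)
    for t in range(horizon):
        if t < 2 * n_arms:
            # Warm-up (assumed): pull every arm twice in round-robin order.
            arm = t % n_arms
        else:
            scores = np.empty(n_arms)
            for a in range(n_arms):
                r = np.asarray(rewards[a])
                # Random multiplier weights; the weighted mean is a bootstrap
                # replicate of the arm's sample mean.
                w = rng.exponential(1.0, size=r.size)
                scores[a] = np.dot(w, r) / w.sum()
            arm = int(np.argmax(scores))
        # Assumed reward model: unit-variance Gaussian around the arm's mean.
        rewards[arm].append(rng.normal(true_means[arm], 1.0))
        pulls[arm] += 1
    return pulls

pulls = multiplier_bootstrap_bandit([0.0, 0.5, 1.5], horizon=200,
                                    rng=np.random.default_rng(0))
```

Because the perturbation only needs a weighted mean (or, more generally, a weighted loss minimizer), the same loop shape carries over to reward models without closed-form uncertainty estimates, which is the property the abstract emphasizes.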
    Example-Based Explainable AI and its Application for Remote Sensing Image Classification. (arXiv:2302.01526v1 [cs.AI])
    We present a method of explainable artificial intelligence (XAI), "What I Know (WIK)", which provides additional information for verifying the reliability of a deep learning model by showing an instance from the training dataset that is similar to the input data to be inferred, and we demonstrate it in a remote sensing image classification task. One of the expected roles of XAI methods is verifying whether inferences of a trained machine learning model are valid for an application, and the datasets used for training the model are as important a factor as the model architecture. Our data-centric approach can help determine whether the training dataset is sufficient for each inference by checking the selected example data. If the selected example looks similar to the input data, we can confirm that the model was not trained on a dataset with a feature distribution far from the feature of the input data. With this method, the criteria for selecting an example are not merely data similarity with the input data but also data similarity in the context of the model task. Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated with reasonably selected examples. This method can be applied to various machine-learning tasks, including classification and regression.  ( 2 min )
    DCM: Deep energy method based on the principle of minimum complementary energy. (arXiv:2302.01538v1 [cs.LG])
    The principle of minimum potential and complementary energy are the most important variational principles in solid mechanics. The deep energy method (DEM), which has received much attention, is based on the principle of minimum potential energy and lacks the important form of minimum complementary energy. Thus, we propose the deep energy method based on the principle of minimum complementary energy (DCM). The output function of DCM is the stress function that naturally satisfies the equilibrium equation. We extend the proposed DCM algorithm (DCM-P), adding the terms that naturally satisfy the biharmonic equation in the Airy stress function. We combine operator learning with physical equations and propose a deep complementary energy operator method (DCM-O), including branch net, trunk net, basis net, and particular net. DCM-O first combines existing high-fidelity numerical results to train DCM-O through data. Then the complementary energy is used to train the branch net and trunk net in DCM-O. To analyze DCM performance, we present the numerical result of the most common stress functions, the Prandtl and Airy stress function. The proposed method DCM is used to model the representative mechanical problems with the different types of boundary conditions. We compare DCM with the existing PINNs and DEM algorithms. The result shows the advantage of the proposed DCM is suitable for dealing with problems of dominated displacement boundary conditions, which is reflected in theory and our numerical experiments. DCM-P and DCM-O improve the accuracy of DCM and the speed of calculation convergence. DCM is an essential supplementary energy form of the deep energy method. We believe that operator learning based on the energy method can balance data and physical equations well, giving computational mechanics broad research prospects.  ( 2 min )
    Robust Camera Pose Refinement for Multi-Resolution Hash Encoding. (arXiv:2302.01571v1 [cs.CV])
    Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural renderings, such as NeRF. This method requires accurate camera poses for the neural renderings of given scenes. However, contrary to previous methods jointly optimizing camera poses and 3D scenes, the naive gradient-based camera pose refinement method using multi-resolution hash encoding severely deteriorates performance. We propose a joint optimization algorithm to calibrate the camera pose and learn a geometric representation using efficient multi-resolution hash encoding. Showing that the oscillating gradient flows of hash encoding interfere with the registration of camera poses, our method addresses the issue by utilizing smooth interpolation weighting to stabilize the gradient oscillation for the ray samplings across hash grids. Moreover, the curriculum training procedure helps to learn the level-wise hash encoding, further increasing the pose refinement. Experiments on the novel-view synthesis datasets validate that our learning frameworks achieve state-of-the-art performance and rapid convergence of neural rendering, even when initial camera poses are unknown.  ( 2 min )
    Machine Learning for UAV Propeller Fault Detection based on a Hybrid Data Generation Model. (arXiv:2302.01556v1 [cs.LG])
    This paper describes the development of an on-board data-driven system that can monitor and localize the fault in a quadrotor unmanned aerial vehicle (UAV) and at the same time, evaluate the degree of damage of the fault under real scenarios. To achieve offline training data generation, a hybrid approach is proposed for the development of a virtual data-generative model using a combination of data-driven models as well as well-established dynamic models that describe the kinematics of the UAV. To effectively represent the drop in performance of a faulty propeller, an LSTM network, a variant of the deep neural network, is proposed. With the RPM of the propeller as input and based on the fault condition of the propeller, the proposed propeller model estimates the resultant torque and thrust. Then, flight datasets of the UAV under various fault scenarios are generated via simulation using the developed data-generative model. Lastly, a fault classifier using a CNN model is proposed to identify the damaged propeller and evaluate its degree of damage. This paper focuses on the identification of faulty propellers and classification of the fault level for quadrotor UAVs using RPM as well as flight data. Doing so allows minor faults to be detected early, before they develop into serious faults through being left unrepaired. To further validate the workability of this approach outside of simulation, a real-flight test is conducted indoors. The real flight data is collected and a simulation-to-real (sim-real) test is conducted. Due to the imperfections in the build of our experimental UAV, a slight calibration approach to our simulation model is further proposed and the experimental results obtained show that our trained model can identify the location of propeller fault as well as the degree/type of damage. Currently, the diagnosis accuracy on the testing set is over 80%.  ( 3 min )
    Deep Reinforcement Learning for Online Error Detection in Cyber-Physical Systems. (arXiv:2302.01567v1 [cs.LG])
    Reliability is one of the major design criteria in Cyber-Physical Systems (CPSs), because CPSs run critical applications whose failure is catastrophic. Therefore, employing strong error detection and correction mechanisms in CPSs is essential. CPSs are composed of a variety of units, including sensors, networks, and microcontrollers. Each of these units can enter a faulty state at any time, and the resulting fault can produce erroneous output. The fault may cause the units of a CPS to malfunction and eventually crash. Traditional fault-tolerant approaches include time, hardware, information, and/or software redundancy. However, these approaches impose significant overheads and have low error coverage, which limits their applicability. In addition, the interval between error occurrence and detection is too long in these approaches. In this paper, based on Deep Reinforcement Learning (DRL), a new error detection approach is proposed that not only detects errors with high accuracy but can also perform error detection in real time thanks to its very low inference time. The proposed approach can distinguish different types of errors from normal data and predict whether the system will fail. The evaluation results show that the proposed approach improves accuracy by more than 2x and inference time by more than 5x compared to other approaches.  ( 2 min )
    Ordered GNN: Ordering Message Passing to Deal with Heterophily and Over-smoothing. (arXiv:2302.01524v1 [cs.LG])
    Most graph neural networks follow the message passing mechanism. However, this mechanism faces the over-smoothing problem when message passing is applied to a graph many times, causing indistinguishable node representations and preventing the model from effectively learning dependencies between farther-away nodes. On the other hand, features of neighboring nodes with different labels are likely to be falsely mixed, resulting in the heterophily problem. In this work, we propose to order the messages passing into the node representation, with specific blocks of neurons targeted for message passing within specific hops. This is achieved by aligning the hierarchy of the rooted-tree of a central node with the ordered neurons in its node representation. Experimental results on an extensive set of datasets show that our model can simultaneously achieve the state-of-the-art in both homophily and heterophily settings, without any targeted design. Moreover, its performance holds up well as the model becomes very deep, effectively preventing the over-smoothing problem. Finally, visualizing the gating vectors shows that our model learns to behave differently between homophily and heterophily settings, providing an explainable graph neural model.  ( 2 min )
    Causal Inference Based Single-branch Ensemble Trees For Uplift Modeling. (arXiv:2302.01563v1 [cs.LG])
    In this manuscript, we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET. Unlike standard classification methods for predictive probability modeling, CIET aims to estimate the change in the predictive probability of an outcome caused by an action or a treatment. In CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. Next, a novel single-branch tree is built by taking a top-down node partition approach, and the remaining samples are censored since they are not covered by the upper node partition logic. Repeating the tree-building process on the censored data, single-branch ensemble trees with a set of inference rules are thus formed. Moreover, CIET is experimentally demonstrated to significantly outperform previous approaches for uplift modeling in terms of both area under uplift curve (AUUC) and Qini coefficient. At present, CIET has already been applied to online personal loans in a national financial holdings group in China. CIET will also be of use to analysts applying machine learning techniques to causal inference in broader business domains such as web advertising, medicine and economics.  ( 2 min )
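As a small illustration of what an uplift-oriented partition criterion measures, the sketch below computes uplift as the difference in mean outcome between treated and control samples, then scans thresholds on one feature for the branch with the largest uplift. The criterion, the minimum group size, the toy data, and the names `uplift` and `best_threshold` are assumptions for illustration; they are not the two partition criteria defined in the CIET paper.

```python
import numpy as np

def uplift(outcome, treated):
    """Mean outcome among treated samples minus mean outcome among controls."""
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    return outcome[treated].mean() - outcome[~treated].mean()

def best_threshold(x, outcome, treated, min_group=2):
    """Find the rule `x >= thr` whose covered samples show the largest uplift."""
    x = np.asarray(x)
    outcome = np.asarray(outcome, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    best_thr, best_u = None, -np.inf
    for thr in np.unique(x):
        mask = x >= thr
        # Skip branches with too few treated or control samples to compare.
        if (treated & mask).sum() < min_group or (~treated & mask).sum() < min_group:
            continue
        u = uplift(outcome[mask], treated[mask])
        if u > best_u:
            best_thr, best_u = thr, u
    return best_thr, best_u

# Assumed toy data: treatment alternates, and responses concentrate at high x.
x       = [1, 2, 3, 4, 5, 6, 7, 8]
treated = [1, 0, 1, 0, 1, 0, 1, 0]
outcome = [0, 0, 0, 1, 1, 0, 1, 0]

overall = uplift(outcome, treated)              # 0.5 - 0.25
best_thr, best_u = best_threshold(x, outcome, treated)
```

On this toy data the rule `x >= 5` isolates a branch where every treated sample responds and no control sample does, so it scores higher than the population-wide uplift.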
    Group Fairness in Non-monotone Submodular Maximization. (arXiv:2302.01546v1 [cs.LG])
    Maximizing a submodular function has a wide range of applications in machine learning and data mining. One such application is data summarization, whose goal is to select a small set of representative and diverse data items from a large dataset. However, data items might have sensitive attributes such as race or gender. In this setting, it is important to design fairness-aware algorithms to mitigate potential algorithmic bias that may cause over- or under-representation of particular groups. Motivated by that, we propose and study the classic non-monotone submodular maximization problem subject to novel group fairness constraints. Our goal is to select a set of items that maximizes a non-monotone submodular function, while ensuring that the number of selected items from each group is proportionate to its size, to the extent specified by the decision maker. We develop the first constant-factor approximation algorithms for this problem. We also extend the basic model to incorporate an additional global size constraint on the total number of selected items.  ( 2 min )
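One way to read a proportional group constraint is as a per-group lower and upper bound on the number of selected items. The checker below assumes bounds of the form floor(alpha * group size) and ceil(beta * group size), with `alpha` and `beta` chosen by the decision maker. That form, the parameter names, and the function `is_group_fair` are assumptions for illustration and may not match the constraint defined in the paper.

```python
import math
from collections import Counter

def is_group_fair(selected, group_of, alpha, beta):
    """Check that each group's selected count lies within its proportional bounds."""
    group_sizes = Counter(group_of.values())
    picked = Counter(group_of[item] for item in selected)
    return all(
        math.floor(alpha * size) <= picked.get(group, 0) <= math.ceil(beta * size)
        for group, size in group_sizes.items()
    )

# Assumed toy ground set: four items in group g1 and two items in group g2.
group_of = {"a": "g1", "b": "g1", "c": "g1", "d": "g1", "e": "g2", "f": "g2"}

# With alpha=0.25 and beta=0.5: g1 needs 1 to 2 items, g2 needs 0 to 1 items.
balanced = is_group_fair({"a", "e"}, group_of, alpha=0.25, beta=0.5)
too_many_g1 = is_group_fair({"a", "b", "c"}, group_of, alpha=0.25, beta=0.5)
missing_g1 = is_group_fair({"e"}, group_of, alpha=0.25, beta=0.5)
```

A feasibility check like this would sit inside a selection algorithm as the constraint test; the approximation algorithms themselves are beyond what the abstract specifies.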
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Vertical Federated Learning: Taxonomies, Threats, and Prospects. (arXiv:2302.01550v1 [cs.LG])
    Federated learning (FL) is the most popular distributed machine learning technique. FL allows machine-learning models to be trained without gathering raw data at a single point for processing. Instead, local models are trained with local data; the models are then shared and combined. This approach preserves data privacy, as locally trained models are shared instead of the raw data themselves. Broadly, FL can be divided into horizontal federated learning (HFL) and vertical federated learning (VFL). For the former, different parties hold different samples over the same set of features; for the latter, different parties hold different feature data belonging to the same set of samples. In a number of practical scenarios, VFL is more relevant than HFL, as different companies (e.g., bank and retailer) hold different features (e.g., credit history and shopping history) for the same set of customers. Although VFL is an emerging area of research, it is not well-established compared to HFL. Besides, VFL-related studies are dispersed, and their connections are not intuitive. Thus, this survey aims to bring these VFL-related studies to one place. Firstly, we classify existing VFL structures and algorithms. Secondly, we present the threats to VFL from security and privacy perspectives. Thirdly, for the benefit of future researchers, we discuss the challenges and prospects of VFL in detail.  ( 2 min )
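    The difference between the two partitioning schemes described above can be illustrated with a toy dataset. This is only a sketch of how the data is split, not of any training protocol from the survey; the customer names, feature values, and helper functions are all made up for illustration.

```python
# Toy dataset: rows are customers (samples), columns are features.
# All names and values are hypothetical, chosen only to show the two splits.
FEATURES = ["credit_history", "shopping_history"]
DATA = {
    "alice": {"credit_history": 0.9, "shopping_history": 120.0},
    "bob": {"credit_history": 0.4, "shopping_history": 35.0},
    "carol": {"credit_history": 0.7, "shopping_history": 80.0},
    "dave": {"credit_history": 0.2, "shopping_history": 15.0},
}


def horizontal_split(data, sample_groups):
    """HFL-style split: each party holds different samples with all features."""
    return [{s: dict(data[s]) for s in group} for group in sample_groups]


def vertical_split(data, feature_groups):
    """VFL-style split: each party holds all samples but only some features."""
    return [
        {s: {f: row[f] for f in group} for s, row in data.items()}
        for group in feature_groups
    ]


# Two parties that each see half of the customers, with every feature.
hfl_parties = horizontal_split(DATA, [["alice", "bob"], ["carol", "dave"]])

# A "bank" and a "retailer" that each see every customer, with one feature.
vfl_parties = vertical_split(DATA, [["credit_history"], ["shopping_history"]])
```

    In the vertical case, no single party can train on the full feature vector of a customer, so the parties need some way to combine information across features for the same customers.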
    Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective. (arXiv:2302.01530v1 [cs.CL])
    Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method in the NLP field because of its strong performance. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present two simple observations that mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on these two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. We conducted a retrospective data analysis with chart review-based validation. Participants were 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes, and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Based on a chart review, 38% of the 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Multi-channel Autobidding with Budget and ROI Constraints. (arXiv:2302.01523v1 [cs.GT])
    In digital online advertising, advertisers procure ad impressions simultaneously on multiple platforms, or so-called channels, such as Google Ads, Meta Ads Manager, etc., each of which consists of numerous ad auctions. We study how an advertiser maximizes total conversion (e.g. ad clicks) while satisfying aggregate return-on-investment (ROI) and budget constraints across all channels. In practice, an advertiser does not have control over, and thus cannot globally optimize, which individual ad auctions she participates in for each channel, and instead authorizes a channel to procure impressions on her behalf: the advertiser can only utilize two levers on each channel, namely setting a per-channel budget and a per-channel target ROI. In this work, we first analyze the effectiveness of each of these levers for solving the advertiser's global multi-channel problem. We show that when an advertiser only optimizes over per-channel ROIs, her total conversion can be arbitrarily worse than what she could have obtained in the global problem. Further, we show that the advertiser can achieve the global optimal conversion when she only optimizes over per-channel budgets. In light of this finding, under a bandit feedback setting that mimics real-world scenarios where advertisers have limited information on ad auctions in each channel and how channels procure ads, we present an efficient learning algorithm that produces per-channel budgets whose resulting conversion approximates that of the global optimal problem. Finally, we argue that all our results hold for both single-item and multi-item auctions from which channels procure impressions on advertisers' behalf.  ( 2 min )
    Pseudonorm Approachability and Applications to Regret Minimization. (arXiv:2302.01517v1 [cs.LG])
    Blackwell's celebrated approachability theory provides a general framework for a variety of learning problems, including regret minimization. However, Blackwell's proof and implicit algorithm measure approachability using the $\ell_2$ (Euclidean) distance. We argue that in many applications such as regret minimization, it is more useful to study approachability under other distance metrics, most commonly the $\ell_\infty$-metric. However, the time and space complexity of the algorithms designed for $\ell_\infty$-approachability depend on the dimension of the space of the vectorial payoffs, which is often prohibitively large. Thus, we present a framework for converting high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems, thereby resolving such issues. We first show that the $\ell_\infty$-distance between the average payoff and the approachability set can be equivalently defined as a pseudodistance between a lower-dimensional average vector payoff and a new convex set we define. Next, we develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms, showing that it can be achieved via online linear optimization (OLO) over a convex set given by the Fenchel dual of the unit pseudonorm ball. We then use that to show, modulo mild normalization assumptions, that there exists an $\ell_\infty$-approachability algorithm whose convergence is independent of the dimension of the original vectorial payoff. We further show that this algorithm admits a polynomial-time complexity, assuming that the original $\ell_\infty$-distance can be computed efficiently. We also give an $\ell_\infty$-approachability algorithm whose convergence is logarithmic in that dimension using an FTRL algorithm with a maximum-entropy regularizer.  ( 2 min )
    Improving Recommendation Relevance by simulating User Interest. (arXiv:2302.01522v1 [math.NA])
    Most, if not all, on-line item-to-item recommendation systems rely on estimating a distance-like measure (rank) of similarity between items. For on-line recommendation systems, the time sensitivity of this similarity measure is extremely important. We observe that recommendation "recency" can be straightforwardly and transparently maintained by iterative reduction of the ranks of inactive items. The paper briefly summarizes algorithmic developments based on this self-explanatory observation. The basic idea behind this work is patented in the context of online recommendation systems.  ( 2 min )
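    The observation above, that recency can be maintained by repeatedly reducing the ranks of inactive items, might look roughly like the sketch below. The abstract does not specify the update rule, so the multiplicative decay, the `decay` factor, the convention that a higher rank means a stronger recommendation, and all item names are assumptions made for illustration.

```python
def decay_inactive(ranks, active_items, decay=0.5):
    """Return new ranks where items without recent activity are scaled down.

    `ranks` maps item id -> similarity rank (assumed: higher is better).
    `active_items` is the set of items that saw activity in this period.
    The multiplicative rule and the default `decay` are hypothetical.
    """
    return {
        item: (rank if item in active_items else rank * decay)
        for item, rank in ranks.items()
    }


# Hypothetical ranks of candidate items relative to some anchor item.
ranks = {"a": 0.6, "b": 0.8, "c": 1.0}

# Over two periods only item "a" stays active, so "b" and "c" fade.
for _ in range(2):
    ranks = decay_inactive(ranks, active_items={"a"})

top_item = max(ranks, key=ranks.get)
```

    In this toy run the initially lowest-ranked item ends up on top simply because it stayed active while the others did not.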
    Randomized Gaussian Process Upper Confidence Bound with Tight Bayesian Regret Bounds. (arXiv:2302.01511v1 [cs.LG])
    Gaussian process upper confidence bound (GP-UCB) is a theoretically promising approach for black-box optimization; however, the confidence parameter $\beta$ is considerably large in theory and chosen heuristically in practice. Randomized GP-UCB (RGP-UCB) instead uses a randomized confidence parameter, which follows the Gamma distribution, to mitigate the impact of manually specifying $\beta$. This study first generalizes the regret analysis of RGP-UCB to a wider class of distributions, including the Gamma distribution. Furthermore, we propose improved RGP-UCB (IRGP-UCB) based on a two-parameter exponential distribution, which achieves tight Bayesian regret bounds. IRGP-UCB does not require the confidence parameter to increase with the number of iterations, which avoids over-exploration in the later iterations. Finally, we demonstrate the effectiveness of IRGP-UCB through extensive experiments.  ( 2 min )
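    As a rough illustration of a randomized confidence parameter, the sketch below draws $\beta$ from a two-parameter (shifted) exponential distribution and plugs it into the usual UCB score, posterior mean plus $\sqrt{\beta}$ times posterior standard deviation. The shift and rate values, the candidate posterior values, and the function names are hypothetical and are not the settings analyzed in the paper.

```python
import math
import random


def sample_beta(shift, rate, rng):
    """Draw beta from a two-parameter exponential: shift + Exp(rate)."""
    return shift + rng.expovariate(rate)


def ucb(mean, std, beta):
    """Upper confidence bound score for one candidate point."""
    return mean + math.sqrt(beta) * std


rng = random.Random(0)

# Hypothetical GP posterior (mean, std) at three candidate points.
candidates = [(0.2, 0.5), (0.5, 0.1), (0.4, 0.3)]

# One iteration: sample beta once, then pick the candidate with the best score.
beta = sample_beta(shift=1.0, rate=2.0, rng=rng)
scores = [ucb(mean, std, beta) for mean, std in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)
```

    Because `beta` is resampled at every iteration from a fixed distribution, nothing in this sketch forces it to grow with the iteration count.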
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since the choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with a total budget of $O\left( \left( \frac{1}{\epsilon} \right)^{d_z + \beta} \right)$, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for the noise schedule of diffusion models. Comparison with the default schedule shows that the BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    ANTM: An Aligned Neural Topic Model for Exploring Evolving Topics. (arXiv:2302.01501v1 [cs.IR])
    As the amount of text data generated by humans and machines increases, the necessity of understanding large corpora and finding a way to extract insights from them is becoming more crucial than ever. Dynamic topic models are effective methods that primarily focus on studying the evolution of topics present in a collection of documents. These models are widely used for understanding trends, exploring public opinion in social networks, or tracking research progress and discoveries in scientific archives. Since topics are defined as clusters of semantically similar documents, it is necessary to observe the changes in the content or themes of these clusters in order to understand how topics evolve as new knowledge is discovered over time. In this paper, we introduce the Aligned Neural Topic Model (ANTM), a dynamic neural topic model that uses document embeddings to compute clusters of semantically similar documents at different periods and to align document clusters to represent their evolution. This alignment procedure preserves the temporal similarity of document clusters over time and captures the semantic change of words characterized by their context within different periods. Experiments on four different datasets show that ANTM outperforms probabilistic dynamic topic models (e.g. DTM, DETM) and significantly improves topic coherence and diversity over other existing dynamic neural topic models (e.g. BERTopic).  ( 2 min )
    LSA-PINN: Linear Boundary Connectivity Loss for Solving PDEs on Complex Geometry. (arXiv:2302.01518v1 [cs.LG])
    We present a novel loss formulation for efficient learning of complex dynamics from governing physics, typically described by partial differential equations (PDEs), using physics-informed neural networks (PINNs). In our experiments, existing versions of PINNs are seen to learn poorly in many problems, especially for complex geometries, as it becomes increasingly difficult to establish an appropriate sampling strategy in the near-boundary region. Overly dense sampling can impede training convergence if the local gradient behaviors are too complex to be adequately modelled by PINNs. On the other hand, if the samples are too sparse, existing PINNs tend to overfit the near-boundary region, leading to incorrect solutions. To prevent such issues, we propose a new Boundary Connectivity (BCXN) loss function which provides a linear local structure approximation (LSA) to the gradient behaviors at the boundary for PINNs. Our BCXN-loss implicitly imposes local structure during training, thus facilitating fast physics-informed learning across entire problem domains with an order of magnitude sparser training samples. This LSA-PINN method shows a few orders of magnitude smaller errors than existing methods in terms of the standard L2-norm metric, while using dramatically fewer training samples and iterations. Our proposed LSA-PINN does not pose any requirement on the differentiable property of the networks, and we demonstrate its benefits and ease of implementation on both multi-layer perceptron and convolutional neural network versions as commonly used in current PINN literature.  ( 2 min )
    Xtal2DoS: Attention-based Crystal to Sequence Learning for Density of States Prediction. (arXiv:2302.01486v1 [cs.LG])
    Modern machine learning techniques have been extensively applied to materials science, especially for property prediction tasks. A majority of these methods address scalar property predictions, while more challenging spectral properties remain less emphasized. We formulate a crystal-to-sequence learning task and propose a novel attention-based learning method, Xtal2DoS, which decodes the sequential representation of the material density of states (DoS) properties by incorporating the learned atomic embeddings through attention networks. Experiments show Xtal2DoS is faster than the existing models, and consistently outperforms other state-of-the-art methods on four metrics for two fundamental spectral properties, phonon and electronic DoS.  ( 2 min )
    User-centric Heterogeneous-action Deep Reinforcement Learning for Virtual Reality in the Metaverse over Wireless Networks. (arXiv:2302.01471v1 [cs.NI])
    The Metaverse is emerging as maturing technologies empower its different facets. Virtual Reality (VR) technologies serve as the backbone of the virtual universe within the Metaverse to offer a highly immersive user experience. As mobility is emphasized in the Metaverse context, VR devices reduce their weight at the expense of local computation abilities. In this paper, for a system consisting of a Metaverse server and multiple VR users, we consider two cases: (i) the server generating frames and transmitting them to users, and (ii) users generating frames locally and thus consuming device energy. Moreover, in our multi-user VR scenario for the Metaverse, users have different characteristics and demands for Frames Per Second (FPS). The channel access arrangement (including the decisions on frame generation location) and the transmission powers for the downlink communications from the server to the users are then jointly optimized to improve the utilities of users. This joint optimization is addressed by deep reinforcement learning (DRL) with heterogeneous actions. Our proposed user-centric DRL algorithm is called User-centric Critic with Heterogenous Actors (UCHA). Extensive experiments demonstrate that our UCHA algorithm leads to remarkable results under various requirements and constraints.  ( 2 min )
    Clustered Embedding Learning for Recommender Systems. (arXiv:2302.01478v1 [cs.AI])
    In recent years, recommender systems have advanced rapidly, and embedding learning for users and items plays a critical role in them. A standard method learns a unique embedding vector for each user and item. However, such a method has two important limitations in real-world applications: 1) it is hard to learn embeddings that generalize well for users and items with rare interactions on their own; and 2) it may incur unbearably high memory costs when the number of users and items scales up. Existing approaches either address only one of the limitations or have flawed overall performance. In this paper, we propose Clustered Embedding Learning (CEL) as an integrated solution to these two problems. CEL is a plug-and-play embedding learning framework that can be combined with any differentiable feature interaction model. It is capable of achieving improved performance, especially for cold users and items, with reduced memory cost. CEL enables automatic and dynamic clustering of users and items in a top-down fashion, where clustered entities jointly learn a shared embedding. The accelerated version of CEL has an optimal time complexity, which supports efficient online updates. Theoretically, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Empirically, we validate the effectiveness of CEL on three public datasets and one business dataset, showing its consistently superior performance against current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of $+0.6\%$ in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets $2650$ times smaller.  ( 2 min )
    Learning to Optimize for Reinforcement Learning. (arXiv:2302.01470v1 [cs.LG])
    In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning optimization, outperforming classical hand-designed optimizers. However, in practice, these learned optimizers fail to generalize to reinforcement learning tasks due to unstable and complex loss landscapes. Moreover, neither hand-designed optimizers nor learned optimizers have been specifically designed to address the unique optimization properties in reinforcement learning. In this work, we take a data-driven approach to learn to optimize for reinforcement learning using meta-learning. We introduce a novel optimizer structure that significantly improves the training efficiency of learned optimizers, making it possible to learn an optimizer for reinforcement learning from scratch. Although trained in toy tasks, our learned optimizer demonstrates its generalization ability to unseen complex tasks. Finally, we design a set of small gridworlds to train the first general-purpose optimizer for reinforcement learning.  ( 2 min )
    Towards Practical Preferential Bayesian Optimization with Skew Gaussian Processes. (arXiv:2302.01513v1 [cs.LG])
    We study preferential Bayesian optimization (BO) where reliable feedback is limited to pairwise comparisons called duels. An important challenge in preferential BO, which uses the preferential Gaussian process (GP) model to represent flexible preference structure, is that the posterior distribution is a computationally intractable skew GP. The most widely used approach for preferential BO is Gaussian approximation, which ignores the skewness of the true posterior. Alternatively, Markov chain Monte Carlo (MCMC) based preferential BO has also been proposed. In this work, we first verify the accuracy of Gaussian approximation, from which we reveal the critical problem that the predictive probability of duels can be inaccurate. This observation motivates us to improve the MCMC-based estimation for skew GP, for which we show the practical efficiency of Gibbs sampling and derive a low-variance MC estimator. However, the computational time of MCMC can still be a bottleneck in practice. Towards building a more practical preferential BO, we develop a new method that achieves both high computational efficiency and low sample complexity, and then demonstrate its effectiveness through extensive numerical experiments.  ( 2 min )
    Perfect Is the Enemy of Test Oracle. (arXiv:2302.01488v1 [cs.SE])
    Automation of test oracles is one of the most challenging facets of software testing, but remains less addressed than automated test input generation. Test oracles rely on a ground-truth that can distinguish between correct and buggy behavior to determine whether a test fails (detects a bug) or passes. What makes the oracle problem challenging and undecidable is the assumption that the ground-truth should know the exact expected, correct, or buggy behavior. However, we argue that one can still build an accurate oracle without knowing the exact correct or buggy behavior, knowing only how the two might differ. This paper presents SEER, a learning-based approach that, in the absence of test assertions or other types of oracle, can determine whether a unit test passes or fails on a given method under test (MUT). To build the ground-truth, SEER jointly embeds unit tests and the implementation of MUTs into a unified vector space, in such a way that the neural representations of tests are similar to those of the MUTs they pass on, but dissimilar to those of the MUTs they fail on. The classifier built on top of this vector representation serves as the oracle, generating "fail" labels when test inputs detect a bug in the MUT and "pass" labels otherwise. Our extensive experiments on applying SEER to more than 5K unit tests from a diverse set of open-source Java projects show that the produced oracle is (1) effective in predicting the fail or pass labels, achieving an overall accuracy, precision, recall, and F1 measure of 93%, 86%, 94%, and 90%, (2) generalizable, predicting the labels for the unit tests of projects that were not in the training or validation set with a negligible performance drop, and (3) efficient, detecting the existence of bugs in only 6.5 milliseconds on average.  ( 3 min )
    LazyGNN: Large-Scale Graph Neural Networks via Lazy Propagation. (arXiv:2302.01503v1 [cs.LG])
    Recent works have demonstrated the benefits of capturing long-distance dependency in graphs by deeper graph neural networks (GNNs). But deeper GNNs suffer from the long-lasting scalability challenge due to the neighborhood explosion problem in large-scale graphs. In this work, we propose to capture long-distance dependency in graphs by shallower models instead of deeper models, which leads to a much more efficient model, LazyGNN, for graph representation learning. Moreover, we demonstrate that LazyGNN is compatible with existing scalable approaches (such as sampling methods) for further acceleration through the development of mini-batch LazyGNN. Comprehensive experiments demonstrate its superior prediction performance and scalability on large-scale benchmarks. LazyGNN also achieves state-of-the-art performance on the OGB leaderboard.  ( 2 min )
    Defensive ML: Defending Architectural Side-channels with Adversarial Obfuscation. (arXiv:2302.01474v1 [cs.CR])
    Side-channel attacks that use machine learning (ML) for signal analysis have become prominent threats to computer security, as ML models easily find patterns in signals. To address this problem, this paper explores using Adversarial Machine Learning (AML) methods as a defense at the computer architecture layer to obfuscate side channels. We call this approach Defensive ML, and we call the generator that obfuscates signals the defender. Defensive ML is a workflow to design, implement, train, and deploy defenders for different environments. First, we design a defender architecture given the physical characteristics and hardware constraints of the side channel. Next, we use our DefenderGAN structure to train the defender. Finally, we apply Defensive ML to thwart two side-channel attacks: one based on memory contention and the other on application power. The former uses a hardware defender with ns-level response time that attains a high level of security with half the performance impact of a traditional scheme; the latter uses a software defender with ms-level response time that provides better security than a traditional scheme with only 70% of its power overhead.  ( 2 min )
    Gradient Estimation for Unseen Domain Risk Minimization with Pre-Trained Models. (arXiv:2302.01497v1 [cs.LG])
    Domain generalization aims to build generalized models that perform well on unseen domains when only source domains are available for model optimization. Recent studies have demonstrated that large-scale pre-trained models could play an important role in domain generalization by providing their generalization power. However, large-scale pre-trained models are not fully equipped with target task-specific knowledge due to a discrepancy between the pre-training objective and the target task. Although the task-specific knowledge could be learned from source domains by fine-tuning, this hurts the generalization power of the pre-trained models because of gradient bias toward the source domains. To address this issue, we propose a new domain generalization method that estimates unobservable gradients that reduce potential risks in unseen domains, using a large-scale pre-trained model. Our proposed method allows the pre-trained model to learn task-specific knowledge further while preserving its generalization ability with the estimated gradients. Experimental results show that our proposed method outperforms baseline methods on DomainBed, a standard benchmark in domain generalization. We also provide extensive analyses to demonstrate that the estimated unobserved gradients relieve the gradient bias, and the pre-trained model learns the task-specific knowledge without sacrificing its generalization power.  ( 2 min )
    SPADE: Self-supervised Pretraining for Acoustic DisEntanglement. (arXiv:2302.01483v1 [cs.LG])
    Self-supervised representation learning approaches have grown in popularity due to the ability to train models on large amounts of unlabeled data and have demonstrated success in diverse fields such as natural language processing, computer vision, and speech. Previous self-supervised work in the speech domain has disentangled multiple attributes of speech such as linguistic content, speaker identity, and rhythm. In this work, we introduce a self-supervised approach to disentangle room acoustics from speech and use the acoustic representation on the downstream task of device arbitration. Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce, indicating that our pretraining scheme learns to encode room acoustic information while remaining invariant to other attributes of the speech signal.  ( 2 min )
    Efficient Domain Adaptation for Speech Foundation Models. (arXiv:2302.01496v1 [cs.CL])
    Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources such as different modalities, languages and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on an additional 300M supervised in-domain data.  ( 2 min )
    Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms. (arXiv:2302.01450v1 [cs.LG])
    Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor close to 1, which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first non-trivial error bounds for average-reward MDPs which go to zero in the limit where policy evaluation and policy improvement errors go to zero.  ( 2 min )
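    The scaling argument in the abstract above can be written schematically as follows. This is only a sketch of the reasoning: $\gamma$ is the discount factor, $H$ the effective horizon, and $\epsilon > 0$ is a hypothetical symbol standing in for a fixed approximation error; constants and the exact form of the discounted bounds are omitted.

```latex
H \approx \frac{1}{1-\gamma},
\qquad
\text{discounted error bound} \;\propto\; \epsilon H^{2},
\qquad
\frac{\epsilon H^{2}}{H} \;=\; \epsilon H \;\longrightarrow\; \infty
\quad \text{as } \gamma \to 1 .
```

    Normalizing by the horizon removes only one factor of $H$, so for any fixed $\epsilon$ the normalized bound still diverges as the discount factor approaches 1.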
    Spiking Synaptic Penalty: Appropriate Penalty Term for Energy-Efficient Spiking Neural Networks. (arXiv:2302.01500v1 [cs.LG])
    Spiking neural networks (SNNs) are energy-efficient neural networks because of their spiking nature. However, as the spike firing rate of SNNs increases, the energy consumption does as well, and thus, the advantage of SNNs diminishes. Here, we tackle this problem by introducing a novel penalty term for the spiking activity into the objective function in the training phase. Our method is designed so as to optimize the energy consumption metric directly without modifying the network architecture. Therefore, the proposed method can reduce the energy consumption more than other methods while maintaining the accuracy. We conducted experiments for image classification tasks, and the results indicate the effectiveness of the proposed method, which mitigates the dilemma of the energy--accuracy trade-off.  ( 2 min )
    Convergence of Gradient Descent with Linearly Correlated Noise and Applications to Differentially Private Learning. (arXiv:2302.01463v1 [cs.LG])
    We study stochastic optimization with linearly correlated noise. Our study is motivated by recent methods for optimization with differential privacy (DP), such as DP-FTRL, which inject noise via matrix factorization mechanisms. We propose an optimization problem that distils key facets of these DP methods and that involves perturbing gradients by linearly correlated noise. We derive improved convergence rates for gradient descent in this framework for convex and non-convex loss functions. Our theoretical analysis is novel and might be of independent interest. We use these convergence rates to develop new, effective matrix factorizations for differentially private optimization, and highlight the benefits of these factorizations theoretically and empirically.  ( 2 min )
    Commonsense-Aware Prompting for Controllable Empathetic Dialogue Generation. (arXiv:2302.01441v1 [cs.CL])
    Improving the emotional awareness of pre-trained language models is an emerging important problem for dialogue generation tasks. Although prior studies have introduced methods to improve empathetic dialogue generation, few have discussed how to incorporate commonsense knowledge into pre-trained language models for controllable dialogue generation. In this study, we propose a novel framework that improves empathetic dialogue generation using pre-trained language models by 1) incorporating commonsense knowledge through prompt verbalization, and 2) controlling dialogue generation using a strategy-driven future discriminator. We conducted experiments to reveal that both the incorporation of social commonsense knowledge and enforcement of control over generation help to improve generation performance. Finally, we discuss the implications of our study for future research.  ( 2 min )
    Generalized Uncertainty of Deep Neural Networks: Taxonomy and Applications. (arXiv:2302.01440v1 [cs.LG])
    Deep neural networks have seen enormous success in various real-world applications. Beyond their predictions as point estimates, increasing attention has been focused on quantifying the uncertainty of their predictions. In this review, we show that the uncertainty of deep neural networks is not only important in the sense of interpretability and transparency, but also crucial in further advancing their performance, particularly in learning systems seeking robustness and efficiency. We will generalize the definition of the uncertainty of deep neural networks to any number or vector that is associated with an input or an input-label pair, and catalog existing methods for "mining" such uncertainty from a deep model. We will include those methods from the classic field of uncertainty quantification as well as those methods that are specific to deep neural networks. We then show a wide spectrum of applications of such generalized uncertainty in realistic learning tasks including robust learning such as noisy learning and adversarially robust learning; data-efficient learning such as semi-supervised and weakly-supervised learning; and model-efficient learning such as model compression and knowledge distillation.  ( 2 min )
    Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search. (arXiv:2302.01382v1 [cs.LG])
    Serving large-scale machine learning (ML) models efficiently and with low latency has become challenging owing to increasing model size and complexity. Quantizing models can simultaneously reduce memory and compute requirements, facilitating their widespread access. However, for large models not all layers are equally amenable to the same numerical precision and aggressive quantization can lead to unacceptable loss in model accuracy. One approach to prevent this accuracy degradation is mixed-precision quantization, which allows different tensors to be quantized to varying levels of numerical precision, leveraging the capabilities of modern hardware. Such mixed-precision quantization can more effectively allocate numerical precision to different tensors `as needed' to preserve model accuracy while reducing footprint and compute latency. In this paper, we propose a method to efficiently determine quantization configurations of different tensors in ML models using post-training mixed precision quantization. We analyze three sensitivity metrics and evaluate them for guiding configuration search of two algorithms. We evaluate our method for computer vision and natural language processing and demonstrate latency reductions of up to 27.59% and 34.31% compared to the baseline 16-bit floating point model while guaranteeing no more than 1% accuracy degradation.  ( 2 min )
    A Reduction-based Framework for Sequential Decision Making with Delayed Feedback. (arXiv:2302.01477v1 [cs.LG])
    We study stochastic delayed feedback in general multi-agent sequential decision making, which includes bandits, single-agent Markov decision processes (MDPs), and Markov games (MGs). We propose a novel reduction-based framework, which turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that can handle stochastic delays in sequential decision making. By plugging different multi-batched algorithms into our framework, we provide several examples demonstrating that our framework not only matches or improves existing results for bandits, tabular MDPs, and tabular MGs, but also provides the first line of studies on delays in sequential decision making with function approximation. In summary, we provide a complete set of sharp results for multi-agent sequential decision making with delayed feedback.  ( 2 min )
    Out of Context: Investigating the Bias and Fairness Concerns of "Artificial Intelligence as a Service". (arXiv:2302.01448v1 [cs.LG])
    "AI as a Service" (AIaaS) is a rapidly growing market, offering various plug-and-play AI services and tools. AIaaS enables its customers (users) - who may lack the expertise, data, and/or resources to develop their own systems - to easily build and integrate AI capabilities into their applications. Yet, it is known that AI systems can encapsulate biases and inequalities that can have societal impact. This paper argues that the context-sensitive nature of fairness is often incompatible with AIaaS' 'one-size-fits-all' approach, leading to issues and tensions. Specifically, we review and systematise the AIaaS space by proposing a taxonomy of AI services based on the levels of autonomy afforded to the user. We then critically examine the different categories of AIaaS, outlining how these services can lead to biases or be otherwise harmful in the context of end-user applications. In doing so, we seek to draw research attention to the challenges of this emerging area.  ( 2 min )
    Continual Learning with Scaled Gradient Projection. (arXiv:2302.01386v1 [cs.LG])
    In neural networks, continual learning results in gradient interference among sequential tasks, leading to catastrophic forgetting of old tasks while learning new ones. This issue is addressed in recent methods by storing the important gradient spaces for old tasks and updating the model orthogonally during new tasks. However, such restrictive orthogonal gradient updates hamper the learning capability of the new tasks resulting in sub-optimal performance. To improve new learning while minimizing forgetting, in this paper we propose a Scaled Gradient Projection (SGP) method, where we combine the orthogonal gradient projections with scaled gradient steps along the important gradient spaces for the past tasks. The degree of gradient scaling along these spaces depends on the importance of the bases spanning them. We propose an efficient method for computing and accumulating importance of these bases using the singular value decomposition of the input representations for each task. We conduct extensive experiments ranging from continual image classification to reinforcement learning tasks and report better performance with less training overhead than the state-of-the-art approaches.  ( 2 min )
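    The projection idea in the abstract above can be sketched in a few lines. This is only an illustration, not the paper's SGP update: the function name `scaled_projection`, the assumption that `basis` is orthonormal, and the convention that an importance of 1.0 removes a direction completely are all hypothetical choices made here.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scaled_projection(grad, basis, importance):
    """Remove a fraction importance[i] of grad's component along basis[i].

    Assumes `basis` holds orthonormal vectors (hypothetical setup).
    importance[i] == 1.0 gives a fully orthogonal update along basis[i];
    smaller values leave a scaled step along that direction.
    """
    out = list(grad)
    for b, lam in zip(basis, importance):
        coeff = dot(grad, b)  # component of the raw gradient along b
        out = [o - lam * coeff * bi for o, bi in zip(out, b)]
    return out


grad = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
importance = [1.0, 0.5]
projected = scaled_projection(grad, basis, importance)
```

    Here the first direction is blocked entirely, the second keeps half of its gradient component, and the third (outside the stored space) is untouched.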
    Effective Robustness against Natural Distribution Shifts for Models with Different Training Data. (arXiv:2302.01381v1 [cs.LG])
    "Effective robustness" measures the extra out-of-distribution (OOD) robustness beyond what can be predicted from the in-distribution (ID) performance. Existing effective robustness evaluations typically use a single test set such as ImageNet to evaluate ID accuracy. This becomes problematic when evaluating models trained on different data distributions, e.g., comparing models trained on ImageNet vs. zero-shot language-image pre-trained models trained on LAION. In this paper, we propose a new effective robustness evaluation metric to compare the effective robustness of models trained on different data distributions. To do this we control for the accuracy on multiple ID test sets that cover the training distributions for all the evaluated models. Our new evaluation metric provides a better estimate of the effective robustness and explains the surprising effective robustness gains of zero-shot CLIP-like models exhibited when considering only one ID dataset, while the gains diminish under our evaluation.  ( 2 min )
    Hyper-parameter Tuning for Fair Classification without Sensitive Attribute Access. (arXiv:2302.01385v1 [cs.LG])
    Fair machine learning methods seek to train models that balance model performance across demographic subgroups defined over sensitive attributes like race and gender. Although sensitive attributes are typically assumed to be known during training, they may not be available in practice due to privacy and other logistical concerns. Recent work has sought to train fair models without sensitive attributes on training data. However, these methods need extensive hyper-parameter tuning to achieve good results, and hence assume that sensitive attributes are known on validation data. However, this assumption too might not be practical. Here, we propose Antigone, a framework to train fair classifiers without access to sensitive attributes on either training or validation data. Instead, we generate pseudo sensitive attributes on the validation data by training a biased classifier and using the classifier's incorrectly (correctly) labeled examples as proxies for minority (majority) groups. Since fairness metrics like demographic parity, equal opportunity and subgroup accuracy can be estimated to within a proportionality constant even with noisy sensitive attribute information, we show theoretically and empirically that these proxy labels can be used to maximize fairness under average accuracy constraints. Key to our results is a principled approach to select the hyper-parameters of the biased classifier in a completely unsupervised fashion (meaning without access to ground truth sensitive attributes) that minimizes the gap between fairness estimated using noisy versus ground-truth sensitive labels.  ( 2 min )
    Neural Insights for Digital Marketing Content Design. (arXiv:2302.01416v1 [cs.LG])
    In digital marketing, experimenting with new website content is one of the key levers to improve customer engagement. However, creating successful marketing content is a manual and time-consuming process that lacks clear guiding principles. This paper seeks to close the loop between content creation and online experimentation by offering marketers AI-driven actionable insights based on historical data to improve their creative process. We present a neural-network-based system that scores and extracts insights from a marketing content design, namely, a multimodal neural network predicts the attractiveness of marketing contents, and a post-hoc attribution method generates actionable insights for marketers to improve their content in specific marketing locations. Our insights not only point out the advantages and drawbacks of a given current content, but also provide design recommendations based on historical data. We show that our scoring model and insights work well both quantitatively and qualitatively.  ( 2 min )
    Provably Bounding Neural Network Preimages. (arXiv:2302.01404v1 [cs.LG])
    Most work on the formal verification of neural networks has focused on bounding forward images of neural networks, i.e., the set of outputs of a neural network that correspond to a given set of inputs (for example, bounded perturbations of a nominal input). However, many use cases of neural network verification require solving the inverse problem, i.e., over-approximating the set of inputs that lead to certain outputs. In this work, we present the first efficient bound propagation algorithm, INVPROP, for verifying properties over the preimage of a linearly constrained output set of a neural network, which can be combined with branch-and-bound to achieve completeness. Our efficient algorithm allows multiple passes of intermediate bound refinements, which are crucial for tight inverse verification because the bounds of an intermediate layer depend on relaxations both before and after this layer. We demonstrate our algorithm on applications related to quantifying safe control regions for a dynamical system and detecting out-of-distribution inputs to a neural network. Our results show that in certain settings, we can find over-approximations that are over 2500 times tighter than prior work while being 2.5 times faster on the same hardware.  ( 2 min )
    A Convolutional-based Model for Early Prediction of Alzheimer's based on the Dementia Stage in the MRI Brain Images. (arXiv:2302.01417v1 [cs.LG])
    Alzheimer's disease is a degenerative brain disease. It is the primary cause of dementia in adults and progressively destroys memory. Though Alzheimer's disease does not have a cure currently, diagnosing it at an earlier stage will help reduce the severity of the disease. Thus, early diagnosis of Alzheimer's could help to reduce or stop the disease from progressing. In this paper, we propose a deep convolutional neural network-based model that determines the stage of dementia in adults from Magnetic Resonance Imaging (MRI) images in order to detect the early onset of Alzheimer's.  ( 2 min )
    Hyperbolic Contrastive Learning. (arXiv:2302.01409v1 [cs.CV])
    Learning good image representations that are beneficial to downstream tasks is a challenging task in computer vision. As such, a wide variety of self-supervised learning approaches have been proposed. Among them, contrastive learning has shown competitive performance on several benchmark datasets. The embeddings of contrastive learning are arranged on a hypersphere that results in using the inner (dot) product as a distance measurement in Euclidean space. However, the underlying structure of data in many scientific fields, such as social networks, brain imaging, and computer graphics, exhibits highly non-Euclidean latent geometry. We propose a novel contrastive learning framework to learn semantic relationships in the hyperbolic space. Hyperbolic space is a continuous version of trees that naturally has the ability to model hierarchical structures and is thus beneficial for efficient contrastive representation learning. We also extend the proposed Hyperbolic Contrastive Learning (HCL) to the supervised domain and study the adversarial robustness of HCL. The comprehensive experiments show that our proposed method achieves better results on self-supervised pretraining and supervised classification, and higher robust accuracy, than baseline methods.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information which cannot be leaked. Recent work has shown for homogeneous neural networks a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    On the Robustness of Randomized Ensembles to Adversarial Perturbations. (arXiv:2302.01375v1 [cs.LG])
    Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: "When are RECs useful?", "What are their limits?", and "How do we train them?". In this work, we first demystify RECs as we derive fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong $\ell_\infty$ norm-bounded adversaries across various network architectures and datasets.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
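    For reference, the plain (non-differentiable) top-k operator described in the first sentence of the abstract above can be written as below. This is a minimal sketch of the hard operator only, not the paper's relaxed operators; the function name `hard_top_k` and the `by_magnitude` flag are illustrative choices made here.

```python
def hard_top_k(values, k, by_magnitude=False):
    """Return a k-sparse copy of `values` that keeps only the k largest entries.

    With by_magnitude=True, entries are ranked by absolute value instead.
    Ties are broken in favour of the earliest index (sorted() is stable).
    """
    if not 0 <= k <= len(values):
        raise ValueError("k must be between 0 and len(values)")
    score = abs if by_magnitude else (lambda v: v)
    order = sorted(range(len(values)), key=lambda i: score(values[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


top2 = hard_top_k([0.5, 2.0, -1.0, 3.0], k=2)
top2_mag = hard_top_k([0.5, 2.0, -4.0, 3.0], k=2, by_magnitude=True)
```

    A small change in the input can swap which entries are kept, which is the discontinuity that makes this operator hard to train through with backpropagation.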
    Accelerating Policy Gradient by Estimating Value Function from Prior Computation in Deep Reinforcement Learning. (arXiv:2302.01399v1 [cs.LG])
    This paper investigates the use of prior computation to estimate the value function to improve sample efficiency in on-policy policy gradient methods in reinforcement learning. Our approach is to estimate the value function from prior computations, such as from the Q-network learned in DQN or the value function trained for different but related environments. In particular, we learn a new value function for the target task while combining it with a value estimate from the prior computation. Finally, the resulting value function is used as a baseline in the policy gradient method. This use of a baseline has the theoretical property of reducing variance in gradient computation and thus improving sample efficiency. The experiments show the successful use of prior value estimates in various settings and improved sample efficiency in several tasks.  ( 2 min )
    Personalized Understanding of Blood Glucose Dynamics via Mobile Sensor Data. (arXiv:2302.01400v1 [cs.HC])
    Continuous Blood Glucose (CGM) monitors have revolutionized the ability of diabetics to manage their blood glucose, and paved the way for artificial pancreas systems. In this paper we augment CGM data with sensor input collected by a smart phone and use it to provide analytical tools for patients and clinicians. We collected GPS data, activity classifications, and blood glucose data with a custom iOS application over a 9 month period from a single free-living type-1 diabetic patient. This data set is novel in terms of its size, the inclusion of GPS data, and the fact that it was collected non-intrusively from a free-living patient. We describe a method to measure the occurrence of lifestyle events based on GPS and activity data, and show that they can capture instances of food consumption and are therefore correlated with changes in blood glucose. Finally, we incorporate these event representations into our system to create useful visualizations and notifications to aid patients in managing their diabetes.  ( 2 min )
    Learning with Exposure Constraints in Recommendation Systems. (arXiv:2302.01377v1 [cs.LG])
    Recommendation systems are dynamic economic systems that balance the needs of multiple stakeholders. A recent line of work studies incentives from the content providers' point of view. Content providers, e.g., vloggers and bloggers, contribute fresh content and rely on user engagement to create revenue and finance their operations. In this work, we propose a contextual multi-armed bandit setting to model the dependency of content providers on exposure. In our model, the system receives a user context in every round and has to select one of the arms. Every arm is a content provider who must receive a minimum number of pulls every fixed time period (e.g., a month) to remain viable in later rounds; otherwise, the arm departs and is no longer available. The system aims to maximize the users' (content consumers) welfare. To that end, it should learn which arms are vital and ensure they remain viable by subsidizing arm pulls if needed. We develop algorithms with sub-linear regret, as well as a lower bound that demonstrates that our algorithms are optimal up to logarithmic factors.  ( 2 min )
    Neural Network Architecture for Database Augmentation Using Shared Features. (arXiv:2302.01374v1 [cs.LG])
    The popularity of learning from data with machine learning and neural networks has led to the creation of many new datasets for almost every problem domain. However, even within a single domain, these datasets are often collected with disparate features, sampled from different sub-populations, and recorded at different time points. Even with the plethora of individual datasets, large data science projects can be difficult as it is often not trivial to merge these smaller datasets. Inherent challenges in some domains such as medicine also make it very difficult to create large single source datasets or multi-source datasets with identical features. Instead of trying to merge these non-matching datasets directly, we propose a neural network architecture that can provide data augmentation using features common between these datasets. Our results show that this style of data augmentation can work for both image and tabular data.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator are applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction models and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in the Prostate Cancer Prevention Trial, led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
  • Open

    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v3 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into a training, validation, and evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We instead propose an algorithm to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparable to bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    Realizable Learning is All You Need. (arXiv:2111.04746v3 [cs.LG] UPDATED)
    The equivalence of realizable and agnostic learnability is a fundamental phenomenon in learning theory. With variants ranging from classical settings like PAC learning and regression to recent trends such as adversarially robust learning, it's surprising that we still lack a unified theory; traditional proofs of the equivalence tend to be disparate, and rely on strong model-specific assumptions like uniform convergence and sample compression. In this work, we give the first model-independent framework explaining the equivalence of realizable and agnostic learnability: a three-line blackbox reduction that simplifies, unifies, and extends our understanding across a wide variety of settings. This includes models with no known characterization of learnability such as learning with arbitrary distributional assumptions and more general loss functions, as well as a host of other popular settings such as robust learning, partial learning, fair learning, and the statistical query model. More generally, we argue that the equivalence of realizable and agnostic learning is actually a special case of a broader phenomenon we call property generalization: any desirable property of a learning algorithm (e.g. noise tolerance, privacy, stability) that can be satisfied over finite hypothesis classes extends (possibly in some variation) to any learnable hypothesis class.  ( 2 min )
    Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. (arXiv:2210.02720v2 [cs.LG] UPDATED)
    Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the algorithmic perspective, that is, the algorithms of GR that efficiently improve the performance. In this study, we first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of GR. Next, we show that the finite-difference computation also works better in the sense of generalization performance. We theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias toward the so-called rich regime and that finite-difference computation strengthens this bias. Furthermore, finite-difference GR is closely related to some other algorithms based on iterative ascent and descent steps for exploring flat minima. In particular, we reveal that the flooding method can perform finite-difference GR in an implicit way. Thus, this work broadens our understanding of GR for both practice and theory.  ( 2 min )
    New Machine Learning Techniques for Simulation-Based Inference: InferoStatic Nets, Kernel Score Estimation, and Kernel Likelihood Ratio Estimation. (arXiv:2210.01680v2 [stat.ML] UPDATED)
    We propose an intuitive, machine-learning approach to multiparameter inference, dubbed the InferoStatic Networks (ISN) method, to model the score and likelihood ratio estimators in cases when the probability density can be sampled but not computed directly. The ISN uses a backend neural network that models a scalar function called the inferostatic potential $\varphi$. In addition, we introduce new strategies, respectively called Kernel Score Estimation (KSE) and Kernel Likelihood Ratio Estimation (KLRE), to learn the score and the likelihood ratio functions from simulated data. We illustrate the new techniques with some toy examples and compare to existing approaches in the literature. We mention en passant some new loss functions that optimally incorporate latent information from simulations into the training procedure.  ( 2 min )
    Beyond Invariance: Test-Time Label-Shift Adaptation for Distributions with "Spurious" Correlations. (arXiv:2211.15646v2 [stat.ML] UPDATED)
    Spurious correlations, or correlations that change across domains where a model can be deployed, present significant challenges to real-world applications of machine learning models. However, such correlations are not always "spurious"; often, they provide valuable prior information for a prediction. Here, we present a test-time adaptation method that exploits the spurious correlation phenomenon, in contrast to recent approaches that attempt to eliminate spurious correlations through invariance. We consider situations where the prior distribution $p(y, z)$, which models the dependence between the class label $y$ and the "nuisance" factors $z$, may change across domains, but the generative model for features $p(\mathbf{x}|y, z)$ is constant. We note that this corresponds to an expanded version of the label shift assumption, where the labels now also include the nuisance factors $z$. Based on this observation, we train a classifier to predict $p(y, z|\mathbf{x})$ on the source distribution, and propose a test-time label shift correction that adapts to changes in the marginal distribution $p(y, z)$ using unlabeled samples from the target domain. We evaluate our method, which we call "Test-Time Label-Shift Adaptation" (TTLSA), on two different image datasets -- the CheXpert chest X-ray dataset and the Colored MNIST dataset -- and show a significant improvement over baseline methods. Code reproducing experiments is available at https://github.com/nalzok/test-time-label-shift .  ( 2 min )
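    The label-shift correction mentioned in the abstract above rests on a standard identity: if p(x|y, z) is unchanged across domains, a source-domain posterior can be adapted by reweighting with the ratio of target to source priors and renormalizing. The sketch below shows only that reweighting step over a flattened list of (y, z) classes. It assumes the target prior is already known, whereas the paper estimates it from unlabeled target samples; the function name and the numbers are hypothetical.

```python
def adapt_posterior(source_posterior, source_prior, target_prior):
    """Reweight p_source(c | x) by p_target(c) / p_source(c) and renormalize.

    Each list is indexed by a joint class c = (y, z), flattened to one axis.
    """
    weighted = [
        p * t / s
        for p, s, t in zip(source_posterior, source_prior, target_prior)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical two-class example: the classifier leans toward class 0,
# but class 1 is far more common in the target domain.
source_posterior = [0.7, 0.3]
source_prior = [0.5, 0.5]
target_prior = [0.1, 0.9]
adapted = adapt_posterior(source_posterior, source_prior, target_prior)
```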
    Statistical treatment of convolutional neural network super-resolution of inland surface wind for subgrid-scale variability quantification. (arXiv:2211.16708v2 [physics.ao-ph] UPDATED)
    Machine learning models have been employed to perform either physics-free data-driven or hybrid dynamical downscaling of climate data. Most of these implementations operate over relatively small downscaling factors because of the challenge of recovering fine-scale information from coarse data. This limits their compatibility with many global climate model outputs, often available between $\sim$50--100 km resolution, to scales of interest such as cloud resolving or urban scales. This study systematically examines the capability of convolutional neural networks (CNNs) to downscale surface wind speed data over land surface from different coarse resolutions (25 km, 48 km, and 100 km resolution) to 3 km. For each downscaling factor, we consider three CNN configurations that generate super-resolved predictions of fine-scale wind speed, which take between 1 and 3 input fields: coarse wind speed, fine-scale topography, and diurnal cycle. In addition to fine-scale wind speeds, probability density function parameters are generated, through which sample wind speeds can be generated accounting for the intrinsic stochasticity of wind speed. For generalizability assessment, CNN models are tested on regions with different topography and climate that are unseen during training. The evaluation of super-resolved predictions focuses on subgrid-scale variability and the recovery of extremes. Models with coarse wind and fine topography as inputs exhibit the best performance compared with other model configurations, operating across the same downscaling factor. Our diurnal cycle encoding results in lower out-of-sample generalizability compared with other input configurations.  ( 2 min )
    Consistent Range Approximation for Fair Predictive Modeling. (arXiv:2212.10839v2 [cs.LG] UPDATED)
    This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.  ( 2 min )
    FiT: Parameter Efficient Few-shot Transfer Learning for Personalized and Federated Image Classification. (arXiv:2206.08671v2 [stat.ML] UPDATED)
    Modern deep learning systems are increasingly deployed in situations such as personalization and federated learning where it is necessary to support i) learning on small amounts of data, and ii) communication efficient distributed training protocols. In this work, we develop FiLM Transfer (FiT) which fulfills these requirements in the image classification setting by combining ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. The resulting parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. We experiment with FiT on a wide range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters. Finally, we demonstrate the parameter efficiency and superior accuracy of FiT in distributed low-shot applications including model personalization and federated learning where model update size is an important performance metric.  ( 2 min )
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v3 [stat.ME] UPDATED)
    We consider the estimation of average treatment effects in observational studies and propose a new framework of robust causal inference with unobserved confounders. Our approach is based on distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case, and we demonstrate how the proposed methodology can address a primary challenge of the marginal sensitivity model that it produces uninformative results when unobserved confounders substantially affect treatment and outcome. Specifically, we develop an alternative sensitivity model, called the distributional sensitivity model, under the assumption that heterogeneity of treatment effect due to unobserved variables is relatively small. Unlike the marginal sensitivity model, the distributional sensitivity model allows for potential lack of overlap and often produces informative bounds even when unobserved variables substantially affect both treatment and outcome. Finally, we show how to extend the distributional sensitivity model to difference-in-differences designs and settings with instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology.  ( 2 min )
    Learning Counterfactually Invariant Predictors. (arXiv:2207.09768v2 [cs.LG] UPDATED)
    Counterfactual invariance has proven an essential property for predictors that are fair, robust, and generalizable in the real world. We propose a general definition of counterfactual invariance and provide simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), based on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various types of data including tabular, high-dimensional, and real-world datasets.  ( 2 min )
    DEUP: Direct Epistemic Uncertainty Prediction. (arXiv:2102.08501v4 [cs.LG] UPDATED)
    Epistemic Uncertainty is a measure of the lack of knowledge of a learner which diminishes with more evidence. While existing work focuses on using the variance of the Bayesian posterior due to parameter uncertainty as a measure of epistemic uncertainty, we argue that this does not capture the part of lack of knowledge induced by model misspecification. We discuss how the excess risk, which is the gap between the generalization error of a predictor and the Bayes predictor, is a sound measure of epistemic uncertainty which captures the effect of model misspecification. We thus propose a principled framework for directly estimating the excess risk by learning a secondary predictor for the generalization error and subtracting an estimate of aleatoric uncertainty, i.e., intrinsic unpredictability. We discuss the merits of this novel measure of epistemic uncertainty, and highlight how it differs from variance-based measures of epistemic uncertainty and addresses its major pitfall. Our framework, Direct Epistemic Uncertainty Prediction (DEUP) is particularly interesting in interactive learning environments, where the learner is allowed to acquire novel examples in each round. Through a wide set of experiments, we illustrate how existing methods in sequential model optimization can be improved with epistemic uncertainty estimates from DEUP, and how DEUP can be used to drive exploration in reinforcement learning. We also evaluate the quality of uncertainty estimates from DEUP for probabilistic image classification and predicting synergies of drug combinations.  ( 2 min )
    Data Representativity for Machine Learning and AI Systems. (arXiv:2203.04706v2 [stat.ML] UPDATED)
    Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.  ( 2 min )
    From Robustness to Privacy and Back. (arXiv:2302.01855v1 [cs.LG])
    We study the relationship between two desiderata of algorithms in statistical inference and machine learning: differential privacy and robustness to adversarial data corruptions. Their conceptual similarity was first observed by Dwork and Lei (STOC 2009), who noted that private algorithms satisfy robustness, and gave a general method for converting robust algorithms to private ones. However, all general methods for transforming robust algorithms into private ones lead to suboptimal error rates. Our work gives the first black-box transformation that converts any adversarially robust algorithm into one that satisfies pure differential privacy. Moreover, we show that for any low-dimensional estimation task, applying our transformation to an optimal robust estimator results in an optimal private estimator. Thus, we conclude that for any low-dimensional task, the optimal error rate for $\varepsilon$-differentially private estimators is essentially the same as the optimal error rate for estimators that are robust to adversarial corruption of $1/\varepsilon$ training samples. We apply our transformation to obtain new optimal private estimators for several high-dimensional tasks, including Gaussian (sparse) linear regression and PCA. Finally, we present an extension of our transformation that leads to approximate differentially private algorithms whose error does not depend on the range of the output space, which is impossible under pure differential privacy.  ( 2 min )
    Sample Complexity of Probability Divergences under Group Symmetry. (arXiv:2302.01915v1 [math.ST])
    We rigorously quantify the improvement in the sample complexity of variational divergence estimations for group-invariant distributions. In the cases of the Wasserstein-1 metric and the Lipschitz-regularized $\alpha$-divergences, the reduction of sample complexity is proportional to an ambient-dimension-dependent power of the group size. For the maximum mean discrepancy (MMD), the improvement of sample complexity is more nuanced, as it depends on not only the group size but also the choice of kernel. Numerical simulations verify our theories.  ( 2 min )
    Certified Robustness of Learning-based Static Malware Detectors. (arXiv:2302.01757v1 [cs.CR])
    Certified defenses are a recent development in adversarial machine learning (ML), which aim to rigorously guarantee the robustness of ML models to adversarial perturbations. A large body of work studies certified defenses in computer vision, where $\ell_p$ norm-bounded evasion attacks are adopted as a tractable threat model. However, this threat model has known limitations in vision, and is not applicable to other domains -- e.g., where inputs may be discrete or subject to complex constraints. Motivated by this gap, we study certified defenses for malware detection, a domain where attacks against ML-based systems are a real and current threat. We consider static malware detection systems that operate on byte-level data. Our certified defense is based on the approach of randomized smoothing which we adapt by: (1) replacing the standard Gaussian randomization scheme with a novel deletion randomization scheme that operates on bytes or chunks of an executable; and (2) deriving a certificate that measures robustness to evasion attacks in terms of generalized edit distance. To assess the size of robustness certificates that are achievable while maintaining high accuracy, we conduct experiments on malware datasets using a popular convolutional malware detection model, MalConv. We are able to accurately classify 91% of the inputs while being certifiably robust to any adversarial perturbations of edit distance 128 bytes or less. By comparison, an existing certification of up to 128 bytes of substitutions (without insertions or deletions) achieves an accuracy of 78%. In addition, given that robustness certificates are conservative, we evaluate practical robustness to several recently published evasion attacks and, in some cases, find robustness beyond certified guarantees.  ( 2 min )
    Using Explainability to Inform Statistical Downscaling Based on Deep Learning Beyond Standard Validation Approaches. (arXiv:2302.01771v1 [stat.ML])
    Deep learning (DL) has emerged as a promising tool to downscale climate projections at regional-to-local scales from large-scale atmospheric fields following the perfect-prognosis (PP) approach. Given their complexity, it is crucial to properly evaluate these methods, especially when applied to changing climatic conditions where the ability to extrapolate/generalise is key. In this work, we intercompare several DL models extracted from the literature for the same challenging use-case (downscaling temperature in the CORDEX North America domain) and expand standard evaluation methods building on eXplainable artificial intelligence (XAI) techniques. We show how these techniques can be used to unravel the internal behaviour of these models, providing new evaluation dimensions and aiding in their diagnosis and design. These results show the usefulness of incorporating XAI techniques into statistical downscaling evaluation frameworks, especially when working with large regions and/or under climate change conditions.  ( 2 min )
    Leveraging a Probabilistic PCA Model to Understand the Multivariate Statistical Network Monitoring Framework for Network Security Anomaly Detection. (arXiv:2302.01759v1 [stat.ML])
    Network anomaly detection is a very relevant research area nowadays, especially due to its multiple applications in the field of network security. The boost of new models based on variational autoencoders and generative adversarial networks has motivated a reevaluation of traditional techniques for anomaly detection. It is, however, essential to be able to understand these new models from the perspective of the experience attained from years of evaluating network security data for anomaly detection. In this paper, we revisit anomaly detection techniques based on PCA from a probabilistic generative model point of view, and contribute a mathematical model that relates them. Specifically, we start with the probabilistic PCA model and explain its connection to the Multivariate Statistical Network Monitoring (MSNM) framework. MSNM was recently successfully proposed as a means of incorporating industrial process anomaly detection experience into the field of networking. We have evaluated the mathematical model using two different datasets. The first, a synthetic dataset created to better understand the analysis proposed, and the second, UGR'16, is a specifically designed real-traffic dataset for network security anomaly detection. We have drawn conclusions that we consider to be useful when applying generative models to network security detection.  ( 2 min )
    ResMem: Learn what you can and memorize the rest. (arXiv:2302.01576v1 [cs.LG])
    The impressive generalization performance of modern neural networks is attributed in part to their ability to implicitly memorize complex training patterns. Inspired by this, we explore a novel mechanism to improve model generalization via explicit memorization. Specifically, we propose the residual-memorization (ResMem) algorithm, a new method that augments an existing prediction model (e.g. a neural network) by fitting the model's residuals with a $k$-nearest neighbor based regressor. The final prediction is then the sum of the original model and the fitted residual regressor. By construction, ResMem can explicitly memorize the training labels. Empirically, we show that ResMem consistently improves the test set generalization of the original prediction model across various standard vision and natural language processing benchmarks. Theoretically, we formulate a stylized linear regression problem and rigorously show that ResMem results in a more favorable test risk over the base predictor.  ( 2 min )
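    The residual-memorization idea in the abstract above can be sketched in a few lines: compute the base model's residuals on the training set, fit a nearest-neighbor regressor to them, and add its output to the base prediction. The version below is a minimal one-dimensional sketch that assumes an unweighted k-NN average and a hypothetical linear base model; the paper's regressor may differ in distance measure and weighting.

```python
def knn_residual(x, train_x, residuals, k):
    """Average the residuals of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(residuals[i] for i in nearest) / k


def resmem_predict(x, base_model, train_x, train_y, k=1):
    """Base prediction plus a k-NN estimate of the base model's residual."""
    residuals = [y - base_model(xi) for xi, y in zip(train_x, train_y)]
    return base_model(x) + knn_residual(x, train_x, residuals, k)


# Hypothetical data: the base model underfits some of the labels.
def base(x):
    return 2.0 * x


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.5, 1.5, 4.5, 6.0]

# With k=1, the prediction at a training input reproduces its label.
memorized = resmem_predict(2.0, base, train_x, train_y, k=1)
```

    With k=1 the combined predictor returns the training label at every training input, which matches the abstract's remark that the method can memorize the training labels by construction.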
    Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits. (arXiv:2302.01544v1 [cs.LG])
    In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.  ( 2 min )
    Beyond the Universal Law of Robustness: Sharper Laws for Random Features and Neural Tangent Kernels. (arXiv:2302.01629v1 [stat.ML])
    Machine learning models are vulnerable to adversarial perturbations, and a thought-provoking paper by Bubeck and Sellke has analyzed this phenomenon through the lens of over-parameterization: smoothly interpolating the data requires significantly more parameters than simply memorizing it. However, this "universal" law provides only a necessary condition for robustness, and it is unable to discriminate between models. In this paper, we address these gaps by focusing on empirical risk minimization in two prototypical settings, namely, random features and the neural tangent kernel (NTK). We prove that, for random features, the model is not robust for any degree of over-parameterization, even when the necessary condition coming from the universal law of robustness is satisfied. In contrast, for even activations, the NTK model meets the universal lower bound, and it is robust as soon as the necessary condition on over-parameterization is fulfilled. This also addresses a conjecture in prior work by Bubeck, Li and Nagaraj. Our analysis decouples the effect of the kernel of the model from an "interaction matrix", which describes the interaction with the test data and captures the effect of the activation. Our theoretical results are corroborated by numerical evidence on both synthetic and standard datasets (MNIST, CIFAR-10).  ( 2 min )
    Using natural language processing and structured medical data to phenotype patients hospitalized due to COVID-19. (arXiv:2302.01536v1 [cs.CL])
    To identify patients who are hospitalized because of COVID-19 as opposed to those who were admitted for other indications, we compared the performance of different computable phenotype definitions for COVID-19 hospitalizations that use different types of data from the electronic health records (EHR), including structured EHR data elements, provider notes, or a combination of both data types. We conducted a retrospective data analysis using chart review-based validation. Participants were 586 hospitalized individuals who tested positive for SARS-CoV-2 during January 2022. We used natural language processing to incorporate data from provider notes and LASSO regression and Random Forests to fit classification algorithms that incorporated structured EHR data elements, provider notes, or a combination of structured data and provider notes. Results: Based on a chart review, 38% of 586 patients were determined to be hospitalized for reasons other than COVID-19 despite having tested positive for SARS-CoV-2. A classification algorithm that used provider notes had significantly better discrimination than one that used structured EHR data elements (AUROC: 0.894 vs 0.841, p < 0.001), and performed similarly to a model that combined provider notes with structured data elements (AUROC: 0.894 vs 0.893). Assessments of hospital outcome metrics significantly differed based on whether the population included all hospitalized patients who tested positive for SARS-CoV-2 versus those who were determined to have been hospitalized due to COVID-19. This work demonstrates the utility of natural language processing approaches to derive information related to patient hospitalizations in cases where there may be multiple conditions that could serve as the primary indication for hospitalization.  ( 3 min )
    Where and How to Improve Graph-based Spatio-temporal Predictors. (arXiv:2302.01701v1 [stat.ML])
    This paper introduces a novel residual correlation analysis, called AZ-analysis, to assess the optimality of spatio-temporal predictive models. The proposed AZ-analysis constitutes a valuable asset for discovering and highlighting those space-time regions where the model can be improved with respect to performance. The AZ-analysis operates under very mild assumptions and is based on a spatio-temporal graph that encodes serial and functional dependencies in the data; asymptotically distribution-free summary statistics identify existing residual correlation in space and time regions, hence localizing time frames and/or communities of sensors, where the predictor can be improved.  ( 2 min )
    A Lipschitz Bandits Approach for Continuous Hyperparameter Optimization. (arXiv:2302.01539v1 [cs.LG])
    One of the most critical problems in machine learning is HyperParameter Optimization (HPO), since the choice of hyperparameters has a significant impact on final model performance. Although there are many HPO algorithms, they either have no theoretical guarantees or require strong assumptions. To this end, we introduce BLiE -- a Lipschitz-bandit-based algorithm for HPO that only assumes Lipschitz continuity of the objective function. BLiE exploits the landscape of the objective function to adaptively search over the hyperparameter space. Theoretically, we show that $(i)$ BLiE finds an $\epsilon$-optimal hyperparameter with $O \left( \left( \frac{1}{\epsilon} \right)^{d_z + \beta} \right)$ total budget, where $d_z$ and $\beta$ are problem intrinsic; $(ii)$ BLiE is highly parallelizable. Empirically, we demonstrate that BLiE outperforms the state-of-the-art HPO algorithms on benchmark tasks. We also apply BLiE to search for the noise schedule of diffusion models. Comparison with the default schedule shows that the BLiE schedule greatly improves the sampling speed.  ( 2 min )
    Support Recovery in Sparse PCA with Non-Random Missing Data. (arXiv:2302.01535v1 [stat.ML])
    We analyze a practical algorithm for sparse PCA on incomplete and noisy data under a general non-random sampling scheme. The algorithm is based on a semidefinite relaxation of the $\ell_1$-regularized PCA problem. We provide theoretical justification that under certain conditions, we can recover the support of the sparse leading eigenvector with high probability by obtaining a unique solution. The conditions involve the spectral gap between the largest and second-largest eigenvalues of the true data matrix, the magnitude of the noise, and the structural properties of the observed entries. The concepts of algebraic connectivity and irregularity are used to describe the structural properties of the observed entries. We empirically justify our theorem with synthetic and real data analysis. We also show that our algorithm outperforms several other sparse PCA approaches especially when the observed entries have good structural properties. As a by-product of our analysis, we provide two theorems to handle a deterministic sampling scheme, which can be applied to other matrix-related problems.  ( 2 min )
    Failure-informed adaptive sampling for PINNs, Part II: combining with re-sampling and subset simulation. (arXiv:2302.01529v1 [math.NA])
    This is the second part of our series of works on failure-informed adaptive sampling for physics-informed neural networks (FI-PINNs). In our previous work \cite{gao2022failure}, we presented an adaptive sampling framework that uses the failure probability as the posterior error indicator, where the truncated Gaussian model was adopted for estimating the indicator. In this work, we present two novel extensions to FI-PINNs. The first extension consists in combining with a re-sampling technique, so that the new algorithm can maintain a constant training size. This is achieved through cosine annealing, which gradually transforms the sampling of collocation points from uniform to adaptive as training progresses. The second extension is to present the subset simulation algorithm as the posterior model (instead of the truncated Gaussian model) for estimating the error indicator, which can more effectively estimate the failure probability and generate new effective training points in the failure region. We investigate the performance of the new approach using several challenging problems, and numerical experiments demonstrate a significant improvement over the original algorithm.  ( 2 min )
    Dataset Distillation Fixes Dataset Reconstruction Attacks. (arXiv:2302.01428v1 [cs.LG])
    Modern deep learning requires large volumes of data, which could contain sensitive or private information that must not be leaked. Recent work has shown that, for homogeneous neural networks, a large portion of this training data could be reconstructed with only access to the trained network parameters. While the attack was shown to work empirically, there exists little formal understanding of its effectiveness regime, and ways to defend against it. In this work, we first build a stronger version of the dataset reconstruction attack and show how it can provably recover its entire training set in the infinite width regime. We then empirically study the characteristics of this attack on two-layer networks and reveal that its success heavily depends on deviations from the frozen infinite-width Neural Tangent Kernel limit. More importantly, we formally show for the first time that dataset reconstruction attacks are a variation of dataset distillation. This key theoretical result on the unification of dataset reconstruction and distillation not only sheds more light on the characteristics of the attack but enables us to design defense mechanisms against them via distillation algorithms.  ( 2 min )
    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective. (arXiv:2302.01425v1 [cs.LG])
    The top-k operator returns a k-sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-k operators. We view the top-k operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a p-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows, for example, expressing top-k operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.  ( 2 min )
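    For reference, the hard (discontinuous) top-k operator that the abstract above starts from can be written as a simple mask. The sketch below implements only that baseline, including the variant that selects by magnitude; it is not the paper's differentiable relaxation, and the function name and `by_magnitude` flag are hypothetical.

```python
def top_k(values, k, by_magnitude=False):
    """Return a k-sparse copy of `values`, keeping only the k largest entries.

    With by_magnitude=True, entries are ranked by absolute value instead.
    Ties are broken in favor of the earlier index.
    """
    if by_magnitude:
        def rank(i):
            return abs(values[i])
    else:
        def rank(i):
            return values[i]
    keep = set(sorted(range(len(values)), key=rank, reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


x = [0.5, -3.0, 2.0, 1.0]
largest = top_k(x, 2)                       # keeps 2.0 and 1.0
largest_abs = top_k(x, 2, by_magnitude=True)  # keeps -3.0 and 2.0
```

    A small change in the input can swap which entries are kept, so the output jumps discontinuously; that is the obstacle to backpropagation the abstract describes.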
    Augmented Learning of Heterogeneous Treatment Effects via Gradient Boosting Trees. (arXiv:2302.01367v1 [stat.ML])
    Heterogeneous treatment effects (HTE) based on patients' genetic or clinical factors are of significant interest to precision medicine. Simultaneously modeling HTE and corresponding main effects for randomized clinical trials with high-dimensional predictive markers is challenging. Motivated by the modified covariates approach, we propose a two-stage statistical learning procedure for estimating HTE with optimal efficiency augmentation, generalizing to arbitrary interaction models and exploiting powerful extreme gradient boosting trees (XGBoost). Target estimands for HTE are defined in the scale of mean difference for quantitative outcomes, or risk ratio for binary outcomes, which are the minimizers of specialized loss functions. The first stage is to estimate the main-effect equivalency of the baseline markers on the outcome, which is then used as an augmentation term in the second stage estimation for HTE. The proposed two-stage procedure is robust to model mis-specification of main effects and improves efficiency for estimating HTE through nonparametric function estimation, e.g., XGBoost. A permutation test is proposed for global assessment of evidence for HTE. An analysis of a genetic study in the Prostate Cancer Prevention Trial, led by the SWOG Cancer Research Network, is conducted to showcase the properties and the utilities of the two-stage method.  ( 2 min )
    Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen's f2. (arXiv:2302.01407v1 [stat.ME])
    Deep artificial neural networks show high predictive performance in many fields, but they do not afford statistical inferences and their black-box operations are too complicated for humans to comprehend. Because positing that a relationship exists is often more important than prediction in scientific experiments and research models, machine learning is far less frequently used than inferential statistics. Additionally, statistics calls for improving the test of theory by showing the magnitude of the phenomena being studied. This article extends current XAI methods and develops a model agnostic hypothesis testing framework for machine learning. First, Fisher's variable permutation algorithm is tweaked to compute an effect size measure equivalent to Cohen's f2 for OLS regression models. Second, the Mann-Kendall test of monotonicity and the Theil-Sen estimator is applied to Apley's accumulated local effect plots to specify a variable's direction of influence and statistical significance. The usefulness of this approach is demonstrated on an artificial data set and a social survey with a Python sandbox implementation.  ( 2 min )
  • Open

    You.com released v2 of YouChat, adding multimedia content to their chat agent for search
    submitted by /u/quanik_314 [link] [comments]  ( 40 min )
    New AI Streamer 24/7
    https://www.twitch.tv/ai_media Help me to learn by chatting with me! submitted by /u/BlueBug02 [link] [comments]  ( 40 min )
  • Open

    [P] ChatGPT without size limits: upload any pdf and apply any prompt to it
    Hi all! I created a simple free tool where you can summarize and query documents of any size and estimate the cost to do so: https://www.wrotescan.com You can edit the prompts as well as automatically chunk and combine documents. There's also a cost estimator for any pdf you upload. Let me know if you want me to run some examples for you! Send me a pdf and tell me what you'd like summarized or extracted.
    Tips:
    - Please be sure to keep {text} in both prompts or the program will not input your document's text into the map reduce summarizer. {text} can only appear once in each prompt. It is where the text from each chunk to be summarized is input into the prompts.
    - Create a temporary OpenAI key / org to use with this site so you do not have to provide credit card information, then be sure to delete the temp key when you are done.
    Learnings (some interesting learnings I had while creating the tool):
    - Minimizing the number of steps through the AI improved summarization, so map reduce was often better than a more advanced refine workflow which passes the output through the model many more times.
    - LangChain is great for managing multiple step language model calls and bypassing the current limitations of ChatGPT.
    submitted by /u/aicharades [link] [comments]  ( 50 min )
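The map reduce workflow with a single {text} placeholder per prompt could look roughly like the sketch below. It is not the tool's actual code: `chunk_text`, `fill_prompt`, `map_reduce_summarize`, and the `call_model` callback are hypothetical names, and chunking by character count is an assumption made to keep the example self-contained.

```python
def chunk_text(text, max_chars):
    """Split the document into consecutive chunks of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def fill_prompt(template, text):
    """Insert text at the {text} placeholder, which must appear exactly once."""
    if template.count("{text}") != 1:
        raise ValueError("prompt must contain {text} exactly once")
    return template.replace("{text}", text)


def map_reduce_summarize(document, map_prompt, reduce_prompt, call_model, max_chars=2000):
    """Map: summarize each chunk. Reduce: summarize the joined chunk summaries.

    call_model is any function that takes a prompt string and returns a string,
    for example a thin wrapper around a language model API (assumed, not shown).
    """
    partials = [call_model(fill_prompt(map_prompt, chunk))
                for chunk in chunk_text(document, max_chars)]
    return call_model(fill_prompt(reduce_prompt, "\n".join(partials)))


if __name__ == "__main__":
    # Stand-in "model" that just reports the prompt length.
    fake_model = lambda prompt: f"<summary of {len(prompt)} chars>"
    print(map_reduce_summarize("lorem ipsum " * 500,
                               "Summarize: {text}", "Combine: {text}", fake_model))
```

Each chunk passes through the model once, plus one final combining call, which matches the observation above that fewer passes through the model tended to help.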

  • Open

    12 highlights from Google's BARD announcement
    I went through the entire blog post from Google and pulled out some quotes and highlights: ​ 1) “we re-oriented the company around AI six years ago” Right off the bat, “Pich-AI” lets it be known that Google is now an AI company. Partially true? Yes, of course. Would that phrase be coming out of his mouth at this point if not for the release and success of ChatGPT? No. 2) their mission: “organize the world’s information and make it universally accessible and useful” There’s a book called The Innovator’s Dilemma: When New Technologies Cause Great Firms to Fail. I'm certainly not here to say that Google is going to fail, but the re-stating of the mission makes it clear that they view AI (and Bard) as a way to improve, supplement, and perhaps protect their search business. This is …  ( 45 min )
    Weekly China AI News: Baidu's Language Model Behind Rumored ChatGPT Search; Tencent-Backed Robot Startup Files for Hong Kong IPO; Xpeng Targets "Full Autonomy" in 2023
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    The Blair Witch Project come to life with Midjourney as an 80's Horror Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    Harry Potter come to life with Midjourney as an 80s Love Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    submitted by /u/VR_Angel [link] [comments]  ( 41 min )
    Over the weekend I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! Enjoy!
    submitted by /u/VR_Angel [link] [comments]  ( 40 min )
    Built a Telegram AI tutor bot + updates
    Hey! Since first posting here we've got 800+ users taking almost 1,000 courses! In short, this is how the bot works: 1️⃣ Send the captain a topic - usually, one or a few words is enough 2️⃣ Get a mini-course divided into 5 chapters 3️⃣ Receive your content packed into a beautiful magazine-style sharable link Would love to know what you think! http://edwardbot.com/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    AI Dream 144 - DARK CLOUD AI FUSION - MINDBLOW MONDAY
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Google launches ChatGPT competitor "Bard" and more
    submitted by /u/Peaking_AI [link] [comments]  ( 40 min )
    Change My Mind: You either believe consciousness is biomechanistic and therefore replicable, or you believe consciousness comes from some supernatural source and therefore not replicable. If it's the former, then you should believe AI is/can be sentient.
    I'll just try and add some detail in case the title doesn't make sense for some. The two major options I see: You believe consciousness arises from a very complicated system of biology > cells > molecules > atoms > elements. Ultimately a mechanistic view that, metaphorically, life is complicated machinery. If not the former, then it seems the only other major option is that life is somehow endowed some supernatural or divine force that humans could never synthetically replicate. This is of course generally considered a spiritual/religious belief. So if you believe in item 1, doesn't this mean you pretty much have to believe that AI is or can have consciousness? submitted by /u/sidianmsjones [link] [comments]  ( 44 min )
    OpenAI Have Started a Search Engine Revolution.
    submitted by /u/shauryadevil [link] [comments]  ( 40 min )
    AI Seinfeld show banned on Twitch for transphobic comments
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 40 min )
    Gen-1: The Next Step Forward for Generative AI - Use words and images to generate new videos out of existing ones
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Spotify's Founder Has Developed An AI-Powered Body Health Scanner
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    runway announces GEN-1 — video to video generative AI
    submitted by /u/AR_MR_XR [link] [comments]  ( 40 min )
    Integrate OpenAI with .NET Core and Angular15
    submitted by /u/TheDotnetoffice [link] [comments]  ( 40 min )
    Google invests 500 million dollars in the next rival of ChatGPT
    submitted by /u/nikesh96 [link] [comments]  ( 41 min )
    Seinfeld AI makes transgender joke and gets banned on twitch
    AI Seinfeld Transphobic rant - YouTube submitted by /u/Status_Signal_4083 [link] [comments]  ( 42 min )
    Hi-ResNet: High resolution image classifier. (448, 896, 1792 sq.px.)
    submitted by /u/johnGettings [link] [comments]  ( 41 min )
    Spotify’s Founder Has Developed An AI-powered Body Health Scanner
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 41 min )
    ChatFAI: Chat with your favorite characters (updates and a challenge)
    ​ Characters as shown on https://chatfai.com/characters/ Hi everyone! I have recently made some exciting changes to my ChatFAI web app. The public characters library is now live - it's now easy to share and install public characters. Added a regenerate reply option. Created a new plan without any daily limit. I have gotten a lot of help and support from this community. The feedback and support from you all are really helpful and that is how I am improving ChatFAI (based on the feedback and suggestions). So, here I am again. What do you think about the latest updates? Is it going in the right direction? Another challenge I have not resolved yet is finding B2B use cases for ChatFAI. Thank you for your help and support - it's greatly appreciated! submitted by /u/usamaejazch [link] [comments]  ( 42 min )
    Streamline Ticket Triage and Reduce Customer Churn with AI
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 44 min )
  • Open

    MC vs TD(0) on windy gridworld
    Hi, I'm relatively new to the field of RL but have strong experience in DL. I'm currently studying the Sutton & Barto book, and more specifically the TD(0) method. At some point they discuss a "windy" gridworld example: we have the normal gridworld setup, but there is an upward wind on some tiles that may move the agent upwards by as many squares as indicated at the bottom of each column. Here is a schematic of the setup: [Figure: Windy Gridworld example from the Sutton & Barto book] If you perform an action that would lead to the agent falling off the grid, the agent remains in the state it was in. E.g., if the agent is in the top-left square and performs the action `up`, it will remain in the top-left square. They essentially use this to present SARSA, but then they mention the following about MC methods: "Note that Monte Carlo methods cannot easily be used here because termination is not guaranteed for all policies. If a policy was ever found that caused the agent to stay in the same state, then the next episode would never end." This strikes me as quite odd. Does this mean that I cannot (or that it's not safe to) use MC methods in any problem where a transition from a state can leave the agent in that same state? The first thing that comes to mind is the normal gridworld example (i.e. without any wind): does the restriction they mention here mean that MC is not safe to use there either? submitted by /u/dep0 [link] [comments]  ( 42 min )
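The quoted concern can be reproduced with a small sketch of the environment. The grid size, wind strengths, start, and goal follow the book's example; clipping moves at the border, the -1 reward per step, and the helper names `step` and `rollout` are assumptions of this sketch. A deterministic policy that always moves `up` gets stuck in the top-left corner, so an episode under that policy only ends if a step cap is imposed from outside.

```python
ROWS, COLS = 7, 10
WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]   # upward push per column
START, GOAL = (3, 0), (3, 7)            # (row, col), row 0 is the top row
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Apply the action plus the wind of the current column, clipped to the grid."""
    row, col = state
    d_row, d_col = MOVES[action]
    new_row = min(max(row + d_row - WIND[col], 0), ROWS - 1)
    new_col = min(max(col + d_col, 0), COLS - 1)
    next_state = (new_row, new_col)
    return next_state, -1, next_state == GOAL


def rollout(policy, max_steps):
    """Run one episode; return its length, or None if the step cap is hit."""
    state = START
    for t in range(max_steps):
        state, _, done = step(state, policy(state))
        if done:
            return t + 1
    return None


if __name__ == "__main__":
    always_up = lambda state: "up"
    print(rollout(always_up, max_steps=1000))   # None: the episode never terminates
```

Monte Carlo updates need the complete return of an episode, so such a policy would block learning, whereas SARSA and TD(0) update after every step and can learn during the episode that the policy is poor. With an episode length cap or a policy that keeps exploring, episodes do end and MC remains usable.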
    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following:

    ```python
    import tlol.gym as leaguegym

    env = leaguegym.make()
    while True:
        obs = env.reset()
        done = False
        while not done:
            # Here we sample a random action. If you have an agent, you would get an action from it here.
            action = env.action_space.sample()
            next_obs, reward, done, gameinfo = env.step(action)
            obs = next_obs
    ```

    If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 42 min )
    Why the sim2real problem in robotic manipulation?
    Hi all, assume the task is opening a door with a robot. As far as I understand, the sim2real problem arises because the robot behaves differently in the real world: the physics in the simulator (where the agent is trained) are not 100% identical to those of the real world. From my understanding, the sim2real problem occurs if we let the agent also handle the controller part. But why can't we just extract the manipulator trajectory that the agent generates to open the door and execute it with the real-world controller? Am I missing something here? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
    Question on return values of the .step() method in a multi-agent environment
    Is it possible that, in a multi-agent environment, the return values for reward, terminated, and truncated of a .step(actions) method call have different values for each agent? If so, is there an example environment? submitted by /u/Toni-SM [link] [comments]  ( 41 min )
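One common convention, used for example by PettingZoo's parallel API, is for `step` to return dictionaries keyed by agent, so each agent can receive its own reward, terminated, and truncated values. The toy environment below only mimics that return shape; `ToyTagEnv` and its rules are hypothetical and not taken from any library.

```python
class ToyTagEnv:
    """Hypothetical two-agent game: the chaser scores when both pick the same cell."""

    def __init__(self, max_steps=3):
        self.agents = ["runner", "chaser"]
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {agent: 0 for agent in self.agents}

    def step(self, actions):
        """actions maps each agent name to an integer cell choice."""
        self.steps += 1
        caught = actions["runner"] == actions["chaser"]
        time_up = self.steps >= self.max_steps

        observations = {agent: self.steps for agent in self.agents}
        # Per-agent rewards: the outcome is good for one agent and bad for the other.
        rewards = {"runner": -1.0 if caught else 0.1,
                   "chaser": 1.0 if caught else -0.1}
        # Only the runner's episode ends when it is caught.
        terminations = {"runner": caught, "chaser": False}
        # The time limit truncates every agent that has not already terminated.
        truncations = {"runner": time_up and not caught, "chaser": time_up}
        infos = {agent: {} for agent in self.agents}
        return observations, rewards, terminations, truncations, infos


if __name__ == "__main__":
    env = ToyTagEnv()
    env.reset()
    print(env.step({"runner": 1, "chaser": 1}))
```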
    I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implementations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with ReLU and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indicators about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google Colab with three different learning rates I…  ( 46 min )
    Why is my PPO algorithm not learning a simple environment?
    I have made a Stack Overflow post here. I will highly appreciate all your help on this. Thank you! submitted by /u/Academic-Rent7800 [link] [comments]  ( 42 min )
    Does it make sense to use RL for trading?
    I have seen some blog posts and papers about using RL for financial trading. I have to be honest, I didn't read that stuff in detail. However, the main idea seems kind of clear: you can model the market as an MDP where the state space encodes the relevant features of the market and your current portfolio, the possible actions are what/how much to sell/buy, and the reward function should express the value change of your portfolio. However, I am a bit puzzled. Clearly your actions as a trader do not really affect the market that much, meaning that the transition function (probability distribution for the next state) does not depend on your action (excluding the changes in your portfolio). Why would RL provide any advantages over more classical approaches? I might be missing something, but formalizing the trading problem as a (contextual) multi-armed bandit seems more reasonable to me. submitted by /u/AdministrativeBank48 [link] [comments]  ( 48 min )
  • Open

    [Discussion] League of Legends Reinforcement Learning Library - Interest
    Hello everyone. I am considering making a reinforcement learning library for the most recent version of League of Legends based on discussions on an existing library here. Would there be any interest in an RL library for League of Legends? The interface would work like the following:

    ```python
    import tlol.gym as leaguegym

    env = leaguegym.make()
    while True:
        obs = env.reset()
        done = False
        while not done:
            # Here we sample a random action. If you have an agent, you would get an action from it here.
            action = env.action_space.sample()
            next_obs, reward, done, gameinfo = env.step(action)
            obs = next_obs
    ```

    If you look at my previous posts, I have already created an RL environment for League of Legends v4.20 where other people have also taken the project and successfully trained agents which learn adversarially against each other here. I've also released many gameplay datasets for League of Legends during Season 12 here also for supervised learning and RL. At the moment tlol-py contains an interface for ML models to play League of Legends but I'm considering creating a purpose built library for RL for League of Legends. Would there be interest in a project like this? submitted by /u/Ok-Alps-7918 [link] [comments]  ( 43 min )
    [Project] Need Suggestions Improving the Model evaluation scores.
    Hi, I'm working on a project where we have to classify a user as High or Low Income. The dataset contains 9,000+ features and 30,000 observations (rows), each representing a household. The features include people's media consumption habits: hourly, weekly, monthly, and yearly for different TV channels. So far I have tried SVC, Random Forest, and Logistic Regression, and I used an ensemble of these three. However, I haven't been able to get past 63% accuracy. I tried PCA, but accuracy, recall, and precision all stay between 61% and 63%. I do want to add that the data is already scaled between 0 and 1 and most of the columns are sparse (0 values for many rows). Honestly, I have tried pretty much everything, but can't seem to raise the evaluation metrics. Can someone point me in the right direction on what I can do to improve the scores? submitted by /u/Toko_yami [link] [comments]  ( 43 min )
    [P] I have implemented an RL agent for trading EUR/USD and I don't know what to do next...
    So, after months of learning about RL and doing toy implementations, I have coded a DQN, with experience buffer and dual nets. The network design is like the most average thing you can come across in ML scene. A simple deep feed forward with ReLU and Linear as activation functions. I have also coded a simplified version of the Forex market for my agent to train in. It has bid ask prices, leverage, call margin, and buy/sell/not-in-the-market positions. The whole given state to the model is nothing fancy. It is merely the historical, model's balance and a few binary indicators about the environment. Since I'm cripplingly poor, I don't have any specialized hardware for training the model. After burning like 100 hours into the free version of Google Colab with three different learning rates I…  ( 45 min )
    [Project] I used a new ML algo called "AnimeSR" to restore the Cowboy Bebop movie and up rez it to full 4K. Here's a link to the end result - honestly think it looks amazing! (Video and Model link in post)
    It took me about 46 hours to run this on my 3080 at home. The original files was from the Blu-ray release that was unfortunately pretty poorly done in my opinion. This version really gives it new life I think. Here's a link to the video result to see for yourself: https://vimeo.com/796411232 And a link to the model I used! https://github.com/TencentARC/AnimeSR submitted by /u/VR_Angel [link] [comments]  ( 43 min )
    [P] Looking for string generation GAN
    I have had minimal luck finding documentation on creating or using a premade string generator. It can't be a text generator really... because I am building it for translation from one language to another. I want to teach the generator to produce guesses on what the best translation for a single word would be based on underlying language semantics. I don't need it to be accurate necessarily, just a point-of-reference for observing language and phonetic mechanics. submitted by /u/lullaby876 [link] [comments]  ( 42 min )
    [D] What techniques can I use to tell if a problem is likely enough to be solved by ML so as to justify compiling the dataset?
    I have a problem that if I solve it with ML, I'll make money, with an outside chance of it being a lot of money. Compiling a dataset will take significant work. Are there any techniques that I can apply to let me know if this is going to be worth it? Perhaps there are certain hallmarks that a problem would have if it is likely to be solvable with available data? Maybe something I can do with a small initial dataset? Thanks. submitted by /u/SnuggleWuggleSleep [link] [comments]  ( 43 min )
    [N] Google: An Important Next Step On Our AI Journey
    https://blog.google/technology/ai/bard-google-ai-search-updates/ submitted by /u/EducationalCicada [link] [comments]  ( 50 min )
    [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement
    From the article: Getty Images has filed a lawsuit in the US against Stability AI, creators of open-source AI art generator Stable Diffusion, escalating its legal battle against the firm. The stock photography company is accusing Stability AI of “brazen infringement of Getty Images’ intellectual property on a staggering scale.” It claims that Stability AI copied more than 12 million images from its database “without permission ... or compensation ... as part of its efforts to build a competing business,” and that the startup has infringed on both the company’s copyright and trademark protections. This is different from the UK-based news from weeks ago. submitted by /u/Wiskkey [link] [comments]  ( 44 min )
    Does the high dimensionality of AI systems that model the real world tell us something about the abstract space of ideas? [D]
    Physical world we live in has 4 dimensions, string theory posits like up to 10. It seems like in order to successfully model the abstract space of ideas which relates things in the physical world to each other and describes them, machine learning needs thousands of dimensions. Also to the extent that ML algos/matrices can be made sparse, that seems to me to tell us something about the density of the mapping between abstract space and physical space... anyone know any papers w/this line of thinking? It also seems a bit unintuitive to me because it seems like geometrically space gets exponentially more complicated as you add dimensions but ML scales linearly or better in many cases with matrix dimensionality. submitted by /u/Frumpagumpus [link] [comments]  ( 44 min )
    [P] Forecasting methods in Time Series
    Hi all! For the longest time, I was having issues understanding how to use time series to do forecasting. Over the last few weeks, I have been writing a series of posts to guide anyone through the process! I am also in the process of writing a detailed practical guide with step-by-step instructions. Right now I have 6 articles on the topic:
    * Introduction to ARIMA models (https://mlpills.dev/time-series/introduction-to-arima-models/)
    * Parameters selection in ARIMA models (https://mlpills.dev/time-series/parameters-selection-in-arima-models/)
    * Seasonal ARIMA (https://mlpills.dev/time-series/seasonal-arima/)
    * ARCH / GARCH models for Time Series (https://mlpills.dev/time-series/arch-garch-models-for-time-series/)
    * ARIMA-GARCH models (https://mlpills.dev/time-series/arima-garch-models/)
    * And today's -> Forecasting in Time Series (https://mlpills.dev/time-series/forecasting-in-time-series/)
    Let me know if there are any topics that you would like me to cover in the future! submitted by /u/daansan-ml [link] [comments]  ( 43 min )
    Which strategies, frameworks, and application tools can be implemented to automatically monitor the health of a machine learning model? [D]
    When machine learning models are deployed in a production environment, model degradation can arise: their output changes if the relationship between the incoming serving data and the predicted target drifts apart. Can someone please briefly elaborate on which strategies, frameworks, and application tools can be implemented to automatically monitor the health of the model and alert the data scientist to any decay in data quality, data drift, or model quality? submitted by /u/astronaut1971 [link] [comments]  ( 42 min )
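One simple building block for this kind of monitoring is a per-feature drift check that compares serving data against a training-time reference sample. The sketch below computes a two-sample Kolmogorov-Smirnov statistic with the standard library only; the function names and the 0.2 alert threshold are assumptions for illustration, not recommended values, and a real setup would also track data quality and model metrics.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two non-empty samples (0.0 to 1.0)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in sorted(set(ref) | set(cur)):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_alert(reference, current, threshold=0.2):
    """Return True when the feature's distribution has moved past the threshold."""
    return ks_statistic(reference, current) > threshold


if __name__ == "__main__":
    training_ages = [23, 35, 41, 29, 52, 38, 45, 31]
    serving_ages = [61, 58, 66, 70, 59, 63, 72, 65]
    if drift_alert(training_ages, serving_ages):
        print("feature 'age' drifted: notify the data scientist")
```

Running such a check on a schedule for every input feature and for the model's predictions gives an automatic alert without needing ground-truth labels, which often arrive late in production.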
    [P] I made image clustering and captioning tools
    I made image captioning and clustering tools for computer vision and diffusion projects. You can run almost everything automatically with a simple CLI command. All contributions are welcome. https://github.com/cobanov/image-clustering https://github.com/cobanov/image-captioning submitted by /u/metover [link] [comments]  ( 42 min )
    High-speed cameras and deep learning [Research]
    I haven’t been able to find research on deep learning using high-speed cameras that capture images at frame rates higher than 250fps. I wonder if they are rather useless for image/video processing or do any of you have any ideas about potential applications. submitted by /u/A15L [link] [comments]  ( 44 min )
    [R] Research trends in Graph Neural Networks (GNN)
    Deep connections discovered between Graph Diffusion Networks and Partial Differential Equations modelling heat transfer. https://towardsdatascience.com/graph-neural-networks-as-neural-diffusion-pdes-8571b8c0c774 https://arxiv.org/abs/2106.10934 Strange connections uncovered between GNNs and Structural Causal Models. https://arxiv.org/abs/2109.04173 https://www.youtube.com/watch?v=XC-Bfg3dO0I GNNs used to enhance the factualness of LLMs by providing embeddings from Knowledge Graphs (KEs). https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00360/98089 GNNs used to categorize objects from only their 3D mesh. https://arxiv.org/pdf/2106.15778.pdf Prediction of intuitive physics among physical objects. https://proceedings.neurips.cc/paper/2016/hash/3147da8ab4a0437c15ef51a5cc7f2dc4-Abstract.html Zero-shot generalization in robot Task Planning. https://arxiv.org/abs/2102.13177 https://www.youtube.com/watch?v=POxaTDAj7aY submitted by /u/moschles [link] [comments]  ( 42 min )
    [R] Creating a Large Language Model of a Philosopher
    Paper : https://arxiv.org/abs/2302.01339 Abstract : Can large language models be trained to produce philosophical texts that are difficult to distinguish from texts produced by human philosophers? To address this question, we fine-tuned OpenAI's GPT-3 with the works of philosopher Daniel C. Dennett as additional training data. To explore the Dennett model, we asked the real Dennett ten philosophical questions and then posed the same questions to the language model, collecting four responses for each question without cherry-picking. We recruited 425 participants to distinguish Dennett's answer from the four machine-generated answers. Experts on Dennett's work (N = 25) succeeded 51% of the time, above the chance rate of 20% but short of our hypothesized rate of 80% correct. For two of the ten questions, the language model produced at least one answer that experts selected more frequently than Dennett's own answer. Philosophy blog readers (N = 302) performed similarly to the experts, while ordinary research participants (N = 98) were near chance distinguishing GPT-3's responses from those of an "actual human philosopher". submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [P] I Made a Text Bot Powered by ChatGPT, DALLE 2, and Wolfram Alpha
    submitted by /u/ImplodingCoding [link] [comments]  ( 43 min )
    [R] deep learning and session-specific rapid recalibration for dynamic hand gesture recognition from EMG
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 44 min )
    [D] AtheneWins just showcased an AI streamer bot, Does anyone know how he did this?
    submitted by /u/imagoons [link] [comments]  ( 42 min )
    [D] RNN and S4 etc
    Hello what's the state of modern RNNs, why does S4 not use nonlinearity on the state vector? What happened to unitary RNN or independent RNN (which sounds like exponential moving average)? submitted by /u/windoze [link] [comments]  ( 42 min )
  • Open

    Automating the math for decision-making under uncertainty
    A new tool brings the benefits of AI programming to a much broader class of problems.  ( 8 min )
  • Open

    Create powerful self-service experiences with Amazon Lex on Talkdesk CX Cloud contact center
    This blog post is co-written with Bruno Mateus, Jonathan Diedrich and Crispim Tribuna at Talkdesk. Contact centers are using artificial intelligence (AI) and natural language processing (NLP) technologies to build a personalized customer experience and deliver effective self-service support through conversational bots. This is the first of a two-part series dedicated to the integration of […]  ( 8 min )
    Image classification model selection using Amazon SageMaker JumpStart
    Researchers continue to develop new model architectures for common machine learning (ML) tasks. One such task is image classification, where images are accepted as input and the model attempts to classify the image as a whole with object label outputs. With many models available today that perform this image classification task, an ML practitioner may […]  ( 11 min )
  • Open

    Is there any value in having non uniform activation functions in the hidden layer?
    I wrote my first neural network implementation following a tutorial. I was amazed that it was less than 140 lines of code. A thought occurred to me and I brought it to ChatGPT first, but I'd like some human input from people who know better. I'm wondering what the pros and cons would be of having a network where each node in the hidden layer chooses its own activation function at initialization, so the function may differ per node. ChatGPT was polite in its responses, but it seems very clear that this is a bad idea for most common goals with a neural network. I'm making little creatures that just kind of exist, adding more inputs and outputs as I think of them. So I wondered recently: 1. Would there be any benefit from one node using a sigmoid and another node using ReLU? 2. I know it would really screw up the network, but what would happen if I also had a small chance for the activation function of a node to change when creating the next generation? It's baby's first simulation where the best nodes survive to make a new generation. Since I'm using a sort of evolution model, I'm not worried about "bad" training results or overly complicated brains, the two points ChatGPT kept stressing no matter how I asked it. Does anyone have some thoughts they can share on this? It's possible that the question is hard to answer because it's so stupid / pointless for what a neural network would be used for. I get that impression from Stack Overflow answers to similar questions, hah. I don't have a goal, I just want to watch the little critters exist and struggle. submitted by /u/Tomnnn [link] [comments]  ( 44 min )
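Both ideas in the post are easy to prototype in an evolutionary setup, since no gradients are needed. The sketch below is one possible way to do it, not the poster's code: `MixedLayer`, `mutated_copy`, and the 5% mutation rate are hypothetical, and only the activation functions (not the weights) are mutated to keep the example short.

```python
import math
import random

ACTIVATIONS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "relu": lambda x: max(0.0, x),
    "tanh": math.tanh,
}


class MixedLayer:
    """Hidden layer in which every node draws its own activation function."""

    def __init__(self, n_inputs, n_nodes, rng):
        self.weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                        for _ in range(n_nodes)]
        self.biases = [rng.uniform(-1, 1) for _ in range(n_nodes)]
        self.act_names = [rng.choice(sorted(ACTIVATIONS)) for _ in range(n_nodes)]

    def forward(self, inputs):
        outputs = []
        for w, b, name in zip(self.weights, self.biases, self.act_names):
            z = sum(wi * xi for wi, xi in zip(w, inputs)) + b
            outputs.append(ACTIVATIONS[name](z))
        return outputs

    def mutated_copy(self, rng, rate=0.05):
        """Child for the next generation: each node re-draws its activation with probability rate."""
        child = MixedLayer.__new__(MixedLayer)
        child.weights = [row[:] for row in self.weights]
        child.biases = self.biases[:]
        child.act_names = [rng.choice(sorted(ACTIVATIONS)) if rng.random() < rate else name
                           for name in self.act_names]
        return child


if __name__ == "__main__":
    rng = random.Random(0)
    parent = MixedLayer(n_inputs=3, n_nodes=4, rng=rng)
    child = parent.mutated_copy(rng)
    print(parent.act_names, child.act_names)
    print(parent.forward([0.5, -0.2, 0.1]))
```

Because selection acts on whole creatures, a node whose activation mutates into something unhelpful simply makes that creature less likely to survive, so the choice of activation becomes one more trait under selection.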
    Create a fake person
    Hi. I don’t know if this is a good place to ask this question, but I have to do a project for my uni and I am a bit confused. The topic is "creating a fake avatar". I need to create a fake person based on images of other people and make it "alive". By alive I mean that I can create many pictures of this person in different situations. I know I need to use a GAN, but I just can’t get my head around how to do it. First I need a neural network to create a fake person. But how do I use this fake person to create different scenarios? Thank you in advance for any help. submitted by /u/Acrobatic_Ad6507 [link] [comments]  ( 42 min )
  • Open

    It’s No Big Deal, but ChatGPT Changes Everything – Part III
    “I’ll tell you the problem with the scientific power that you’re using here: it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses… The post It’s No Big Deal, but ChatGPT Changes Everything – Part III appeared first on Data Science Central.  ( 24 min )
    Guide to Best Flutter State Management Libraries for 2023
    Flutter is a free, open-source mobile user interface (UI) framework that Google developed in 2017. It allows users to create native mobile applications using a single codebase. Using a single codebase and one programming language, users can develop apps for two different platforms, Android and iOS. It’s considered the most effective cross-platform framework available. Flutter… The post Guide to Best Flutter State Management Libraries for 2023 appeared first on Data Science Central.  ( 21 min )
    Ensuring Data Security in Realtime Operating System (RTOS) Devices
    Just a few days ago, January 28, we celebrated Data Protection Day, an international event aimed at promoting data privacy and security. In line with the goal of raising awareness about data protection, it would be a good time to discuss data security with Realtime Operating System. This unconventional operating system is widely used, so… Read More »Ensuring Data Security in Realtime Operating System (RTOS) Devices The post Ensuring Data Security in Realtime Operating System (RTOS) Devices appeared first on Data Science Central.  ( 21 min )
  • Open

    AI Joins Hunt for ET: Study Finds 8 Potential Alien Signals
    A University of Toronto undergrad among an international team of researchers unleashing deep learning in the search for extraterrestrial civilizations.  ( 6 min )

  • Open

    [D] Overview of Chatbot Research?
    Is there a good overview of the state of chatbot research? I'm wondering if the ChatGPT approach of big LLM + RLHF is now considered the only way forward? How about alternatives like BlenderBot3? And what are the best open source chatbots right now? Or if you can't create your own ChatGPT, how does using a GPT3 sized model + prompt engineering compare to smaller models with supervised fine tuning on a conversation dataset? submitted by /u/renbid [link] [comments]  ( 42 min )
    [D] Large language models (LLM) as priority / conflict resolver for embodied AI or in general
    I wanted to discuss the possibilities of using LLMs to generate answers based on context and to resolve conflicts. Some recent work leverages LLMs in robotics planning; for example, Language Models as Zero-Shot Planners uses an LLM to generate plans for a robot. What are your views on LLMs that leverage background knowledge and visual cues together to generate the correct next action for robots or embodied systems? As humans, we decide on actions by resolving priorities or conflicts based on rules/concepts. Can an LLM take these rules/concepts explicitly into account in decision making to generate a new set of actions? Example: while a robot is chopping veggies, if a hand comes in between, the robot should stop chopping, because the chopping task and the presence of a human hand are in conflict, and the safety of the hand has higher priority than cutting. How can such small pieces of knowledge be encoded in these robotics systems to make them safer and more trustworthy in general, given that LLMs require a large corpus of knowledge/data? submitted by /u/projekt_treadstone [link] [comments]  ( 43 min )
    [D] Is there a database of English language tokens, including all dictionary words and common word segments?
    I find it odd that I have to regenerate this from my input set each time. It should be something we can just start with pre-created. submitted by /u/MrOfficialCandy [link] [comments]  ( 42 min )
    [D] Is English the optimal language to train NLP models on?
    While the greatest amount of training content is available for English at the moment, it seems unlikely to me that it's an efficient language to train AI on. A more optimal language would reduce training time and model size. It might, for example, be much more efficient to train AI on Chinese, Korean, or Japanese due to a reduced grammatical token set when constructing sentences/ideas. But taking the idea further, I wonder if we should be using a human language at all. Perhaps it's more efficient to use something altogether new in order to both communicate with AI more exactingly and also to reduce model size/training. What do y'all think? submitted by /u/MrOfficialCandy [link] [comments]  ( 44 min )
    [R] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [P] Interactive Map of NeurIPS Proceedings 1987-2022
    submitted by /u/NomicAI [link] [comments]  ( 42 min )
    [P] I made CoPilot for writing LaTeX (in Overleaf) - what do you think?
    submitted by /u/alistairmcleay [link] [comments]  ( 42 min )
    [P] NeuralFit now allows evolution of recurrent neural networks
    Hi all! Some time ago I made a post about NeuralFit (https://github.com/neural-fit/neuralfit), which allows you to evolve neural networks in 🐍 Python with just a few lines of code. Good news: it is now possible to evolve recurrent models on timeseries, which is useful for doing stock market predictions for example 📈. There is currently a simple example, but more examples will be added soon! In addition: the free limitations have been increased, so you can use NeuralFit for most hobby projects without needing a license. Just like last time, feedback is immensely appreciated! submitted by /u/wagenaartje [link] [comments]  ( 43 min )
    [R] Can anyone direct me to academic sources arguing that Big Tech using AI for targeted Social Media ads is a good thing for actual users?
    Been struggling to find sources relating to this, it’s mostly just tech websites or blogs I keep coming across. I’m struggling to find any academic papers arguing for specifically the use of user data to create targeted ads. submitted by /u/lara_lara24 [link] [comments]  ( 43 min )
    [D] How Machine Learning is Transforming Cybersecurity
    submitted by /u/DenofBlerds [link] [comments]  ( 42 min )
    [R] [D] PADL: Language-Directed Physics-Based Character Control by NVIDIA
    submitted by /u/WarmFormal9881 [link] [comments]  ( 42 min )
    [R] [D] The New XOR Problem
    submitted by /u/shawntan [link] [comments]  ( 45 min )
    [P] I made a browser extension that uses ChatGPT to answer every StackOverflow question
    submitted by /u/jsonathan [link] [comments]  ( 46 min )
    Are PhDs in statistics useful for ML research? [D]
    How much do research labs or research jobs in ML hire statisticians vs computer scientists or mathematicians? submitted by /u/AdFew4357 [link] [comments]  ( 44 min )
    Why not use Stable Diffusion’s VAE to get textual embeddings? [D]
    submitted by /u/sudo_fuck_you [link] [comments]  ( 42 min )
    [D] Does the M2 Max 30-core GPU have any advantage over M2 Pro 19-core GPU in Machine Learning Tasks?
    submitted by /u/dona6603 [link] [comments]  ( 42 min )
    [P] tradeslyPro - AI Roboadvisor
    submitted by /u/mrtkp9993 [link] [comments]  ( 42 min )
    [D] List of Large Language Models to play with.
    Hello! I'm trying to understand what available LLMs one can "relatively easily" play with. My goal is to understand the landscape since I haven't worked in this field before. I'm trying to run them "from the largest to the smallest". By "relatively easy", I mean it doesn't require setting up a GPU cluster or cost more than $20 :) Here are some examples I have found so far: ChatGPT (obviously) - 175B params; OpenAI API to access GPT-3s (from ada (0.5B) to davinci (175B)), also Codex; Bloom (176B) - the text window on that page seems to work reliably, you just need to keep pressing "generate"; OPT-175B (Facebook LLM) - the hosting works surprisingly fast, but slower than ChatGPT; several models on HuggingFace that I got running with a Colab Pro subscription: GPT-NeoX 20B, Flan-t5-xxl 11B, Xlm-roberta-xxl 10.7B, GPT-J 6B. I spent about $20 total on running the models below. None of the Hugging Face API interfaces/spaces worked for me :(. Here is an example notebook I made for NeoX. Does anyone know more models that are easily accessible? P.S. Some large models I couldn't figure out (yet) how to run easily: Galactica-120b 120B, Opt-30b 30B submitted by /u/sinavski [link] [comments]  ( 44 min )
    [N] "I got access to Google LaMDA, the Chatbot that was so realistic that one Google engineer thought it was conscious. First impressions"
    Tweet thread: https://twitter.com/WholeMarsBlog/status/1622139178439036928 First impressions: this sucks ass I can only ask about dogs and a few different types of prompts Does anyone else have experiences to share with this nerfed LaMDA beta google released? submitted by /u/That_Violinist_18 [link] [comments]  ( 44 min )
    Objects Color Matching against a Reference Standard (ColorCODEX)? [D]
    I'm trying to build and train a machine learning model that autonomously performs color matching between a target gemstone and a Reference Standard color chart. A digital photo of the target gemstone is first captured in an environment with controlled illumination and background. This digital image is then pre-processed and fed into an algorithm that recognizes its color distribution and matches it to the closest color in the Reference Standard color chart. Numerous Reference Standards exist, but I will use ColorCODEX. So I would like to know which machine learning model to use in this case to ensure high matching accuracy, which performance metric I can use to measure matching accuracy, and which color space to use for the color model. And finally, what image pre-processing needs to be done? I found this article (https://www.atlantis-press.com/proceedings/icosat-17/25895985) that uses a backpropagation NN, but I'm not sure if it is the best choice. Any other options? submitted by /u/astronaut1971 [link] [comments]  ( 44 min )
    [R] AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 42 min )
    [D] GNN Is node information required ?
    [D] Hey, it's kind of a simple one, but I'm just putting it out for opinions: when passing a graph through a graph neural network to obtain vectors for all the nodes, is the information in the nodes required? All we care about is the position of a given node in the context of the whole graph, and that's how a GNN outputs the vector for each node. Sorry if that was messy. submitted by /u/ab_11nav [link] [comments]  ( 42 min )
    What text to speech does this guy use? [R]
    https://youtu.be/ktdUeqzzhiA What text to speech does he use? He's been popping up on my YouTube feed lately, and I can see he has different voices in his videos, most of which sound robotic. What do you think is being used here? submitted by /u/candidhorse4 [link] [comments]  ( 42 min )
  • Open

    Fullmetal Alchemist as an 80's Dark Fantasy movie (Chapters in Descripti...
    submitted by /u/EIDANart [link] [comments]  ( 40 min )
    AI text generator for news
    Hey all! For an art installation I want to be able to generate war news in specific countries and have a specific word used in it. For instance, a 2-3 sentence news about the war in Yemen with the word "love" in it. Any AI platforms you can suggest? submitted by /u/jarjar_bigh [link] [comments]  ( 41 min )
    GPT3-Assisted Google search, document/video/audio/website/youtube video indexer and composer conveniently built into Discord!
    submitted by /u/yikeshardware [link] [comments]  ( 42 min )
    AI Ethics, Deep Fakes, and the Dark Side of the Algorithm
    submitted by /u/IndependenceFun4627 [link] [comments]  ( 40 min )
    AI Dream 156 - This EPIC AI Video might break Youtube
    submitted by /u/LordPewPew777 [link] [comments]  ( 41 min )
    This web can make voices only with AI
    https://voice.ai/r/AfpKl. This website can create very good quality AI voices. It mostly has voices of famous people. To use these voices, you only need to write something, and the AI will say it in the voice you choose. submitted by /u/Marcosio9083 [link] [comments]  ( 41 min )
    Most advanced AI for clothing try-on?
    Some AI developers already teamed up with big firms to create AI for virtual clothing try-on a few years ago, but considering that the latest models can generate images of very realistic humans (including the garments on them) from nothing, I suppose there must be practical ones that can generate a better-quality try-on image given a series of photos of a model and the clothes to put on. Do you know any such models? submitted by /u/PMATHbreaker [link] [comments]  ( 41 min )
    Is Microsoft's Azure OpenAI service generally 'cost effective' for a small number of specialist users in a small business?
    Imagine a type of user who is a senior manager and does not have an analysis background. What I'd like is for them to be able to ask a dashboard "what happened to our customer reviews last week?" and have a text summary generated, parameters sent out to charts, and the charts generated. I'm exploring the aspect of cost here. I read Azure's OpenAI marketing and see: > Azure OpenAI Service is enabling customers across industries from health care to financial services to manufacturing to quickly perform an array of tasks. Innovations include generating unique content for customers, summarizing and classifying customer feedback, and extracting text from medical records to streamline billing. The most common uses have been writing assistance, translating natural language to code and gaining data insights through search, entity extraction, sentiment and classification. Great. But the pricing looks expensive for Base Series Fine-tuned, Curie: > Inferencing per 1,000 tokens | $0.002 > Hosting per hour | $0.24 So I'd need to think about some intermediary caching stage to prevent too many calls to the Azure OpenAI API. So is using Azure here for an AI-based solution cost effective for a small business? submitted by /u/Work_Owl [link] [comments]  ( 42 min )
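A rough way to sanity-check the pricing quoted in the post above is to turn the two rates into a monthly figure. The sketch below uses only the two prices from the post ($0.002 per 1,000 tokens, $0.24 per hosting hour); the usage numbers (queries per day, tokens per query) and the assumption that hosting is billed around the clock are hypothetical and would need to be checked against the current Azure pricing terms.

```python
# Rates quoted in the post for a fine-tuned Curie deployment.
INFERENCE_USD_PER_1K_TOKENS = 0.002
HOSTING_USD_PER_HOUR = 0.24


def monthly_cost(queries_per_day, tokens_per_query, days=30, hosted_hours_per_day=24):
    """Estimate monthly spend in USD.

    Assumes (hypothetically) that the deployment is hosted for
    `hosted_hours_per_day` hours every day and that every query
    consumes `tokens_per_query` tokens (prompt plus completion).
    """
    hosting = HOSTING_USD_PER_HOUR * hosted_hours_per_day * days
    tokens = queries_per_day * tokens_per_query * days
    inference = tokens / 1000 * INFERENCE_USD_PER_1K_TOKENS
    return {"hosting": hosting, "inference": inference, "total": hosting + inference}


# Hypothetical usage: a handful of managers asking 50 questions a day.
estimate = monthly_cost(queries_per_day=50, tokens_per_query=1500)
for name, usd in estimate.items():
    print(f"{name}: ${usd:.2f}")
```

Under these assumptions the hourly hosting charge dominates the bill, so caching responses would reduce the inference part only; the hosting part stays fixed regardless of call volume.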
    Universities are acting like a fish out of the water due to AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    Starting a career in AI
    Hello folks, I’m a Product Manager with 10 years of experience in B2C mobile apps. I love mobile apps and I’m really passionate about user experience. However, I recently started using various AI products and I’m really impressed. Since I believe it is the future, I was wondering if it’s possible to start learning the basics and even make a career change into AI development (as soon as I learn what fields exist). I have studied computer science and I’m familiar with software development. Any ideas/recommendations on where to start and what a possible career path would be? Thanks submitted by /u/EquivalentMongoose95 [link] [comments]  ( 42 min )
    Technology Readiness Levels (TRL) in AI development
    How can we move from an idea to production in AI? Do technology readiness levels (TRL) help? If you want some answers, please read this article on Medium: https://medium.com/towards-artificial-intelligence/technology-readiness-levels-trl-in-ai-development-c6ed1190fbd6 All ideas are more than welcome! submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 40 min )
    Request: Voice cloning app. (Website or app)
    I am looking for a free voice cloning app or website (Like elevenlabs, but I would need a quota bypass for that) and I can't find any. Any suggestions? submitted by /u/1sydxyz [link] [comments]  ( 40 min )
    I have a question for you? I am in need of specific AI tool I am not aware if exists yet.
    I am searching for an image and I need to find where it comes from. It can be anywhere on the web, or part of a video. Is there an AI that you can give a name, the picture or photo itself, a sentence, or other specific data, and that will find the exact photo in a random YouTube video, the same name on some old site, or the photo included on another site where it may have originated? In other words, an AI that will scan all of YouTube, watch every video, and then list all the instances where the image was used. submitted by /u/Plajomzn [link] [comments]  ( 42 min )
    ChatGPT Becomes Fastest Growing App, Beating TikTok In Popularity
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    The Fast and the Furious come to life with Midjourney as an 80's Film
    submitted by /u/barrese87 [link] [comments]  ( 40 min )
    AI that lets you play as Neo in The Matrix, in an open-ended movie experience
    submitted by /u/techmanj [link] [comments]  ( 40 min )
    Why Overfitting and Underfitting Happen
    Hi guys, I have made a video on YouTube here where I explain why underfitting and overfitting happen in machine learning models by looking at the fundamental theory behind bias variance trade-off. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
    How will AI affect Day-to-day business?
    https://buddingmanager.com/2023/01/26/https-buddingmanager-com-artificialintelligenceinbusiness/ submitted by /u/OriginalRecklessPark [link] [comments]  ( 40 min )
    🌎 Make your best prediction: HOW will AI systems change the world in the coming 10 years? What will be different 10 years later, because of AI systems like ChatGPT, Midjourney, Codex, Whisper and others?
    submitted by /u/DrMelbourne [link] [comments]  ( 43 min )
    I created a stream where AI bots watch movies and deliver a running commentary
    Hi all, For my weekend project I figured I would build an AI driven spiritual successor to Mystery Science Theater 3000... Stop on by and watch the AI characters watch movies and make comments! Today they are watching "The House on Haunted Hill" and "Plan 9 From Outer Space." There's still a lot to do but I'm excited to play around with this more and see how it plays out and would love some feedback! https://twitch.tv/MysteryAItheater submitted by /u/caseigl [link] [comments]  ( 42 min )
    AI That Can Put Objects from Different Pictures Together?
    I'm trying to put together pictures of my best friend (he unfortunately passed away) and myself together. We don't have any pictures of us together since our childhood /teen days. I would like to put our individually taken pics together using AI. Can someone recommend any tools? submitted by /u/mustufa2020 [link] [comments]  ( 41 min )
    Amazing "Jailbreak" Bypasses ChatGPT's Ethics Safeguards
    submitted by /u/Mental_Character7367 [link] [comments]  ( 54 min )
    Breaking: Google Invests in AnthropicAI and Claude with $300 Million Round for 10 Percent of the A.I. Lab valued at $5 Billion
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
  • Open

    Getty Images v. Stability AI – Lawsuit filing
    submitted by /u/nickb [link] [comments]  ( 40 min )
  • Open

    How to teach the agent to arrive at the goal by creating a search pattern
    Hi all, assume the goal is to reach a ball on the table. The reward function used for this task is often based on the distance d = norm(gripper_position - ball_position), which will solve the problem. However, how can one teach the agent not to go "directly" to the ball, but to create a search pattern instead, for example, "scratching the surface with the gripper until you find the ball"? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 43 min )
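To make the contrast in the question above concrete, the sketch below shows the usual distance-based reward next to one possible search-style alternative. The alternative hides the ball position from the shaping signal and instead pays a small bonus for each table cell the gripper visits for the first time. This is only an illustrative assumption, not an established answer to the post; `CoverageBonus`, `search_reward`, `cell_size`, `contact_radius`, and `found_reward` are hypothetical names and values.

```python
import math


def distance_reward(gripper_position, ball_position):
    """Dense reward from the post: negative Euclidean distance to the ball."""
    return -math.dist(gripper_position, ball_position)


class CoverageBonus:
    """Pays a bonus the first time the gripper enters a grid cell on the table."""

    def __init__(self, cell_size=0.05, bonus=1.0):
        self.cell_size = cell_size
        self.bonus = bonus
        self.visited = set()

    def __call__(self, gripper_xy):
        cell = tuple(math.floor(c / self.cell_size) for c in gripper_xy)
        if cell in self.visited:
            return 0.0
        self.visited.add(cell)
        return self.bonus


def search_reward(gripper_position, ball_position, coverage,
                  contact_radius=0.02, found_reward=10.0):
    """Sparse 'found it' reward plus a coverage bonus while searching.

    The ball position is used only to detect contact, never to steer,
    so the agent is rewarded for sweeping the surface.
    """
    if math.dist(gripper_position, ball_position) <= contact_radius:
        return found_reward
    return coverage(gripper_position[:2])


coverage = CoverageBonus()
ball = (0.30, 0.10, 0.0)
for pos in [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.06, 0.0, 0.0), (0.30, 0.10, 0.0)]:
    print(pos, distance_reward(pos, ball), search_reward(pos, ball, coverage))
```

The `visited` set would need to be reset at the start of every episode, and the agent's observation should not include the ball position, otherwise the policy can still learn to go straight to it.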
    Autonomous Driving Off-Road | Swaayatt Robots | Dense Fog
    submitted by /u/shani_786 [link] [comments]  ( 41 min )
  • Open

    Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions. (arXiv:2211.01916v2 [cs.LG] UPDATED)
    We give an improved theoretical analysis of score-based generative modeling. Under a score estimate with small $L^2$ error (averaged across timesteps), we provide efficient convergence guarantees for any data distribution with second-order moment, by either employing early stopping or assuming smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in reverse KL divergence in $\epsilon$-accuracy can be done in $\tilde O\left(\frac{d \log (1/\delta)}{\epsilon}\right)$ steps: 1) the variance-$\delta$ Gaussian perturbation of any data distribution; 2) data distributions with $1/\delta$-smooth score functions. Our analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.
    Learning PDE Solution Operator for Continuous Modeling of Time-Series. (arXiv:2302.00854v1 [cs.LG])
    Learning underlying dynamics from data is important and challenging in many real-world scenarios. Incorporating differential equations (DEs) to design continuous networks has drawn much attention recently, however, most prior works make specific assumptions on the type of DEs, making the model specialized for particular problems. This work presents a partial differential equation (PDE) based framework which improves the dynamics modeling capability. Building upon the recent Fourier neural operator, we propose a neural operator that can handle time continuously without requiring iterative operations or specific grids of temporal discretization. A theoretical result demonstrating its universality is provided. We also uncover an intrinsic property of neural operators that improves data efficiency and model generalization by ensuring stability. Our model achieves superior accuracy in dealing with time-dependent PDEs compared to existing models. Furthermore, several numerical pieces of evidence validate that our method better represents a wide range of dynamics and outperforms state-of-the-art DE-based models in real-time-series applications. Our framework opens up a new way for a continuous representation of neural networks that can be readily adopted for real-world applications.
    Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function. (arXiv:2302.01244v1 [cs.LG])
    Probabilistic dynamics model ensemble is widely used in existing model-based reinforcement learning methods as it outperforms a single dynamics model in both asymptotic performance and sample efficiency. In this paper, we provide both practical and theoretical insights on the empirical success of the probabilistic dynamics model ensemble through the lens of Lipschitz continuity. We find that, for a value function, the stronger the Lipschitz condition is, the smaller the gap between the true dynamics- and learned dynamics-induced Bellman operators is, thus enabling the converged value function to be closer to the optimal value function. Hence, we hypothesize that the key functionality of the probabilistic dynamics model ensemble is to regularize the Lipschitz condition of the value function using generated samples. To test this hypothesis, we devise two practical robust training mechanisms through computing the adversarial noise and regularizing the value network's spectral norm to directly regularize the Lipschitz condition of the value functions. Empirical results show that combined with our mechanisms, model-based RL algorithms with a single dynamics model outperform those with an ensemble of probabilistic dynamics models. These findings not only support the theoretical insight, but also provide a practical solution for developing computationally efficient model-based RL algorithms.
    Dynamic Recognition of Speakers for Consent Management by Contrastive Embedding Replay. (arXiv:2205.08459v2 [cs.SD] UPDATED)
    Voice assistants overhear conversations and a consent management mechanism is required. Consent management can be implemented using speaker recognition. Users that do not give consent enrol their voice and all their further recordings are discarded. Building speaker recognition-based consent management is challenging as dynamic registration, removal, and re-registration of speakers must be efficiently handled. This work proposes a consent management system addressing the aforementioned challenges. A contrastive based training is applied to learn the underlying speaker equivariance inductive bias. The contrastive features for buckets of speakers are trained a few steps into each iteration and act as replay buffers. These features are progressively selected using a multi-strided random sampler for classification. Moreover, new methods for dynamic registration using a portion of old utterances, removal, and re-registration of speakers are proposed. The results verify memory efficiency and dynamic capabilities of the proposed methods and outperform the existing approach from the literature.
    Live in the Moment: Learning Dynamics Model Adapted to Evolving Policy. (arXiv:2207.12141v2 [cs.LG] UPDATED)
    Model-based reinforcement learning (RL) often achieves higher sample efficiency in practice than model-free RL by learning a dynamics model to generate samples for policy learning. Previous works learn a dynamics model that fits under the empirical state-action visitation distribution for all historical policies, i.e., the sample replay buffer. However, in this paper, we observe that fitting the dynamics model under the distribution for \emph{all historical policies} does not necessarily benefit model prediction for the \emph{current policy} since the policy in use is constantly evolving over time. The evolving policy during training will cause state-action visitation distribution shifts. We theoretically analyze how this distribution shift over historical policies affects the model learning and model rollouts. We then propose a novel dynamics model learning method, named \textit{Policy-adapted Dynamics Model Learning (PDML)}. PDML dynamically adjusts the historical policy mixture distribution to ensure the learned model can continually adapt to the state-action visitation distribution of the evolving policy. Experiments on a range of continuous control environments in MuJoCo show that PDML achieves significant improvement in sample efficiency and higher asymptotic performance combined with the state-of-the-art model-based RL methods.
    Modelling the long-term fairness dynamics of data-driven targeted help on job seekers. (arXiv:2208.08881v2 [cs.LG] UPDATED)
    The use of data-driven decision support by public agencies is becoming more widespread and already influences the allocation of public resources. This raises ethical concerns, as it has adversely affected minorities and historically discriminated groups. In this paper, we use an approach that combines statistics and data-driven approaches with dynamical modeling to assess long-term fairness effects of labor market interventions. Specifically, we develop and use a model to investigate the impact of decisions caused by a public employment authority that selectively supports job-seekers through targeted help. The selection of who receives what help is based on a data-driven intervention model that estimates an individual's chances of finding a job in a timely manner and rests upon data that describes a population in which skills relevant to the labor market are unevenly distributed between two groups (e.g., males and females). The intervention model has incomplete access to the individual's actual skills and can augment this with knowledge of the individual's group affiliation, thus using a protected attribute to increase predictive accuracy. We assess this intervention model's dynamics -- especially fairness-related issues and trade-offs between different fairness goals -- over time and compare it to an intervention model that does not use group affiliation as a predictive feature. We conclude that in order to quantify the trade-off correctly and to assess the long-term fairness effects of such a system in the real-world, careful modeling of the surrounding labor market is indispensable.
    Do Kernel and Neural Embeddings Help in Training and Generalization?. (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.
    Learning-To-Ensemble by Contextual Rank Aggregation in E-Commerce. (arXiv:2107.08598v3 [cs.LG] UPDATED)
    Ensemble models in E-commerce combine predictions from multiple sub-models for ranking and revenue improvement. Industrial ensemble models are typically deep neural networks, following the supervised learning paradigm to infer conversion rate given inputs from sub-models. However, this process has the following two problems. Firstly, the point-wise scoring approach disregards the relationships between items and leads to homogeneous displayed results, while diversified display benefits user experience and revenue. Secondly, the learning paradigm focuses on the ranking metrics and does not directly optimize the revenue. In our work, we propose a new Learning-To-Ensemble (LTE) framework RA-EGO, which replaces the ensemble model with a contextual Rank Aggregator (RA) and explores the best weights of sub-models by the Evaluator-Generator Optimization (EGO). To achieve the best online performance, we propose a new rank aggregation algorithm TournamentGreedy as a refinement of classic rank aggregators, which also produces the best average weighted Kendall Tau Distance (KTD) amongst all the considered algorithms with quadratic time complexity. Under the assumption that the best output list should be Pareto Optimal on the KTD metric for sub-models, we show that our RA algorithm has higher efficiency and coverage in exploring the optimal weights. Combined with the idea of Bayesian Optimization and gradient descent, we solve the online contextual Black-Box Optimization task that finds the optimal weights for sub-models given a chosen RA model. RA-EGO has been deployed in our online system and has improved the revenue significantly.
    Neural Design for Genetic Perturbation Experiments. (arXiv:2207.12805v2 [q-bio.QM] UPDATED)
    The problem of how to genetically modify cells in order to maximize a certain cellular phenotype has taken center stage in drug development over the last few years (with, for example, genetically edited CAR-T, CAR-NK, and CAR-NKT cells entering cancer clinical trials). Exhausting the search space for all possible genetic edits (perturbations) or combinations thereof is infeasible due to cost and experimental limitations. This work provides a theoretically sound framework for iteratively exploring the space of perturbations in pooled batches in order to maximize a target phenotype under an experimental budget. Inspired by this application domain, we study the problem of batch query bandit optimization and introduce the Optimistic Arm Elimination ($\mathrm{OAE}$) principle designed to find an almost optimal arm under different functional relationships between the queries (arms) and the outputs (rewards). We analyze the convergence properties of $\mathrm{OAE}$ by relating it to the Eluder dimension of the algorithm's function class and validate that $\mathrm{OAE}$ outperforms other strategies in finding optimal actions in experiments on simulated problems, public datasets well-studied in bandit contexts, and in genetic perturbation datasets when the regression model is a deep neural network. OAE also outperforms the benchmark algorithms in 3 of 4 datasets in the GeneDisco experimental planning challenge.
    Summarization Programs: Interpretable Abstractive Summarization with Neural Modular Trees. (arXiv:2209.10492v2 [cs.CL] UPDATED)
    Current abstractive summarization models either suffer from a lack of clear interpretability or provide incomplete rationales by only highlighting parts of the source document. To this end, we propose the Summarization Program (SP), an interpretable modular framework consisting of an (ordered) list of binary trees, each encoding the step-by-step generative process of an abstractive summary sentence from the source document. A Summarization Program contains one root node per summary sentence, and a distinct tree connects each summary sentence (root node) to the document sentences (leaf nodes) from which it is derived, with the connecting nodes containing intermediate generated sentences. Edges represent different modular operations involved in summarization such as sentence fusion, compression, and paraphrasing. We first propose an efficient best-first search method over neural modules, SP-Search, that identifies SPs for human summaries by directly optimizing for ROUGE scores. Next, using these programs as automatic supervision, we propose seq2seq models that generate Summarization Programs, which are then executed to obtain final summaries. We demonstrate that SP-Search effectively represents the generative process behind human summaries using modules that are typically faithful to their intended behavior. We also conduct a simulation study to show that Summarization Programs improve the interpretability of summarization models by allowing humans to better simulate model reasoning. Summarization Programs constitute a promising step toward interpretable and modular abstractive summarization, a complex task previously addressed primarily through blackbox end-to-end neural systems. Supporting code is available at https://github.com/swarnaHub/SummarizationPrograms
    Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning. (arXiv:2210.00226v2 [cs.LG] UPDATED)
    Federated learning aims to train models collaboratively across different clients without the sharing of data for privacy considerations. However, one major challenge for this learning paradigm is the data heterogeneity problem, which refers to the discrepancies between the local data distributions among various clients. To tackle this problem, we first study how data heterogeneity affects the representations of the globally aggregated models. Interestingly, we find that heterogeneous data results in the global model suffering from severe dimensional collapse, in which representations tend to reside in a lower-dimensional space instead of the ambient space. Moreover, we observe a similar phenomenon on models locally trained on each client and deduce that the dimensional collapse on the global model is inherited from local models. In addition, we theoretically analyze the gradient flow dynamics to shed light on how data heterogeneity results in dimensional collapse for local models. To remedy this problem caused by the data heterogeneity, we propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning. Specifically, FedDecorr applies a regularization term during local training that encourages different dimensions of representations to be uncorrelated. FedDecorr, which is implementation-friendly and computationally efficient, yields consistent improvements over baselines on standard benchmark datasets. Code: https://github.com/Yujun-Shi/FedCLS.
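    To make the idea of a decorrelation regularizer concrete, here is a minimal NumPy sketch of one plausible penalty: the mean squared entry of the correlation matrix of a batch of representations. This is an assumption for illustration, not the authors' implementation; the function name `decorrelation_penalty` and the exact normalization are hypothetical.

```python
import numpy as np


def decorrelation_penalty(z, eps=1e-8):
    """Mean squared entry of the correlation matrix of representations.

    z: array of shape (batch, dim). The value is close to 1 when all
    dimensions are perfectly correlated (collapsed) and close to 1/dim
    when they are uncorrelated.
    """
    # Standardize each dimension so that z.T @ z / batch is a correlation matrix.
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)
    corr = z.T @ z / z.shape[0]
    dim = corr.shape[0]
    return float(np.sum(corr ** 2) / dim ** 2)


rng = np.random.default_rng(0)
spread = rng.normal(size=(2000, 8))                         # roughly uncorrelated dimensions
collapsed = rng.normal(size=(2000, 1)) * np.arange(1, 9)    # rank-one representations
```

    In a training loop, a term like this would be added to the local loss with a small coefficient; here `decorrelation_penalty(collapsed)` is far larger than `decorrelation_penalty(spread)`.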
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two-layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of the student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give corresponding experimental analysis that validates the theoretical results and yields additional insights.
    A Light Recipe to Train Robust Vision Transformers. (arXiv:2209.07399v2 [cs.CV] UPDATED)
    In this paper, we ask whether Vision Transformers (ViTs) can serve as an underlying architecture for improving the adversarial robustness of machine learning models against evasion attacks. While earlier works have focused on improving Convolutional Neural Networks, we show that ViTs are also highly suitable for adversarial training to achieve competitive performance. We achieve this objective using a custom adversarial training recipe, discovered using rigorous ablation studies on a subset of the ImageNet dataset. The canonical training recipe for ViTs recommends strong data augmentation, in part to compensate for the lack of vision inductive bias of attention modules, when compared to convolutions. We show that this recipe achieves suboptimal performance when used for adversarial training. In contrast, we find that omitting all heavy data augmentation, and adding an additional bag of tricks ($\varepsilon$-warmup and larger weight decay), significantly boosts the performance of robust ViTs. We show that our recipe generalizes to different classes of ViT architectures and large-scale models on full ImageNet-1k. Additionally, investigating the reasons for the robustness of our models, we show that it is easier to generate strong attacks during training when using our recipe and that this leads to better robustness at test time. Finally, we further study one consequence of adversarial training by proposing a way to quantify the semantic nature of adversarial perturbations and highlight its correlation with the robustness of the model. Overall, we recommend that the community should avoid translating the canonical training recipes in ViTs to robust training and rethink common training choices in the context of adversarial training.
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
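    The Shapley attribution described above can be sketched on a toy cooperative game. In the example below, the two "players" are hypothetical shifts (a covariate shift and a label shift) and the value table holds made-up accuracy changes; in the paper the value of a set of distributions is estimated by importance weighting, which this sketch does not attempt.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in phi.items()}


# Hypothetical accuracy change when only the listed distributions shift.
toy_values = {
    frozenset(): 0.0,
    frozenset({"covariate"}): -0.04,
    frozenset({"label"}): -0.10,
    frozenset({"covariate", "label"}): -0.20,
}

attribution = shapley_values(["covariate", "label"], toy_values.__getitem__)
```

    The attributions sum to the total change of -0.20, splitting it as roughly -0.07 for the covariate shift and -0.13 for the label shift. Exact enumeration is factorial in the number of players, so it is only practical for a handful of distributions.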
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.
    High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance. (arXiv:2302.00999v1 [math.OC])
    During recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as the boundedness of the gradient noise variance or of the objective's gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central $\alpha$-th moment for $\alpha \in (1,2]$ in the following setups: (i) smooth non-convex / Polyak-Lojasiewicz / convex / strongly convex / quasi-strongly convex minimization problems, (ii) Lipschitz / star-cocoercive and monotone / quasi-strongly monotone variational inequalities. These results justify the usage of the considered methods for solving problems that do not fit standard functional classes studied in stochastic optimization.
    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
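    For reference, the NDCG metric itself (not the paper's surrogate or its stochastic optimizer) can be computed as in the sketch below. It uses the common exponential-gain, log2-discount form; the function names are illustrative and are not taken from LibAUC.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items, in the given order."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(scores, relevances, k):
    """NDCG@k of the ranking induced by sorting items by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

    Because the sort makes the metric piecewise constant in the scores, it has no useful gradient, which is why differentiable surrogates such as those in the abstract are needed for training.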
    Multi-agent Reinforcement Learning with Graph Q-Networks for Antenna Tuning. (arXiv:2302.01199v1 [cs.NI])
    Future generations of mobile networks are expected to contain more and more antennas with growing complexity and more parameters. Optimizing these parameters is necessary for ensuring the good performance of the network. The scale of mobile networks makes it challenging to optimize antenna parameters using manual intervention or hand-engineered strategies. Reinforcement learning is a promising technique to address this challenge but existing methods often use local optimizations to scale to large network deployments. We propose a new multi-agent reinforcement learning algorithm to optimize mobile network configurations globally. By using a value decomposition approach, our algorithm can be trained from a global reward function instead of relying on an ad-hoc decomposition of the network performance across the different cells. The algorithm uses a graph neural network architecture which generalizes to different network topologies and learns coordination behaviors. We empirically demonstrate the performance of the algorithm on an antenna tilt tuning problem and a joint tilt and power control problem in a simulated environment.
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.
    From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey. (arXiv:2211.11259v2 [cs.HC] UPDATED)
    Context data is in demand more than ever with the rapid increase in the development of many context-aware Internet of Things applications. Research in context and context-awareness is being conducted to broaden its applicability in light of many practical and technical challenges. One of the challenges is improving performance when responding to a large number of context queries. Context Management Platforms that infer and deliver context to applications measure this problem using Quality of Service (QoS) parameters. Although caching is a proven way to improve QoS, the transiency of context and features such as the variability and heterogeneity of context queries pose an additional real-time cost management problem. This paper presents a critical survey of the state of the art in adaptive data caching with the objective of developing a body of knowledge in cost- and performance-efficient adaptive caching strategies. We comprehensively survey a large number of research publications and evaluate, compare, and contrast different techniques, policies, approaches, and schemes in adaptive caching. Our critical analysis is motivated by the focus on adaptively caching context as a core research problem. A formal definition for adaptive context caching is then proposed, followed by identified features and requirements of a well-designed, objective optimal adaptive context caching strategy.
    Uncertainty in Fairness Assessment: Maintaining Stable Conclusions Despite Fluctuations. (arXiv:2302.01079v1 [cs.LG])
    Several recent works encourage the use of a Bayesian framework when assessing performance and fairness metrics of a classification algorithm in a supervised setting. We propose the Uncertainty Matters (UM) framework that generalizes a Beta-Binomial approach to derive the posterior distribution of any criteria combination, allowing stable performance assessment in a bias-aware setting. We suggest modeling the confusion matrix of each demographic group using a Multinomial distribution updated through a Bayesian procedure. We extend UM to be applicable under the popular K-fold cross-validation procedure. Experiments highlight the benefits of UM over classical evaluation frameworks regarding informativeness and stability.
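    One way to picture a Bayesian update of a Multinomial confusion matrix is the conjugate Dirichlet posterior sketched below, from which the posterior of any derived metric (here accuracy) can be sampled. The symmetric prior, the function names, and the choice of accuracy as the metric are assumptions for illustration; the abstract does not state the UM framework's exact prior or sampling scheme.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def accuracy_posterior(tp, fp, fn, tn, prior=1.0, n_samples=5000, seed=0):
    """Posterior samples of accuracy for one group's confusion matrix.

    Assumes a symmetric Dirichlet(prior) over the four cell probabilities,
    updated with the observed counts.
    """
    rng = random.Random(seed)
    alphas = [tp + prior, fp + prior, fn + prior, tn + prior]
    samples = []
    for _ in range(n_samples):
        p_tp, _p_fp, _p_fn, p_tn = sample_dirichlet(alphas, rng)
        samples.append(p_tp + p_tn)
    return samples


samples = accuracy_posterior(tp=40, fp=10, fn=5, tn=45)
posterior_mean = sum(samples) / len(samples)
```

    Running the same function per demographic group and comparing the sample sets gives a posterior over a fairness gap rather than a single point estimate.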
    A Survey on Efficient Training of Transformers. (arXiv:2302.01107v1 [cs.LG])
    Recent advances in Transformers have come with a huge requirement on computing resources, highlighting the importance of developing efficient training techniques to make Transformer training faster, at lower cost, and to higher accuracy by the efficient use of computation and memory resources. This survey provides the first systematic overview of the efficient training of Transformers, covering the recent progress in acceleration arithmetic and hardware, with a focus on the former. We analyze and compare methods that save computation and memory costs for intermediate tensors during training, together with techniques on hardware/algorithm co-design. We finally discuss challenges and promising areas for future research.
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.
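    The core mechanism, a network whose hidden layer is drawn at random and frozen while only the output layer is fitted by linear least squares, can be sketched with NumPy as follows. The regression target here is a put-style payoff shape chosen only for illustration; in the optimal stopping setting the targets would be discounted future payoffs, and the names below are hypothetical.

```python
import numpy as np


def fit_random_feature_regressor(x, y, n_hidden=50, seed=0):
    """Fit only the last layer of a one-hidden-layer network by least squares."""
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are sampled once and never trained.
    w = rng.normal(size=(x.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)

    def features(inputs):
        hidden = np.tanh(inputs @ w + b)
        # Append a constant column so the output layer has a bias term.
        return np.hstack([hidden, np.ones((inputs.shape[0], 1))])

    beta, *_ = np.linalg.lstsq(features(x), y, rcond=None)
    return lambda inputs: features(inputs) @ beta


x_train = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
y_train = np.maximum(1.0 - x_train[:, 0], 0.0)   # illustrative payoff-like target
predict = fit_random_feature_regressor(x_train, y_train)
train_mse = float(np.mean((predict(x_train) - y_train) ** 2))
```

    Since the fit is a single linear solve, there is no iterative training loop to tune, which is the simplicity the abstract points to.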
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.
    LogLG: Weakly Supervised Log Anomaly Detection via Log-Event Graph Construction. (arXiv:2208.10833v4 [cs.SE] UPDATED)
    Fully supervised log anomaly detection methods suffer the heavy burden of annotating massive unlabeled log data. Recently, many semi-supervised methods have been proposed to reduce annotation costs with the help of parsed templates. However, these methods consider each keyword independently, which disregards the correlation between keywords and the contextual relationships among log sequences. In this paper, we propose a novel weakly supervised log anomaly detection framework, named LogLG, to explore the semantic connections among keywords from sequences. Specifically, we design an end-to-end iterative process, where the keywords of unlabeled logs are first extracted to construct a log-event graph. Then, we build a subgraph annotator to generate pseudo labels for unlabeled log sequences. To ameliorate the annotation quality, we adopt a self-supervised task to pre-train a subgraph annotator. After that, a detection model is trained with the generated pseudo labels. Conditioned on the classification results, we re-extract the keywords from the log sequences and update the log-event graph for the next iteration. Experiments on five benchmarks validate the effectiveness of LogLG for detecting anomalies on unlabeled log data and demonstrate that LogLG, as the state-of-the-art weakly supervised method, achieves significant performance improvements compared to existing methods.
    OntoED: Low-resource Event Detection with Ontology Embedding. (arXiv:2105.10922v4 [cs.IR] CROSS LISTED)
    Event Detection (ED) aims to identify event trigger words from a given text and classify them into event types. Most current ED methods rely heavily on training instances, and almost ignore the correlation of event types. Hence, they tend to suffer from data scarcity and fail to handle new unseen event types. To address these problems, we formulate ED as a process of event ontology population: linking event instances to pre-defined event types in event ontology, and propose a novel ED framework entitled OntoED with ontology embedding. We enrich event ontology with linkages among event types, and further induce more event-event correlations. Based on the event ontology, OntoED can leverage and propagate correlation knowledge, particularly from data-rich to data-poor event types. Furthermore, OntoED can be applied to new unseen event types, by establishing linkages to existing ones. Experiments indicate that OntoED is more predominant and robust than previous approaches to ED, especially in data-scarce scenarios.
    A general Markov decision process formalism for action-state entropy-regularized reward maximization. (arXiv:2302.01098v1 [cs.LG])
    Previous work has separately addressed different forms of action, state and action-state entropy regularization, pure exploration and space occupation. These problems have become extremely relevant for regularization, generalization, speeding up learning and providing robust solutions at unprecedented levels. However, solutions of those problems are hectic, ranging from convex and non-convex optimization, and unconstrained optimization to constrained optimization. Here we provide a general dual function formalism that transforms the constrained optimization problem into an unconstrained convex one for any mixture of action and state entropies. The cases with pure action entropy and pure state entropy are understood as limits of the mixture.
    Learning Globally Smooth Functions on Manifolds. (arXiv:2210.00301v3 [cs.LG] UPDATED)
    Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. This work combines techniques from semi-infinite constrained learning and manifold regularization to learn representations that are globally smooth on a manifold. To do so, it shows that under typical conditions the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. It is shown that under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Experiments on real world data illustrate the advantages of the proposed method relative to existing alternatives.
    Boundary-Aware Uncertainty for Feature Attribution Explainers. (arXiv:2210.02419v3 [cs.LG] UPDATED)
    Post-hoc explanation methods have become a critical tool for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations. Nevertheless, recent works have shown that many existing methods can be inconsistent or lack robustness. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. In this work, we propose a novel geodesic-based kernel which captures the complexity of the target black-box decision boundary. We show theoretically that the proposed kernel similarity increases with the complexity of the decision boundary. In addition, we introduce the Gaussian Process Explanation UnCertainty (GPEC) framework, which generates a unified uncertainty estimate combining decision boundary-aware uncertainty with existing explanation uncertainty methods. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method. Empirical results on multiple tabular and image datasets show that the GPEC uncertainty estimate improves understanding of explanations as compared to existing methods.
    Encouraging Intra-Class Diversity Through a Reverse Contrastive Loss for Better Single-Source Domain Generalization. (arXiv:2106.07916v2 [cs.CV] CROSS LISTED)
    Traditional deep learning algorithms often fail to generalize when they are tested outside of the domain of the training data. The issue can be mitigated by using unlabeled data from the target domain at training time, but because data distributions can change dynamically in real-life applications once a learned model is deployed, it is critical to create networks robust to unknown and unforeseen domain shifts. In this paper we focus on one of the reasons behind the inability of neural networks to be so: deep networks focus only on the most obvious, potentially spurious, clues to make their predictions and are blind to useful but slightly less efficient or more complex patterns. This behaviour has been identified and several methods have partially addressed the issue. To investigate their effectiveness and limits, we first design a publicly available MNIST-based benchmark to precisely measure the ability of an algorithm to find the "hidden" patterns. Then, we evaluate state-of-the-art algorithms through our benchmark and show that the issue is largely unsolved. Finally, we propose a partially reversed contrastive loss to encourage intra-class diversity and find less strongly correlated patterns, whose efficiency is demonstrated by our experiments.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed exponential step size gradient descent (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a polynomial number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is optimal and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
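    The effect of an exponentially growing step size can be seen on a toy homogeneous loss. The sketch below minimizes f(x) = x^4 / 4, whose curvature vanishes at the optimum, once with a fixed step and once with a step multiplied by a constant factor each iteration. The loss, the initial step 0.01, and the growth factor 1.1 are assumptions picked for illustration; the paper's conditions on the schedule are not given in the abstract.

```python
def gradient_descent(grad, x0, step, n_iters, growth=1.0):
    """Gradient descent with step size step * growth**t at iteration t.

    growth=1.0 gives the usual fixed-step method; growth > 1 gives an
    exponentially increasing step size.
    """
    x = x0
    for t in range(n_iters):
        x = x - step * (growth ** t) * grad(x)
    return x


def grad_quartic(x):
    """Gradient of the toy loss f(x) = x**4 / 4."""
    return x ** 3


fixed = gradient_descent(grad_quartic, x0=1.0, step=0.01, n_iters=200)
growing = gradient_descent(grad_quartic, x0=1.0, step=0.01, n_iters=200, growth=1.1)
```

    With these settings the fixed-step iterate is still far from zero after 200 iterations, while the growing-step iterate is several orders of magnitude closer. A growth factor that is too large for the loss at hand can make the iteration diverge, so the factor is a tuning choice in this sketch.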
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.
    Mini-Batch Learning Strategies for modeling long term temporal dependencies: A study in environmental applications. (arXiv:2210.08347v2 [cs.LG] UPDATED)
    In many environmental applications, recurrent neural networks (RNNs) are often used to model physical variables with long temporal dependencies. However, due to mini-batch training, temporal relationships between training segments within the batch (intra-batch) as well as between batches (inter-batch) are not considered, which can lead to limited performance. Stateful RNNs aim to address this issue by passing hidden states between batches. Since Stateful RNNs ignore intra-batch temporal dependency, there exists a trade-off between training stability and capturing temporal dependency. In this paper, we provide a quantitative comparison of different Stateful RNN modeling strategies, and propose two strategies to enforce both intra- and inter-batch temporal dependency. First, we extend Stateful RNNs by defining a batch as a temporally ordered set of training segments, which enables intra-batch sharing of temporal information. While this approach significantly improves the performance, it leads to much larger training times due to highly sequential training. To address this issue, we further propose a new strategy which augments a training segment with an initial value of the target variable from the timestep right before the starting of the training segment. In other words, we provide an initial value of the target variable as additional input so that the network can focus on learning changes relative to that initial value. By using this strategy, samples can be passed in any order (mini-batch training) which significantly reduces the training time while maintaining the performance. In demonstrating our approach in hydrological modeling, we observe that the most significant gains in predictive accuracy occur when these methods are applied to state variables whose values change more slowly, such as soil water and snowpack, rather than continuously moving flux variables such as streamflow.
    An Advantage Using Feature Selection with a Quantum Annealer. (arXiv:2211.09756v3 [quant-ph] UPDATED)
    Feature selection is a technique in statistical prediction modeling that identifies features in a record with a strong statistical connection to the target variable. Excluding features with a weak statistical connection to the target variable in training not only drops the dimension of the data, which decreases the time complexity of the algorithm, but also decreases noise within the data, which assists in avoiding overfitting. In all, feature selection assists in training a robust statistical model that performs well and is stable. Given the lack of scalability in classical computation, current techniques only consider the predictive power of the feature and not redundancy between the features themselves. Recent advancements in feature selection that leverage quantum annealing (QA) give a scalable technique that aims to maximize the predictive power of the features while minimizing redundancy. As a consequence, it is expected that this algorithm would assist in the bias/variance trade-off, yielding better features for training a statistical model. This paper tests this intuition against classical methods by utilizing open-source data sets and evaluating the efficacy of each statistical model trained with well-known prediction algorithms. The numerical results display an advantage utilizing the features selected from the algorithm that leveraged QA.
    Optimization-Based Separations for Neural Networks. (arXiv:2112.02393v3 [cs.LG] UPDATED)
    Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known results in which the deeper architecture leverages this advantage into a provable optimization guarantee. We prove that when the data are generated by a distribution with radial symmetry which satisfies some mild assumptions, gradient descent can efficiently learn ball indicator functions using a depth 2 neural network with two layers of sigmoidal activations, and where the hidden layer is held fixed throughout training. By building on and refining existing techniques for approximation lower bounds of neural networks with a single layer of non-linearities, we show that there are $d$-dimensional radial distributions on the data such that ball indicators cannot be learned efficiently by any algorithm to accuracy better than $\Omega(d^{-4})$, nor by a standard gradient descent implementation to accuracy better than a constant. These results establish what are, to the best of our knowledge, the first optimization-based separations where the approximation benefits of the stronger architecture provably manifest in practice. Our proof technique introduces new tools and ideas that may be of independent interest in the theoretical study of both the approximation and optimization of neural networks.
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, such as the case of modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, to popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors is within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.
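    For context on the baseline this abstract mentions, here is a minimal sketch of the classical Blahut-Arimoto iteration for a small discrete source. It is not the NERD estimator. The function name `blahut_arimoto`, the fixed iteration count, and the slope parameter `beta` (which selects a point on the rate-distortion curve) are choices made for this illustration.

```python
import math


def blahut_arimoto(p_x, dist, beta, n_iter=200):
    """Return (rate in bits, distortion) for source p_x and distortion matrix dist.

    dist[x][y] is the distortion of reproducing symbol x as y; beta > 0 is the
    Lagrange slope trading rate against distortion.
    """
    n_x, n_y = len(p_x), len(dist[0])
    q = [1.0 / n_y] * n_y  # output marginal, initialized uniform
    cond = []
    for _ in range(n_iter):
        # Update the test channel Q(y|x) proportional to q(y) * exp(-beta * d(x, y)).
        cond = []
        for x in range(n_x):
            row = [q[y] * math.exp(-beta * dist[x][y]) for y in range(n_y)]
            total = sum(row)
            cond.append([v / total for v in row])
        # Update the output marginal q(y) = sum_x p(x) Q(y|x).
        q = [sum(p_x[x] * cond[x][y] for x in range(n_x)) for y in range(n_y)]
    rate = 0.0
    distortion = 0.0
    for x in range(n_x):
        for y in range(n_y):
            joint = p_x[x] * cond[x][y]
            if joint > 0.0:
                rate += joint * math.log2(cond[x][y] / q[y])
            distortion += joint * dist[x][y]
    return rate, distortion


if __name__ == "__main__":
    # Uniform binary source with Hamming distortion.
    print(blahut_arimoto([0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]], beta=2.0))
```

    The inner loops cost time proportional to the product of the source and reproduction alphabet sizes per iteration, which gives a sense of why the abstract calls this approach challenging for large, high-dimensional datasets.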
    Information-theoretic limitations of data-based price discrimination. (arXiv:2204.12723v2 [cs.GT] UPDATED)
    This paper studies third-degree price discrimination (3PD) based on a random sample of valuation and covariate data, where the covariate is continuous, and the distribution of the data is unknown to the seller. The main results of this paper are twofold. The first set of results is pricing strategy independent and reveals the fundamental information-theoretic limitation of any data-based pricing strategy in revenue generation for two cases: 3PD and uniform pricing. The second set of results proposes the $K$-markets empirical revenue maximization (ERM) strategy and shows that the $K$-markets ERM and the uniform ERM strategies achieve the optimal rate of convergence in revenue to that generated by their respective true-distribution 3PD and uniform pricing optima. Our theoretical and numerical results suggest that the uniform (i.e., $1$-market) ERM strategy generates a larger revenue than the $K$-markets ERM strategy when the sample size is small enough, and vice versa.
    Revisiting Simple Regret: Fast Rates for Returning a Good Arm. (arXiv:2210.16913v2 [cs.LG] UPDATED)
    Simple regret is a natural and parameter-free performance criterion for pure exploration in multi-armed bandits yet is less popular than the probability of missing the best arm or an $\epsilon$-good arm, perhaps due to lack of easy ways to characterize it. In this paper, we make significant progress on minimizing simple regret in both data-rich ($T\ge n$) and data-poor ($T \le n$) regimes, where $n$ is the number of arms and $T$ is the number of samples. At its heart is our improved instance-dependent analysis of the well-known Sequential Halving (SH) algorithm, where we bound the probability of returning an arm whose mean reward is not within $\epsilon$ of the best (i.e., not $\epsilon$-good) for \textit{any} choice of $\epsilon>0$, although $\epsilon$ is not an input to SH. Our bound not only leads to an optimal worst-case simple regret bound of $\sqrt{n/T}$ up to logarithmic factors but also essentially matches the instance-dependent lower bound for returning an $\epsilon$-good arm reported by Katz-Samuels and Jamieson (2020). For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
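    As background for the abstract above, here is a minimal sketch of the standard Sequential Halving procedure: split the budget evenly over about log2(n) rounds, sample every surviving arm equally, and keep the better half. The function name, the `pull` callback, and the guard that forces at least one pull per arm are assumptions for illustration. The guard means the sketch can exceed the budget when the budget is small, so it does not cover the data-poor regime or the Bracketing SH variant.

```python
import math


def sequential_halving(pull, n_arms, budget):
    """Return the index of the arm that survives all halving rounds.

    `pull(arm)` is a hypothetical callback returning one reward sample.
    """
    survivors = list(range(n_arms))
    n_rounds = max(1, math.ceil(math.log2(n_arms)))
    for _ in range(n_rounds):
        if len(survivors) == 1:
            break
        # Equal share of the per-round budget for each surviving arm.
        pulls_per_arm = max(1, budget // (len(survivors) * n_rounds))
        means = {
            arm: sum(pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
            for arm in survivors
        }
        # Keep the half of the arms with the highest empirical means.
        survivors.sort(key=lambda arm: means[arm], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
    return survivors[0]


if __name__ == "__main__":
    true_means = [0.1, 0.5, 0.9, 0.3]
    # Noise-free rewards keep the demonstration deterministic.
    print(sequential_halving(lambda arm: true_means[arm], len(true_means), budget=100))
```

    Note that $\epsilon$ never appears in the procedure, consistent with the abstract's remark that it is not an input to SH.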
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    The KFIoU Loss for Rotated Object Detection. (arXiv:2201.12558v5 [cs.CV] UPDATED)
    Horizontal object detection is a well-developed area in which the computing-friendly IoU-based loss is readily adopted and fits well with the detection metrics. In contrast, rotation detectors often involve a more complicated loss based on SkewIoU, which is unfriendly to gradient-based training. In this paper, we propose an effective approximate SkewIoU loss based on Gaussian modeling and the Kalman filter, which mainly consists of two terms. The first term is a scale-insensitive center point loss, which is used to quickly get the center points between bounding boxes closer to assist the second term. In the distance-independent second term, the Kalman filter is adopted to inherently mimic the mechanism of SkewIoU by its definition, and we show its alignment with the SkewIoU loss at trend-level within a certain distance (i.e. within 9 pixels). This is in contrast to recent Gaussian modeling based rotation detectors, e.g. GWD loss and KLD loss, that involve a human-specified distribution distance metric, which requires additional hyperparameter tuning that varies across datasets and detectors. The resulting new loss, called KFIoU loss, is easier to implement and works better compared with the exact SkewIoU loss, thanks to its full differentiability and ability to handle the non-overlapping cases. We further extend our technique to the 3-D case, which also suffers from the same issues as 2-D detection. Extensive results on various public datasets (2-D/3-D, aerial/text/face images) with different base detectors show the effectiveness of our approach.
    Unravelling the Performance of Physics-informed Graph Neural Networks for Dynamical Systems. (arXiv:2211.05520v2 [cs.LG] UPDATED)
    Recently, graph neural networks have been gaining a lot of attention to simulate dynamical systems due to their inductive nature leading to zero-shot generalizability. Similarly, physics-informed inductive biases in deep-learning frameworks have been shown to give superior performance in learning the dynamics of physical systems. There is a growing volume of literature that attempts to combine these two approaches. Here, we evaluate the performance of thirteen different graph neural networks, namely, Hamiltonian and Lagrangian graph neural networks, graph neural ODE, and their variants with explicit constraints and different architectures. We briefly explain the theoretical formulation highlighting the similarities and differences in the inductive biases and graph architecture of these systems. We evaluate these models on spring, pendulum, gravitational, and 3D deformable solid systems to compare the performance in terms of rollout error, conserved quantities such as energy and momentum, and generalizability to unseen system sizes. Our study demonstrates that GNNs with additional inductive biases, such as explicit constraints and decoupling of kinetic and potential energies, exhibit significantly enhanced performance. Further, all the physics-informed GNNs exhibit zero-shot generalizability to system sizes an order of magnitude larger than the training system, thus providing a promising route to simulate large-scale realistic systems.
    Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection. (arXiv:1910.11621v2 [cs.CL] CROSS LISTED)
    Event detection (ED), a sub-task of event extraction, involves identifying triggers and categorizing event mentions. Existing methods primarily rely upon supervised learning and require large-scale labeled event datasets which are unfortunately not readily available in many real-life applications. In this paper, we consider and reformulate the ED task with limited labeled data as a Few-Shot Learning problem. We propose a Dynamic-Memory-Based Prototypical Network (DMB-PN), which exploits Dynamic Memory Network (DMN) to not only learn better prototypes for event types, but also produce more robust sentence encodings for event mentions. Unlike vanilla prototypical networks, which compute event prototypes by simple averaging and consume event mentions only once, our model is more robust and is capable of distilling contextual information from event mentions multiple times due to the multi-hop mechanism of DMNs. The experiments show that DMB-PN not only deals with sample scarcity better than a series of baseline models but also performs more robustly when the variety of event types is relatively large and the instance quantity is extremely small.
    Augmentation Component Analysis: Modeling Similarity via the Augmentation Overlaps. (arXiv:2206.00471v2 [cs.LG] UPDATED)
    Self-supervised learning aims to learn an embedding space where semantically similar samples are close. Contrastive learning methods pull views of samples together and push different samples away, which utilizes semantic invariance of augmentation but ignores the relationship between samples. To better exploit the power of augmentation, we observe that semantically similar samples are more likely to have similar augmented views. Therefore, we can take the augmented views as a special description of a sample. In this paper, we model such a description as the augmentation distribution and we call it augmentation feature. The similarity in augmentation feature reflects how much the views of two samples overlap and is related to their semantic similarity. Without computational burdens to explicitly estimate values of the augmentation feature, we propose Augmentation Component Analysis (ACA) with a contrastive-like loss to learn principal components and an on-the-fly projection loss to embed data. ACA equals an efficient dimension reduction by PCA and extracts low-dimensional embeddings, theoretically preserving the similarity of augmentation distribution between samples. Empirical results show our method can achieve competitive results against various traditional contrastive learning methods on different benchmarks.
    Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Communication Compression. (arXiv:2210.13277v3 [cs.LG] UPDATED)
    In federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
    $IC^3$: Image Captioning by Committee Consensus. (arXiv:2302.01328v1 [cs.CV])
    If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple, yet novel, method: "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as baseline SOTA models more than two thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- a natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.
    Dual PatchNorm. (arXiv:2302.01327v1 [cs.CV])
    We propose Dual PatchNorm: two Layer Normalization layers (LayerNorms), before and after the patch embedding layer in Vision Transformers. We demonstrate that Dual PatchNorm outperforms the result of exhaustive search for alternative LayerNorm placement strategies in the Transformer block itself. In our experiments, incorporating this trivial modification often leads to improved accuracy over well-tuned Vision Transformers and never hurts.
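    The placement described in this abstract (one LayerNorm before and one after the patch embedding) can be sketched in a few lines of dependency-free Python. The function names, the omission of learned scale and shift parameters in the LayerNorm, and the plain list-based linear projection are simplifications assumed for illustration, not the authors' implementation.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dual_patchnorm_embed(patches, weight, bias):
    """Apply LayerNorm -> linear patch embedding -> LayerNorm to each patch.

    `patches` is a list of flattened patches, `weight` a list of output rows
    (each the length of a patch), and `bias` one value per output row.
    """
    embedded = []
    for patch in patches:
        h = layer_norm(patch)  # first LayerNorm, on the raw flattened patch
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weight, bias)]
        embedded.append(layer_norm(z))  # second LayerNorm, on the embedding
    return embedded


if __name__ == "__main__":
    patches = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    bias = [0.0, 0.0, 0.0]
    for token in dual_patchnorm_embed(patches, weight, bias):
        print(token)
```

    In this toy input the second patch is the first scaled by 10, and the first LayerNorm removes that scale, so both patches map to nearly the same token.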
    Bayesian Metric Learning for Uncertainty Quantification in Image Retrieval. (arXiv:2302.01332v1 [cs.LG])
    We propose the first Bayesian encoder for metric learning. Rather than relying on neural amortization as done in prior works, we learn a distribution over the network weights with the Laplace Approximation. We actualize this by first proving that the contrastive loss is a valid log-posterior. We then propose three methods that ensure a positive definite Hessian. Lastly, we present a novel decomposition of the Generalized Gauss-Newton approximation. Empirically, we show that our Laplacian Metric Learner (LAM) estimates well-calibrated uncertainties, reliably detects out-of-distribution examples, and yields state-of-the-art predictive performance.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Computational Discovery of Microstructured Composites with Optimal Strength-Toughness Trade-Offs. (arXiv:2302.01078v1 [cond-mat.mtrl-sci])
    The conflict between strength and toughness is a fundamental problem in engineering materials design. However, systematic discovery of microstructured composites with optimal strength-toughness trade-offs has never been demonstrated due to the discrepancies between simulation and reality and the lack of data-efficient exploration of the entire Pareto front. Here, we report a widely applicable pipeline harnessing physical experiments, numerical simulations, and artificial neural networks to efficiently discover microstructured designs that are simultaneously tough and strong. Using a physics-based simulator with moderate complexity, our strategy runs a data-driven proposal-validation workflow in a nested-loop fashion to bridge the gap between simulation and reality in high sample efficiency. Without any prescribed expert knowledge of materials design, our approach automatically identifies existing toughness enhancement mechanisms that were traditionally discovered through trial-and-error or biomimicry. We provide a blueprint for the computational discovery of optimal designs, which inverts traditional scientific approaches, and is applicable to a wide range of research problems beyond composites, including polymer chemistry, fluid dynamics, meteorology, and robotics.
    Error estimates for physics informed neural networks approximating the Navier-Stokes equations. (arXiv:2203.09346v2 [math.NA] UPDATED)
    We prove rigorous bounds on the errors resulting from the approximation of the incompressible Navier-Stokes equations with (extended) physics informed neural networks. We show that the underlying PDE residual can be made arbitrarily small for tanh neural networks with two hidden layers. Moreover, the total error can be estimated in terms of the training error, network size and number of quadrature points. The theory is illustrated with numerical experiments.
    Predicting Molecule-Target Interaction by Learning Biomedical Network and Molecule Representations. (arXiv:2302.00981v1 [cs.LG])
    The study of molecule-target interaction is quite important for drug discovery in terms of target identification, pathway study, drug-drug interaction, etc. Most existing methodologies utilize either biomedical network information or molecule structural features to predict potential interaction links. However, the biomedical network information based methods usually suffer from the cold start problem, while structure based methods often give limited performance due to the structure/interaction assumption and data quality. To address these issues, we propose a pseudo-siamese Graph Neural Network method, namely MTINet+, which learns both biomedical network topological and molecule structural/chemical information as representations to predict the potential interaction of a given molecule and target pair. In MTINet+, 1-hop subgraphs of the given molecule and target pair are extracted from the known interactions of the biomedical network as topological information, while the molecule structural and chemical attributes are processed as molecule information. MTINet+ learns these two types of information as embedding features for predicting the pair link. In the experiments on different molecule-target interaction tasks, MTINet+ significantly outperforms the state-of-the-art baselines. In addition, in our designed network sparsity experiments, MTINet+ shows strong robustness against different sparse biomedical networks.
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite-sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improving the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Best Possible Q-Learning. (arXiv:2302.01188v1 [cs.LG])
    Fully decentralized learning, where the global information, i.e., the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose the best possible operator, a novel decentralized operator, and prove that the policies of agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, best possible Q-learning (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.
    Factor Fields: A Unified Framework for Neural Fields and Beyond. (arXiv:2302.01226v1 [cs.CV])
    We present Factor Fields, a novel framework for modeling and representing signals. Factor Fields decomposes a signal into a product of factors, each of which is represented by a neural or regular field representation operating on a coordinate transformed input signal. We show that this decomposition yields a unified framework that generalizes several recent signal representations including NeRF, PlenOxels, EG3D, Instant-NGP, and TensoRF. Moreover, the framework allows for the creation of powerful new signal representations, such as the Coefficient-Basis Factorization (CoBaFa) which we propose in this paper. As evidenced by our experiments, CoBaFa leads to improvements over previous fast reconstruction methods in terms of the three critical goals in neural signal representation: approximation quality, compactness and efficiency. Experimentally, we demonstrate that our representation achieves better image approximation quality on 2D image regression tasks, higher geometric quality when reconstructing 3D signed distance fields and higher compactness for radiance field reconstruction tasks compared to previous fast reconstruction methods. Besides, our CoBaFa representation enables generalization by sharing the basis across signals during training, enabling generalization tasks such as image regression with sparse observations and few-shot radiance field reconstruction.
    Diagrammatization: Rationalizing with diagrammatic AI explanations for abductive reasoning on hypotheses. (arXiv:2302.01241v1 [cs.AI])
    Many visualizations have been developed for explainable AI (XAI), but they often require further reasoning by users to interpret. We argue that XAI should support abductive reasoning - inference to the best explanation - with diagrammatic reasoning to convey hypothesis generation and evaluation. Inspired by Peircean diagrammatic reasoning and the 5-step abduction process, we propose Diagrammatization, an approach to provide diagrammatic, abductive explanations based on domain hypotheses. We implemented DiagramNet for a clinical application to predict diagnoses from heart auscultation, and explain with shape-based murmur diagrams. In modeling studies, we found that DiagramNet not only provides faithful murmur shape explanations, but also has better prediction performance than baseline models. We further demonstrate the usefulness of diagrammatic explanations in a qualitative user study with medical students, showing that clinically-relevant, diagrammatic explanations are preferred over technical saliency map explanations. This work contributes insights into providing domain-conventional abductive explanations for user-centric XAI.
    Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks With CommonRoad-Geometric. (arXiv:2302.01259v1 [cs.LG])
    Heterogeneous graphs offer powerful data representations for traffic, given their ability to model the complex interaction effects among a varying number of traffic participants and the underlying road infrastructure. With the recent advent of graph neural networks (GNNs) as the accompanying deep learning framework, the graph structure can be efficiently leveraged for various machine learning applications such as trajectory prediction. As a first of its kind, our proposed Python framework offers an easy-to-use and fully customizable data processing pipeline to extract standardized graph datasets from traffic scenarios. Providing a platform for GNN-based autonomous driving research, it improves comparability between approaches and allows researchers to focus on model implementation instead of dataset curation.
    Neuro Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal. (arXiv:2302.01242v1 [cs.LG])
    We introduce Neuro-Symbolic Continual Learning, where a model has to solve a sequence of neuro-symbolic tasks, that is, it has to map sub-symbolic inputs to high-level concepts and compute predictions by reasoning consistently with prior knowledge. Our key observation is that neuro-symbolic tasks, although different, often share concepts whose semantics remains stable over time. Traditional approaches fall short: existing continual strategies ignore knowledge altogether, while stock neuro-symbolic architectures suffer from catastrophic forgetting. We show that leveraging prior knowledge by combining neuro-symbolic architectures with continual strategies does help avoid catastrophic forgetting, but also that doing so can yield models affected by reasoning shortcuts. These undermine the semantics of the acquired concepts, even when detailed prior knowledge is provided upfront and inference is exact, and in turn continual performance. To overcome these issues, we introduce COOL, a COncept-level cOntinual Learning strategy tailored for neuro-symbolic continual problems that acquires high-quality concepts and remembers them over time. Our experiments on three novel benchmarks highlight how COOL attains sustained high performance on neuro-symbolic continual learning tasks in which other strategies fail.
    Federated Analytics: A survey. (arXiv:2302.01326v1 [cs.LG])
    Federated analytics (FA) is a privacy-preserving framework for computing data analytics over multiple remote parties (e.g., mobile devices) or siloed institutional entities (e.g., hospitals, banks) without sharing the data among parties. Motivated by the practical use cases of federated analytics, we present a systematic discussion of federated analytics in this article. In particular, we discuss the unique characteristics of federated analytics and how it differs from federated learning. We also explore a wide range of FA queries and discuss various existing solutions and potential use case applications for different FA queries.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Normalizing Flow Ensembles for Rich Aleatoric and Epistemic Uncertainty Modeling. (arXiv:2302.01312v1 [cs.LG])
    In this work, we demonstrate how to reliably estimate epistemic uncertainty while maintaining the flexibility needed to capture complicated aleatoric distributions. To this end, we propose an ensemble of Normalizing Flows (NF), which are state-of-the-art in modeling aleatoric uncertainty. The ensembles are created via sets of fixed dropout masks, making them less expensive than creating separate NF models. We demonstrate how to leverage the unique structure of NFs, base distributions, to estimate aleatoric uncertainty without relying on samples, provide a comprehensive set of baselines, and derive unbiased estimates for differential entropy. The methods were applied to a variety of experiments commonly used to benchmark aleatoric and epistemic uncertainty estimation: 1D sinusoidal data, 2D windy grid-world (Wet Chicken), Pendulum, and Hopper. In these experiments, we set up an active learning framework and evaluate each model's capability at measuring aleatoric and epistemic uncertainty. The results show the advantages of using NF ensembles in capturing complicated aleatoric uncertainty while maintaining accurate epistemic uncertainty estimates.
    UW-CVGAN: UnderWater Image Enhancement with Capsules Vectors Quantization. (arXiv:2302.01144v1 [cs.CV])
    The degradation in underwater images is due to wavelength-dependent light attenuation, scattering, and the diversity of the water types in which they are captured. Deep neural networks have advanced this field, providing autonomous models able to enhance underwater images. We introduce the Underwater Capsules Vectors GAN (UWCVGAN), based on the discrete feature quantization paradigm from VQGAN, for this task. The proposed UWCVGAN combines an encoding network, which compresses the image into its latent representation, with a decoding network, which reconstructs the enhanced image from the latent representation alone. In contrast with VQGAN, UWCVGAN achieves feature quantization by exploiting the clustering ability of a capsule layer, making the model completely trainable and easier to manage. The model obtains enhanced underwater images with high quality and fine details. Moreover, the trained encoder is independent of the decoder, so it can be embedded onto the collector as a compression algorithm that reduces the memory space required for the images by a factor of $3\times$. UWCVGAN is validated with quantitative and qualitative analysis on benchmark datasets, and we present metric results compared with the state of the art.
    A comparative study of statistical and machine learning models on near-real-time daily emissions prediction. (arXiv:2302.01152v1 [cs.AI])
    The rapid ascent in carbon dioxide emissions is a major cause of global warming and climate change, which pose a huge threat to human survival and have a far-reaching influence on the global ecosystem. Therefore, it is necessary to control carbon dioxide emissions effectively by accurately predicting and analyzing their trend in a timely manner, so as to provide a reference for carbon dioxide emissions mitigation measures. This paper aims to select a suitable model to predict near-real-time daily emissions based on univariate daily time-series data from January 1st, 2020 to September 30th, 2022 of all sectors (Power, Industry, Ground Transport, Residential, Domestic Aviation, International Aviation) in China. We propose six prediction models, which include three statistical models: Grey prediction (GM(1,1)), autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average with exogenous factors (SARIMAX); and three machine learning models: artificial neural network (ANN), random forest (RF) and long short term memory (LSTM). To evaluate the performance of these models, five criteria are introduced and discussed in detail: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and Coefficient of Determination ($R^2$). In the results, the three machine learning models perform better than the three statistical models, with the LSTM model performing best on all five criteria for daily emissions prediction, with an MSE of 3.5179e-04, an RMSE of 0.0187, an MAE of 0.0140, a MAPE of 14.8291% and an $R^2$ of 0.9844.
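    The five evaluation criteria named in this abstract have standard definitions, and a small worked example may make them concrete. The sketch below is illustrative only and is not taken from the paper: the function name `regression_metrics` and the toy numbers in the usage note are assumptions, and MAPE is computed under the assumption that no true value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, MAPE (in percent) and R^2 for two 1-D sequences.

    Hypothetical helper for illustration; not code from the paper.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    # MAPE is undefined when a true value is zero; this sketch assumes none are.
    mape = float(np.mean(np.abs(err / y_true)) * 100.0)

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

    For example, `regression_metrics([1, 2, 4], [1, 2, 2])` has a single error of 2 on the last point, so the MSE is 4/3 and the MAE is 2/3. Because RMSE is the square root of MSE, the two always rank models identically.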
    Temporal fusion transformer using variational mode decomposition for wind power forecasting. (arXiv:2302.01222v1 [cs.LG])
    The power output of a wind turbine depends on a variety of factors, including wind speed at different heights, wind direction, temperature and turbine properties. Wind speed and direction, in particular, have complex cycles and fluctuate dramatically, leading to large uncertainties in wind power output. This study uses variational mode decomposition (VMD) to decompose the wind power series and Temporal fusion transformer (TFT) to forecast wind power for the next 1h, 3h and 6h. The experimental results show that VMD outperforms other decomposition algorithms and the TFT model outperforms other decomposition models.
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.
    Dual Propagation: Accelerating Contrastive Hebbian Learning with Dyadic Neurons. (arXiv:2302.01228v1 [cs.LG])
    Activity difference based learning algorithms, such as contrastive Hebbian learning and equilibrium propagation, have been proposed as biologically plausible alternatives to error back-propagation. However, on traditional digital chips these algorithms suffer from having to solve a costly inference problem twice, making these approaches more than two orders of magnitude slower than back-propagation. In the analog realm equilibrium propagation may be promising for fast and energy efficient learning, but states still need to be inferred and stored twice. Inspired by lifted neural networks and compartmental neuron models we propose a simple energy based compartmental neuron model, termed dual propagation, in which each neuron is a dyad with two intrinsic states. At inference time these intrinsic states encode the error/activity duality through their difference and their mean respectively. The advantage of this method is that only a single inference phase is needed and that inference can be solved in layerwise closed-form. Experimentally we show on common computer vision datasets, including Imagenet32x32, that dual propagation performs equivalently to back-propagation both in terms of accuracy and runtime.
    Double Permutation Equivariance for Knowledge Graph Completion. (arXiv:2302.01313v1 [cs.LG])
    This work provides a formalization of Knowledge Graphs (KGs) as a new class of graphs that we denote doubly exchangeable attributed graphs, where node and pairwise (joint 2-node) representations must be equivariant to permutations of both node ids and edge (& node) attributes (relations & node features). Double-permutation equivariant KG representations open a new research direction in KGs. We show that this equivariance imposes a structural representation of relations that allows neural networks to perform complex logical reasoning tasks in KGs. Finally, we introduce a general blueprint for such equivariant representations and test a simple GNN-based double-permutation equivariant neural architecture that achieves 100% Hits@10 test accuracy in both the WN18RRv1 and NELL995v1 inductive KG completion tasks, and can accurately perform logical reasoning tasks that no existing methods can perform, to the best of our knowledge.
    Fed-GLOSS-DP: Federated, Global Learning using Synthetic Sets with Record Level Differential Privacy. (arXiv:2302.01068v1 [cs.LG])
    This work proposes Fed-GLOSS-DP, a novel approach to privacy-preserving learning that uses synthetic data to train federated models. In our approach, the server recovers an approximation of the global loss landscape in a local neighborhood based on synthetic samples received from the clients. In contrast to previous, point-wise, gradient-based, linear approximation (such as FedAvg), our formulation enables a type of global optimization that is particularly beneficial in non-IID federated settings. We also present how it rigorously complements record-level differential privacy. Extensive results show that our novel formulation gives rise to considerable improvements in terms of convergence speed and communication costs. We argue that our new approach to federated learning can provide a potential path toward reconciling privacy and accountability by sending differentially private, synthetic data instead of gradient updates. The source code will be released upon publication.
    Practical Bandits: An Industry Perspective. (arXiv:2302.01223v1 [cs.LG])
    The bandit paradigm provides a unified modeling framework for problems that require decision-making under uncertainty. Because many business metrics can be viewed as rewards (a.k.a. utilities) that result from actions, bandit algorithms have seen a large and growing interest from industrial applications, such as search, recommendation and advertising. Indeed, with the bandit lens comes the promise of direct optimisation for the metrics we care about. Nevertheless, the road to successfully applying bandits in production is not an easy one. Even when the action space and rewards are well-defined, practitioners still need to make decisions regarding multi-arm or contextual approaches, on- or off-policy setups, delayed or immediate feedback, myopic or long-term optimisation, etc. To make matters worse, industrial platforms typically give rise to large action spaces in which existing approaches tend to break down. The research literature on these topics is broad and vast, but this can overwhelm practitioners, whose primary aim is to solve practical problems, and therefore need to decide on a specific instantiation or approach for each project. This tutorial will take a step towards filling that gap between the theory and practice of bandits. Our goal is to present a unified overview of the field and its existing terminology, concepts and algorithms -- with a focus on problems relevant to industry. We hope our industrial perspective will help future practitioners who wish to leverage the bandit paradigm for their application.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Reliable Prediction Intervals with Directly Optimized Inductive Conformal Regression for Deep Learning. (arXiv:2302.00872v1 [cs.LG])
    By generating prediction intervals (PIs) to quantify the uncertainty of each prediction in deep learning regression, the risk of wrong predictions can be effectively controlled. High-quality PIs need to be as narrow as possible, whilst covering a preset proportion of real labels. At present, many approaches to improve the quality of PIs can effectively reduce the width of PIs, but they do not ensure that enough real labels are captured. Inductive Conformal Predictor (ICP) is an algorithm that can generate effective PIs which are theoretically guaranteed to cover a preset proportion of data. However, ICP is typically not directly optimized to yield minimal PI width. In this study, we use Directly Optimized Inductive Conformal Regression (DOICR), which takes only the average width of PIs as the loss function and increases the quality of PIs through an optimized scheme under the validity condition that sufficient real labels are captured in the PIs. Benchmark experiments show that DOICR outperforms current state-of-the-art algorithms for regression problems using underlying Deep Neural Network structures for both tabular and image data.
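    The coverage guarantee of an inductive (split) conformal predictor comes from a simple calibration step, which may be easier to see in code. The sketch below shows the textbook variant with absolute residuals as nonconformity scores; it is not DOICR, which additionally optimizes interval width. The function name `conformal_interval` and its parameters are assumptions made for illustration.

```python
import numpy as np

def conformal_interval(cal_true, cal_pred, test_pred, alpha=0.1):
    """Split (inductive) conformal intervals from absolute calibration residuals.

    Hypothetical helper: returns (lower, upper) arrays that cover the true
    label with probability at least 1 - alpha when calibration and test
    points are exchangeable.
    """
    scores = np.abs(np.asarray(cal_true, dtype=float)
                    - np.asarray(cal_pred, dtype=float))
    n = scores.size
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) interval is valid.
        q = np.inf
    else:
        q = np.sort(scores)[k - 1]
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q
```

    With calibration residuals 1, 2 and 3 and `alpha=0.5`, the rank is 2, so every test prediction receives a half-width of 2. Every interval has the same width in this variant, which is one reason width-aware methods are of interest.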
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
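    The second stage of the pipeline described here, multidimensional scaling, turns pairwise judgments into coordinates. The sketch below implements classical (Torgerson) MDS on a dissimilarity matrix. It is an illustration under assumptions: the abstract does not say which MDS variant is used, the function name `classical_mds` is hypothetical, and converting elicited similarity scores into dissimilarities is left to the caller.

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Embed n items in `dim` dimensions from an (n, n) dissimilarity matrix.

    Classical (Torgerson) MDS: double-center the squared dissimilarities and
    keep the leading eigenvectors. Hypothetical helper for illustration.
    """
    D = np.asarray(dist, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix of centered points
    w, v = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # indices of the largest ones
    w_top = np.clip(w[idx], 0.0, None)       # guard against tiny negatives
    return v[:, idx] * np.sqrt(w_top)
```

    When the dissimilarities are exact Euclidean distances between points in at most `dim` dimensions, the embedding reproduces those distances up to rotation and reflection.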
    Are Diffusion Models Vulnerable to Membership Inference Attacks?. (arXiv:2302.01316v1 [cs.CV])
    Diffusion-based generative models have shown great potential for image synthesis, but there is a lack of research on the security and privacy risks they may pose. In this paper, we investigate the vulnerability of diffusion models to Membership Inference Attacks (MIAs), a common privacy concern. Our results indicate that existing MIAs designed for GANs or VAEs are largely ineffective on diffusion models, either due to inapplicable scenarios (e.g., requiring the discriminator of GANs) or inappropriate assumptions (e.g., closer distances between synthetic images and member images). To address this gap, we propose Step-wise Error Comparing Membership Inference (SecMI), a black-box MIA that infers memberships by assessing the matching of forward process posterior estimation at each timestep. SecMI follows the common overfitting assumption in MIA where member samples normally have smaller estimation errors, compared with hold-out samples. We consider both the standard diffusion models, e.g., DDPM, and the text-to-image diffusion models, e.g., Stable Diffusion. Experimental results demonstrate that our methods precisely infer the membership with high confidence in both scenarios across six different datasets.
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows, which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration for GANs. We introduce a unified generative modeling framework, MonoFlow, in which the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure that first obtains MonoFlow's vector field by training the discriminator, after which the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggests that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.
    Convolutional Autoencoders, Clustering and POD for Low-dimensional Parametrization of Navier-Stokes Equations. (arXiv:2302.01278v1 [math.DS])
    Simulations of large-scale dynamical systems require expensive computations. Low-dimensional parametrization of high-dimensional states such as Proper Orthogonal Decomposition (POD) can be a solution to lessen the burdens by providing a certain compromise between accuracy and model complexity. However, for really low-dimensional parametrizations (for example for controller design) linear methods like the POD come to their natural limits so that nonlinear approaches will be the methods of choice. In this work we propose a convolutional autoencoder (CAE) consisting of a nonlinear encoder and an affine linear decoder and consider combinations with k-means clustering for improved encoding performance. The proposed set of methods is compared to the standard POD approach in two cylinder-wake scenarios modeled by the incompressible Navier-Stokes equations.
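    For reference, the POD baseline mentioned in this abstract can be computed from a snapshot matrix with a singular value decomposition. The sketch below is a generic illustration, not the paper's implementation: the function names `pod_basis` and `pod_reconstruct`, the one-state-per-column layout, and the mean subtraction are assumptions. With the mean included, the reconstruction is an affine linear map of the reduced coordinates, which is also the decoder form the abstract describes for the CAE.

```python
import numpy as np

def pod_basis(snapshots, r):
    """Return the mean state, the leading r POD modes, and reduced coordinates.

    snapshots: array of shape (n_state, n_snapshots), one state per column.
    Hypothetical helper for illustration.
    """
    X = np.asarray(snapshots, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    # Left singular vectors of the centered data are the POD modes,
    # ordered by decreasing singular value.
    U, s, _ = np.linalg.svd(X - mean, full_matrices=False)
    modes = U[:, :r]
    coords = modes.T @ (X - mean)   # linear encoding into r coordinates
    return mean, modes, coords

def pod_reconstruct(mean, modes, coords):
    """Affine linear decoding: mean state plus a combination of the modes."""
    return mean + modes @ coords
```

    If the centered snapshots have rank at most `r`, the reconstruction is exact; otherwise the squared error equals the sum of the discarded squared singular values.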
    MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks. (arXiv:2302.01301v1 [cs.LG])
    Fast and efficient transport protocols are the foundation of an increasingly distributed world. The burden of continuously delivering improved communication performance to support next-generation applications and services, combined with the increasing heterogeneity of systems and network technologies, has promoted the design of Congestion Control (CC) algorithms that perform well under specific environments. The challenge of designing a generic CC algorithm that can adapt to a broad range of scenarios is still an open research question. To tackle this challenge, we propose to apply a novel Reinforcement Learning (RL) approach. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return and models the learning process as an infinite-horizon task. We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch that researchers have encountered when applying RL to CC. We evaluated our solution on the task of file transfer and compared it to TCP Cubic. While further research is required, results have shown that MARLIN can achieve comparable results to TCP with little hyperparameter tuning, in a task significantly different from its training setting. Therefore, we believe that our work represents a promising first step toward building CC algorithms based on the maximum entropy RL framework.
    Energy Efficiency of Training Neural Network Architectures: An Empirical Study. (arXiv:2302.00967v1 [cs.LG])
    The evaluation of Deep Learning models has traditionally focused on criteria such as accuracy, F1 score, and related measures. The increasing availability of high computational power environments allows the creation of deeper and more complex models. However, the computations needed to train such models entail a large carbon footprint. In this work, we study the relations between DL model architectures and their environmental impact in terms of energy consumed and CO$_2$ emissions produced during training by means of an empirical study using Deep Convolutional Neural Networks. Concretely, we study: (i) the impact of the architecture and the location where the computations are hosted on the energy consumption and emissions produced; (ii) the trade-off between accuracy and energy efficiency; and (iii) the difference between software-based and hardware-based tools for measuring the energy consumed.
    Curriculum Learning for ab initio Deep Learned Refractive Optics. (arXiv:2302.01089v1 [cs.CV])
    Deep lens optimization has recently emerged as a new paradigm for designing computational imaging systems; however, it has been limited to either simple optical systems consisting of a single DOE or metalens, or the fine-tuning of compound lenses from good initial designs. Here we present a deep lens design method based on curriculum learning, which is able to learn optical designs of compound lenses ab initio from randomly initialized surfaces, therefore overcoming the need for a good initial design. We demonstrate this approach with the fully-automatic design of an extended depth-of-field computational camera in a cellphone-style form factor, highly aspherical surfaces, and a short back focal length.
    Convolutional Neural Operators. (arXiv:2302.01178v1 [cs.LG])
    Although very successfully used in machine learning, convolution based neural network architectures -- believed to be inconsistent in function space -- have been largely ignored in the context of learning solution operators of PDEs. Here, we adapt convolutional neural networks to demonstrate that they are indeed able to process functions as inputs and outputs. The resulting architecture, termed convolutional neural operators (CNOs), is shown to significantly outperform competing models on benchmark experiments, paving the way for the design of an alternative robust and accurate framework for learning operators.
    Online Bidding in Repeated Non-Truthful Auctions under Budget and ROI Constraints. (arXiv:2302.01203v1 [cs.GT])
    Online advertising platforms typically use auction mechanisms to allocate ad placements. Advertisers participate in a series of repeated auctions, and must select bids that will maximize their overall rewards while adhering to certain constraints. We focus on the scenario in which the advertiser has budget and return-on-investment (ROI) constraints. We investigate the problem of budget- and ROI-constrained bidding in repeated non-truthful auctions, such as first-price auctions, and present a best-of-both-worlds framework with no-regret guarantees under both stochastic and adversarial inputs. By utilizing the notion of interval regret, we demonstrate that our framework does not require knowledge of specific parameters of the problem which could be difficult to determine in practice. Our proof techniques can be applied to both the adversarial and stochastic cases with minimal modifications, thereby providing a unified perspective on the two problems. In the adversarial setting, we also show that it is possible to loosen the traditional requirement of having a strictly feasible solution to the offline optimization problem at each round.
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are, for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works by finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches do and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.
    Speed-Oblivious Online Scheduling: Knowing (Precise) Speeds is not Necessary. (arXiv:2302.00985v1 [cs.DS])
    We consider online scheduling on unrelated (heterogeneous) machines in a speed-oblivious setting, where an algorithm is unaware of the exact job-dependent processing speeds. We show strong impossibility results for clairvoyant and non-clairvoyant algorithms and overcome them in models inspired by practical settings: (i) we provide competitive learning-augmented algorithms, assuming that (possibly erroneous) predictions on the speeds are given, and (ii) we provide competitive algorithms for the speed-ordered model, where a single global order of machines according to their unknown job-dependent speeds is known. We prove strong theoretical guarantees and evaluate our findings on a representative heterogeneous multi-core processor. These seem to be the first empirical results for algorithms with predictions that are performed in a non-synthetic environment on real hardware.
    Imitating careful experts to avoid catastrophic events. (arXiv:2302.01193v1 [cs.LG])
    RL is increasingly being used to control robotic systems that interact closely with humans. This interaction raises the problem of safe RL: how to ensure that an RL-controlled robotic system never, for instance, injures a human. This problem is especially challenging in rich, realistic settings where it is not even possible to clearly write down a reward function which incorporates these outcomes. In these circumstances, perhaps the only viable approach is based on IRL, which infers rewards from human demonstrations. However, IRL is massively underdetermined as many different rewards can lead to the same optimal policies; we show that this makes it difficult to distinguish catastrophic outcomes (such as injuring a human) from merely undesirable outcomes. Our key insight is that humans do display different behaviour when catastrophic outcomes are possible: they become much more careful. We incorporate carefulness signals into IRL, and find that they do indeed allow IRL to disambiguate undesirable from catastrophic outcomes, which is critical to ensuring safety in future real-world human-robot interactions.
    Graph Neural Networks for temporal graphs: State of the art, open challenges, and opportunities. (arXiv:2302.01018v1 [cs.LG])
    Graph Neural Networks (GNNs) have become the leading paradigm for learning on (static) graph-structured data. However, many real-world systems are dynamic in nature, since the graph and node/edge attributes change over time. In recent years, GNN-based models for temporal graphs have emerged as a promising area of research to extend the capabilities of GNNs. In this work, we provide the first comprehensive overview of the current state of the art in temporal GNNs, introducing a rigorous formalization of learning settings and tasks and a novel taxonomy categorizing existing approaches in terms of how the temporal aspect is represented and processed. We conclude the survey with a discussion of the most relevant open challenges for the field, from both research and application perspectives.
    Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation. (arXiv:2302.01094v1 [cs.LG])
    This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another informative cue. Confidence reflects whether the individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity. That is, we need to consider both properties so as to make more accurate estimates. To this end, we use the nuclear norm that has been shown to be effective in characterizing both properties. Extensive experiments validate the effectiveness of nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in accuracy estimation than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we investigate the limitation of the nuclear norm, study its improved variant under severe class imbalance, and discuss potential directions.
    Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs. (arXiv:2302.01204v1 [cs.LG])
    Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real-world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: (i) how to compare graph snapshots across time, (ii) how to capture temporal dependencies, and (iii) how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD) which uses the spectrum of graph Laplacian as the low dimensional embedding of the graph structure at each snapshot. LAD explicitly models short term and long term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that (i) LAD and MultiLAD are accurate and outperform state-of-the-art baselines and their multi-view extensions by a large margin, (ii) MultiLAD's advantage over contenders significantly increases when additional views are available, and (iii) MultiLAD is highly robust to noise from individual views. In five real-world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.
    Human not in the loop: objective sample difficulty measures for Curriculum Learning. (arXiv:2302.01243v1 [cs.CV])
    Curriculum learning is a learning method that trains models in a meaningful order from easier to harder samples. A key here is to devise automatic and objective difficulty measures of samples. In the medical domain, previous work applied domain knowledge from human experts to qualitatively assess classification difficulty of medical images to guide curriculum learning, which requires extra annotation efforts, relies on subjective human experience, and may introduce bias. In this work, we propose a new automated curriculum learning technique using the variance of gradients (VoG) to compute an objective difficulty measure of samples and evaluate its effects on elbow fracture classification from X-ray images. Specifically, we used VoG as a metric to rank each sample in terms of the classification difficulty, where high VoG scores indicate more difficult cases for classification, to guide the curriculum training process. We compared the proposed technique to a baseline (without curriculum learning), a previous method that used human annotations on classification difficulty, and anti-curriculum learning. Our experimental results showed comparable and higher performance for the binary and multi-class bone fracture classification tasks.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts ("the bottleneck") and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.
    Construction and Applications of Billion-Scale Pre-trained Multimodal Business Knowledge Graph. (arXiv:2209.15214v3 [cs.AI] CROSS LISTED)
    Business Knowledge Graphs (KGs) are important to many enterprises today, providing factual knowledge and structured data that steer many products and make them more intelligent. Despite their promising benefits, building a business KG necessitates solving prohibitive issues of deficient structure and multiple modalities. In this paper, we advance the understanding of the practical challenges related to building KGs in non-trivial real-world systems. We introduce the process of building an open business knowledge graph (OpenBG) derived from a well-known enterprise, Alibaba Group. Specifically, we define a core ontology to cover various abstract products and consumption demands, with fine-grained taxonomy and multimodal facts in deployed applications. OpenBG is an open business KG of unprecedented scale: 2.6 billion triples with more than 88 million entities covering over 1 million core classes/concepts and 2,681 types of relations. We release all the open resources (OpenBG benchmarks) derived from it for the community and report experimental results of KG-centric tasks. We also ran an online competition based on OpenBG benchmarks, which has attracted thousands of teams. We further pre-train OpenBG and apply it to many KG-enhanced downstream tasks in business scenarios, demonstrating the effectiveness of billion-scale multimodal knowledge for e-commerce. All the resources with codes have been released at https://github.com/OpenBGBenchmark/OpenBG.
    Constrained Online Two-stage Stochastic Optimization: New Algorithms via Adversarial Learning. (arXiv:2302.00997v1 [cs.LG])
    We consider an online two-stage stochastic optimization with long-term constraints over a finite horizon of $T$ periods. At each period, we take the first-stage action, observe a model parameter realization and then take the second-stage action from a feasible set that depends both on the first-stage decision and the model parameter. We aim to minimize the cumulative objective value while guaranteeing that the long-term average second-stage decision belongs to a set. We propose a general algorithmic framework that derives online algorithms for the online two-stage problem from adversarial learning algorithms. Also, the regret bound of our algorithm can be reduced to the regret bound of embedded adversarial learning algorithms. Based on our framework, we obtain new results under various settings. When the model parameter at each period is drawn from identical distributions, we derive a state-of-the-art regret bound that improves previous bounds under special cases. Our algorithm is also robust to adversarial corruptions of model parameter realizations. When the model parameters are drawn from unknown non-stationary distributions and we are given prior estimates of the distributions, we develop a new algorithm from our framework with a regret $O(W_T+\sqrt{T})$, where $W_T$ measures the total inaccuracy of the prior estimates.
    Surprising Instabilities in Training Deep Networks and a Theoretical Analysis. (arXiv:2206.02001v3 [cs.LG] UPDATED)
    We discover restrained numerical instabilities in current training practices of deep networks with stochastic gradient descent (SGD). We show numerical error (on the order of the smallest floating point bit) induced from floating point arithmetic in training deep nets can be amplified significantly and result in significant test accuracy variance, comparable to the test accuracy variance due to stochasticity in SGD. We show how this is likely traced to instabilities of the optimization dynamics that are restrained, i.e., localized over iterations and regions of the weight tensor space. We do this by presenting a theoretical framework using numerical analysis of partial differential equations (PDE), and analyzing the gradient descent PDE of convolutional neural networks (CNNs). We show that it is stable only under certain conditions on the learning rate and weight decay. We show that rather than blowing up when the conditions are violated, the instability can be restrained. We show this is a consequence of the non-linear PDE associated with the gradient descent of the CNN, whose local linearization changes when over-driving the step size of the discretization, resulting in a stabilizing effect. We link restrained instabilities to the recently discovered Edge of Stability (EoS) phenomena, in which the stable step size predicted by classical theory is exceeded while continuing to optimize the loss and still converging. Because restrained instabilities occur at the EoS, our theory provides new predictions about the EoS, in particular, the role of regularization and the dependence on the network complexity.
    Prediction-Powered Inference. (arXiv:2301.09633v2 [stat.ML] UPDATED)
    We introduce prediction-powered inference, a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.
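    As a rough illustration of the idea for the simplest target (a mean), the sketch below combines the average prediction on a large unlabeled set with a correction term measured on a small labeled set. The function names, the synthetic data, and the hypothetical `predict` model are assumptions made for this example and are not taken from the paper's code; the interval uses a normal approximation.

```python
import math
import random
from statistics import NormalDist, mean, variance


def ppi_mean_interval(y_labeled, pred_labeled, pred_unlabeled, alpha=0.05):
    """Sketch of a prediction-powered confidence interval for a mean."""
    n, big_n = len(y_labeled), len(pred_unlabeled)
    # Rectifier: prediction error observed on the labeled data.
    rectifier = [f - y for f, y in zip(pred_labeled, y_labeled)]
    # Average prediction on unlabeled data, corrected by the mean error.
    estimate = mean(pred_unlabeled) - mean(rectifier)
    std_err = math.sqrt(variance(pred_unlabeled) / big_n + variance(rectifier) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z * std_err, estimate + z * std_err


def classical_mean_interval(y_labeled, alpha=0.05):
    """Standard normal-approximation interval using labeled data only."""
    n = len(y_labeled)
    std_err = math.sqrt(variance(y_labeled) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return mean(y_labeled) - z * std_err, mean(y_labeled) + z * std_err


rng = random.Random(0)


def predict(y):
    # Hypothetical model: slightly biased, moderately noisy predictions.
    return y + rng.gauss(0.1, 0.2)


# Synthetic data (assumption): 200 labeled points, 10,000 unlabeled points.
y_lab = [rng.gauss(1.0, 1.0) for _ in range(200)]
pred_lab = [predict(y) for y in y_lab]
pred_unl = [predict(rng.gauss(1.0, 1.0)) for _ in range(10000)]

ppi_lo, ppi_hi = ppi_mean_interval(y_lab, pred_lab, pred_unl)
cl_lo, cl_hi = classical_mean_interval(y_lab)
print(f"prediction-powered: [{ppi_lo:.3f}, {ppi_hi:.3f}]")
print(f"classical:          [{cl_lo:.3f}, {cl_hi:.3f}]")
```

    In this synthetic setting the predictions are accurate, so the rectifier has low variance and the prediction-powered interval comes out narrower than the labeled-only interval, which matches the abstract's claim that better predictions give smaller confidence intervals.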
    GANalyzer: Analysis and Manipulation of GANs Latent Space for Controllable Face Synthesis. (arXiv:2302.00908v1 [cs.CV])
    Generative Adversarial Networks (GANs) are capable of synthesizing high-quality facial images. Despite their success, GANs do not provide any information about the relationship between the input vectors and the generated images. Currently, facial GANs are trained on imbalanced datasets, which generate less diverse images. For example, more than 77% of 100K images that we randomly synthesized using the StyleGAN3 are classified as Happy, and only around 3% are Angry. The problem even becomes worse when a mixture of facial attributes is desired: less than 1% of the generated samples are Angry Woman, and only around 2% are Happy Black. To address these problems, this paper proposes a framework, called GANalyzer, for the analysis, and manipulation of the latent space of well-trained GANs. GANalyzer consists of a set of transformation functions designed to manipulate latent vectors for a specific facial attribute such as facial Expression, Age, Gender, and Race. We analyze facial attribute entanglement in the latent space of GANs and apply the proposed transformation for editing the disentangled facial attributes. Our experimental results demonstrate the strength of GANalyzer in editing facial attributes and generating any desired faces. We also create and release a balanced photo-realistic human face dataset. Our code is publicly available on GitHub.
    Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start". (arXiv:2302.00932v1 [cs.LG])
    Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe "cold-start" problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.
    FCB-SwinV2 Transformer for Polyp Segmentation. (arXiv:2302.01027v1 [cs.CV])
    Polyp segmentation within colonoscopy video frames using deep learning models has the potential to automate the workflow of clinicians. This could help improve the early detection rate and characterization of polyps which could progress to colorectal cancer. Recent state-of-the-art deep learning polyp segmentation models have combined the outputs of Fully Convolutional Network architectures and Transformer Network architectures which work in parallel. In this paper we propose modifications to the current state-of-the-art polyp segmentation model FCBFormer. The transformer architecture of the FCBFormer is replaced with a SwinV2 Transformer-UNET and minor changes to the Fully Convolutional Network architecture are made to create the FCB-SwinV2 Transformer. The performance of the FCB-SwinV2 Transformer is evaluated on the popular colonoscopy segmentation benchmarking datasets Kvasir-SEG and CVC-ClinicDB. Generalizability tests are also conducted. The FCB-SwinV2 Transformer is able to consistently achieve higher mDice scores across all tests conducted and therefore represents new state-of-the-art performance. Issues found with how colonoscopy segmentation model performance is evaluated within literature are also reported and discussed. One of the most important issues identified is that when evaluating performance on the CVC-ClinicDB dataset it would be preferable to ensure no data leakage from video sequences occurs during the training/validation/test data partition.
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.
    Symbolic Physics Learner: Discovering governing equations via Monte Carlo tree search. (arXiv:2205.13134v2 [cs.AI] UPDATED)
    Nonlinear dynamics is ubiquitous in nature and commonly seen in various science and engineering disciplines. Distilling analytical expressions that govern nonlinear dynamics from limited data remains vital but challenging. To tackle this fundamental issue, we propose a novel Symbolic Physics Learner (SPL) machine to discover the mathematical structure of nonlinear dynamics. The key concept is to interpret mathematical operations and system state variables by computational rules and symbols, establish symbolic reasoning of mathematical formulas via expression trees, and employ a Monte Carlo tree search (MCTS) agent to explore optimal expression trees based on measurement data. The MCTS agent obtains an optimistic selection policy through the traversal of expression trees, featuring the one that maps to the arithmetic expression of underlying physics. Salient features of the proposed framework include search flexibility and enforcement of parsimony for discovered equations. The efficacy and superiority of the SPL machine are demonstrated by numerical examples, compared with state-of-the-art baselines.
    Model Monitoring and Robustness of In-Use Machine Learning Models: Quantifying Data Distribution Shifts Using Population Stability Index. (arXiv:2302.00775v1 [cs.LG])
    Safety goes first. Meeting and maintaining industry safety standards for robustness of artificial intelligence (AI) and machine learning (ML) models require continuous monitoring for faults and performance drops. Deep learning models are widely used in industrial applications, e.g., computer vision, but the susceptibility of their performance to environment changes (e.g., noise) after deployment on the product is now well known. A major challenge is detecting data distribution shifts that happen, comparing the following: (i) development stage of AI and ML models, i.e., train/validation/test, to (ii) deployment stage on the product (i.e., even after 'testing') in the environment. We focus on a computer vision example related to autonomous driving and aim at detecting shifts that occur as a result of adding noise to images. We use the population stability index (PSI) as a measure of presence and intensity of shift and present results of our empirical experiments showing a promising potential for the PSI. We further discuss multiple aspects of model monitoring and robustness that need to be analyzed simultaneously to achieve robustness for industry safety standards. We propose the need for and the research direction toward categorizations of problem classes and examples where monitoring for robustness is required and present challenges and pointers for future work from a practical perspective.
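    A minimal sketch of how a PSI value can be computed for a single scalar feature (for instance, a per-image summary statistic) is given below. The equal-width binning, the number of bins, the `eps` floor for empty bins, and the synthetic data are assumptions made for this illustration; the abstract does not specify the paper's exact setup.

```python
import math
import random


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a monitored sample of one feature."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the reference range; fall back to 1.0 if degenerate.
    width = (hi - lo) / n_bins or 1.0

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp outliers into edge bins
            counts[idx] += 1
        # Floor each fraction at eps so the logarithm below stays finite.
        return [max(c / len(values), eps) for c in counts]

    p = bin_fractions(expected)
    q = bin_fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


rng = random.Random(0)
# Synthetic stand-in for a development-stage feature distribution.
reference = [rng.gauss(0.5, 0.1) for _ in range(5000)]
# Deployment-stage data: the same values with additive noise.
noisy = [x + rng.gauss(0.0, 0.2) for x in reference]

psi_same = population_stability_index(reference, reference)
psi_noisy = population_stability_index(reference, noisy)
print(f"PSI (no shift): {psi_same:.4f}")
print(f"PSI (noise added): {psi_noisy:.4f}")
```

    Every term of the sum is non-negative, so the index is zero only when the two binned distributions agree and grows as they diverge. Any threshold for raising an alert would have to be chosen for the application at hand.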
    Efficient Graph Field Integrators Meet Point Clouds. (arXiv:2302.00942v1 [cs.LG])
    We present two new classes of algorithms for efficient field integration on graphs encoding point clouds. The first class, SeparatorFactorization(SF), leverages the bounded genus of point cloud mesh graphs, while the second class, RFDiffusion(RFD), uses popular epsilon-nearest-neighbor graph representations for point clouds. Both can be viewed as providing the functionality of Fast Multipole Methods (FMMs), which have had a tremendous impact on efficient integration, but for non-Euclidean spaces. We focus on geometries induced by distributions of walk lengths between points (e.g., shortest-path distance). We provide an extensive theoretical analysis of our algorithms, obtaining new results in structural graph theory as a byproduct. We also perform exhaustive empirical evaluation, including on-surface interpolation for rigid and deformable objects (particularly for mesh-dynamics modeling), Wasserstein distance computations for point clouds, and the Gromov-Wasserstein variant.
    Variational Autoencoder Learns Better Feature Representations for EEG-based Obesity Classification. (arXiv:2302.00789v1 [cs.LG])
    Obesity is a common issue in modern societies today that can lead to various diseases and significantly reduced quality of life. Currently, research has been conducted to investigate resting state EEG (electroencephalogram) signals with an aim to identify possible neurological characteristics associated with obesity. In this study, we propose a deep learning-based framework to extract the resting state EEG features for obese and lean subject classification. Specifically, a novel variational autoencoder framework is employed to extract subject-invariant features from the raw EEG signals, which are then classified by a 1-D convolutional neural network. Compared with conventional machine learning and deep learning methods, we demonstrate the superiority of using VAE for feature extraction, as reflected by the significantly improved classification accuracies, better visualizations and reduced impurity measures in the feature representations. Future work can be directed to gaining an in-depth understanding regarding the spatial patterns that have been learned by the proposed model from a neurological view, as well as improving the interpretability of the proposed model by allowing it to uncover any temporal-related information.
    Using Machine Learning to Develop Smart Reflex Testing Protocols. (arXiv:2302.00794v1 [cs.LG])
    Objective: Reflex testing protocols allow clinical laboratories to perform second line diagnostic tests on existing specimens based on the results of initially ordered tests. Reflex testing can support optimal clinical laboratory test ordering and diagnosis. In current clinical practice, reflex testing typically relies on simple "if-then" rules; however, this limits their scope since most test ordering decisions involve more complexity than a simple rule will allow. Here, using the analyte ferritin as an example, we propose an alternative machine learning-based approach to "smart" reflex testing with a wider scope and greater impact than traditional rule-based approaches. Methods: Using patient data, we developed a machine learning model to predict whether a patient getting CBC testing will also have ferritin testing ordered, consider applications of this model to "smart" reflex testing, and evaluate the model by comparing its performance to possible rule-based approaches. Results: Our underlying machine learning models performed moderately well in predicting ferritin test ordering and demonstrated greater suitability to reflex testing than rule-based approaches. Using chart review, we demonstrate that our model may improve ferritin test ordering. Finally, as a secondary goal, we demonstrate that ferritin test results are missing not at random (MNAR), a finding with implications for unbiased imputation of missing test results. Conclusions: Machine learning may provide a foundation for new types of reflex testing with enhanced benefits for clinical diagnosis and laboratory utilization management.
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.
    Analysis of Biomass Sustainability Indicators from a Machine Learning Perspective. (arXiv:2302.00828v1 [cs.AI])
    Plant biomass estimation is critical due to the variability of different environmental factors and crop management practices associated with it. The assessment is largely impacted by the accurate prediction of different environmental sustainability indicators. A robust model to predict sustainability indicators is a must for the biomass community. This study proposes a robust model for biomass sustainability prediction by analyzing sustainability indicators using machine learning models. The prospect of ensemble learning was also investigated to analyze the regression problem. All experiments were carried out on crop residue data from the state of Ohio. Ten machine learning models, namely, linear regression, ridge regression, multilayer perceptron, k-nearest neighbors, support vector machine, decision tree, gradient boosting, random forest, stacking and voting, were analyzed to estimate three biomass sustainability indicators, namely soil erosion factor, soil conditioning index, and organic matter factor. The performance of the model was assessed using cross-correlation (R2), root mean squared error and mean absolute error metrics. The results showed that Random Forest was the best performing model to assess sustainability indicators. The analyzed model can now serve as a guide for assessing sustainability indicators in real time.
    Empirical Analysis of the AdaBoost's Error Bound. (arXiv:2302.00880v1 [cs.LG])
    Understanding the accuracy limits of machine learning algorithms is essential for data scientists to properly measure performance so they can continually improve their models' predictive capabilities. This study empirically verified the error bound of the AdaBoost algorithm for both synthetic and real-world data. The results show that the error bound holds up in practice, demonstrating its efficiency and importance to a variety of applications. The corresponding source code is available at https://github.com/armanbolatov/adaboost_error_bound.
    Noncommutative $C^*$-algebra Net: Learning Neural Networks with Powerful Product Structure in $C^*$-algebra. (arXiv:2302.01191v1 [math.OA])
    We propose a new generalization of neural networks with noncommutative $C^*$-algebra. An important feature of $C^*$-algebras is their noncommutative structure of products, but the existing $C^*$-algebra net frameworks have only considered commutative $C^*$-algebras. We show that this noncommutative structure of $C^*$-algebras induces powerful effects in learning neural networks. Our framework has a wide range of applications, such as learning multiple related neural networks simultaneously with interactions and learning invariant features with respect to group actions. We also show the validity of our framework numerically, which illustrates its potential power.
    Collaborating with language models for embodied reasoning. (arXiv:2302.00763v1 [cs.LG])
    Reasoning in a complex and ambiguous environment is a key goal for Reinforcement Learning (RL) agents. While some sophisticated RL agents can successfully solve difficult tasks, they require a large amount of training data and often struggle to generalize to new unseen environments and new tasks. On the other hand, Large Scale Language Models (LSLMs) have exhibited strong reasoning ability and the ability to adapt to new tasks through in-context learning. However, LSLMs do not inherently have the ability to interrogate or intervene on the environment. In this work, we investigate how to combine these complementary abilities in a single system consisting of three parts: a Planner, an Actor, and a Reporter. The Planner is a pre-trained language model that can issue commands to a simple embodied agent (the Actor), while the Reporter communicates with the Planner to inform its next command. We present a set of tasks that require reasoning, test this system's ability to generalize zero-shot and investigate failure cases, and demonstrate how components of this system can be trained with reinforcement-learning to improve performance.
    Molecular Geometry-aware Transformer for accurate 3D Atomic System modeling. (arXiv:2302.00855v1 [q-bio.MN])
    Molecular dynamic simulations are important in computational physics, chemistry, material, and biology. Machine learning-based methods have shown strong abilities in predicting molecular energy and properties and are much faster than DFT calculations. Molecular energy is at least related to atoms, bonds, bond angles, torsion angles, and nonbonding atom pairs. Previous Transformer models only use atoms as inputs which lack explicit modeling of the aforementioned factors. To alleviate this limitation, we propose Moleformer, a novel Transformer architecture that takes nodes (atoms) and edges (bonds and nonbonding atom pairs) as inputs and models the interactions among them using rotational and translational invariant geometry-aware spatial encoding. Proposed spatial encoding calculates relative position information including distances and angles among nodes and edges. We benchmark Moleformer on OC20 and QM9 datasets, and our model achieves state-of-the-art on the initial state to relaxed energy prediction of OC20 and is very competitive in QM9 on predicting quantum chemical properties compared to other Transformer and Graph Neural Network (GNN) methods which proves the effectiveness of the proposed geometry-aware spatial encoding in Moleformer.
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
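    For context, the sketch below shows the standard positive random feature estimator of the softmax kernel exp(x·y), of the kind used in earlier Performer-style attention. It is not the parameterized FAVOR# construction described in the abstract; the function names, vectors, and sample size are assumptions made for this illustration.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to strictly positive features exp(w.x - |x|^2 / 2) / sqrt(m)."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(wi * xi for wi, xi in zip(w, x)) - sq_norm / 2)
        for w in projections
    ]


def approx_softmax_kernel(x, y, projections):
    """Unbiased Monte Carlo estimate of exp(x.y) from the feature maps."""
    phi_x = positive_random_features(x, projections)
    phi_y = positive_random_features(y, projections)
    return sum(a * b for a, b in zip(phi_x, phi_y))


rng = random.Random(0)
dim, num_features = 3, 20000
# Gaussian projection vectors w ~ N(0, I), shared by both inputs.
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

x = [0.3, -0.2, 0.1]
y = [0.1, 0.4, -0.3]
exact = math.exp(sum(a * b for a, b in zip(x, y)))
estimate = approx_softmax_kernel(x, y, projections)
print(f"exact: {exact:.4f}, estimate: {estimate:.4f}")
```

    Because every feature is positive, the estimate can never go negative, unlike trigonometric random features. The variance of this baseline estimator is the quantity that the abstract's parameterized features are designed to reduce.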
    Rethinking Warm-Starts with Predictions: Learning Predictions Close to Sets of Optimal Solutions for Faster $\text{L}$-/$\text{L}^\natural$-Convex Function Minimization. (arXiv:2302.00928v1 [cs.LG])
    An emerging line of work has shown that machine-learned predictions are useful to warm-start algorithms for discrete optimization problems, such as bipartite matching. Previous studies have shown time complexity bounds proportional to some distance between a prediction and an optimal solution, which we can approximately minimize by learning predictions from past optimal solutions. However, such guarantees may not be meaningful when multiple optimal solutions exist. Indeed, the dual problem of bipartite matching and, more generally, $\text{L}$-/$\text{L}^\natural$-convex function minimization have arbitrarily many optimal solutions, making such prediction-dependent bounds arbitrarily large. To resolve this theoretically critical issue, we present a new warm-start-with-prediction framework for $\text{L}$-/$\text{L}^\natural$-convex function minimization. Our framework offers time complexity bounds proportional to the distance between a prediction and the set of all optimal solutions. The main technical difficulty lies in learning predictions that are provably close to sets of all optimal solutions, for which we present an online-gradient-descent-based method. We thus give the first polynomial-time learnability of predictions that can provably warm-start algorithms regardless of multiple optimal solutions.
    Mnemosyne: Learning to Train Transformers with Transformers. (arXiv:2302.01128v1 [cs.LG])
    Training complex machine learning (ML) architectures requires a compute- and time-consuming process of selecting the right optimizer and tuning its hyper-parameters. A new paradigm of learning optimizers from data has emerged as a better alternative to hand-designed ML optimizers. We propose the Mnemosyne optimizer, which uses Performers: implicit low-rank attention Transformers. It can learn to train entire neural network architectures, including other Transformers, without any task-specific optimizer tuning. We show that Mnemosyne: (a) generalizes better than the popular LSTM optimizer, (b) in particular can successfully train Vision Transformers (ViTs) while meta-trained on standard MLPs and (c) can initialize optimizers for faster convergence in robotics applications. We believe that these results open the possibility of using Transformers to build foundational optimization models that can address the challenges of regular Transformer training. We complement our results with an extensive theoretical analysis of the compact associative memory used by Mnemosyne.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the customer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    De Novo Molecular Generation via Connection-aware Motif Mining. (arXiv:2302.01129v1 [cs.LG])
    De novo molecular generation is an essential task for science discovery. Recently, fragment-based deep generative models have attracted much research attention due to their flexibility in generating novel molecules based on existing molecule fragments. However, the motif vocabulary, i.e., the collection of frequent fragments, is usually built upon heuristic rules, which brings difficulties to capturing common substructures from large amounts of molecules. In this work, we propose a new method, MiCaM, to generate molecules based on mined connection-aware motifs. Specifically, it leverages a data-driven algorithm to automatically discover motifs from a molecule library by iteratively merging subgraphs based on their frequency. The obtained motif vocabulary consists of not only molecular motifs (i.e., the frequent fragments), but also their connection information, indicating how the motifs are connected with each other. Based on the mined connection-aware motifs, MiCaM builds a connection-aware generator, which simultaneously picks up motifs and determines how they are connected. We test our method on distribution-learning benchmarks (i.e., generating novel molecules to resemble the distribution of a given training set) and goal-directed benchmarks (i.e., generating molecules with target properties), and achieve significant improvements over previous fragment-based baselines. Furthermore, we demonstrate that our method can effectively mine domain-specific motifs for different tasks.  ( 2 min )
    QCS-SGM+: Improved Quantized Compressed Sensing With Score-Based Generative Models for General Sensing Matrices. (arXiv:2302.00919v1 [eess.SP])
    In realistic compressed sensing (CS) scenarios, the obtained measurements usually have to be quantized to a finite number of bits before transmission and/or storage, thus posing a challenge in recovery, especially for extremely coarse quantization such as 1-bit sign measurements. Recently Meng & Kabashima proposed an efficient quantized compressed sensing algorithm called QCS-SGM using the score-based generative models as an implicit prior. Thanks to the power of score-based generative models in capturing the rich structure of the prior, QCS-SGM achieves remarkably better performances than previous quantized CS methods. However, QCS-SGM is restricted to (approximately) row-orthogonal sensing matrices since otherwise the likelihood score becomes intractable. To address this challenging problem, in this paper we propose an improved version of QCS-SGM, which we call QCS-SGM+, which also works well for general matrices. The key idea is a Bayesian inference perspective of the likelihood score computation, whereby an expectation propagation algorithm is proposed to approximately compute the likelihood score. Experiments on a variety of baseline datasets demonstrate that the proposed QCS-SGM+ outperforms QCS-SGM by a large margin when sensing matrices are far from row-orthogonal.
    An Enhanced V-cycle MgNet Model for Operator Learning in Numerical Partial Differential Equations. (arXiv:2302.00938v1 [cs.LG])
    This study used a multigrid-based convolutional neural network architecture known as MgNet in operator learning to solve numerical partial differential equations (PDEs). Given the property of smoothing iterations in multigrid methods where low-frequency errors decay slowly, we introduced a low-frequency correction structure for residuals to enhance the standard V-cycle MgNet. The enhanced MgNet model can capture the low-frequency features of solutions considerably better than the standard V-cycle MgNet. The numerical results obtained using some standard operator learning tasks are better than those obtained using many state-of-the-art methods, demonstrating the efficiency of our model. Moreover, numerically, our new model is more robust when low-resolution data are used for training and high-resolution data for testing.
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.
    Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment. (arXiv:2302.00902v1 [cs.LG])
    Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception - a crucial attribute needed to extend these models to be able to interact with the real world and solve vision tasks, such as in visual-question answering and robotics. Prior works have largely connected image to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly and expensive process. In order to resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image as sequences of text tokens by directly quantizing image embeddings using a pretrained language codebook. We then apply random masking followed by a BERT model, and have the decoder reconstruct the original image from BERT predicted text token embeddings. By doing so, LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features. To the best of our knowledge, our work is the first work that uses unaligned images for multimodal tasks by leveraging the power of pretrained language models.
    CLIPood: Generalizing CLIP to Out-of-Distributions. (arXiv:2302.00864v1 [cs.LG])
    Out-of-distribution (OOD) generalization, where the model needs to handle distribution shifts from training, is a major challenge of machine learning. Recently, contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, revealing a promising path toward OOD generalization. However, to boost upon zero-shot performance, further adaptation of CLIP on downstream tasks is indispensable but undesirably degrades OOD generalization ability. In this paper, we aim at generalizing CLIP to out-of-distribution test data on downstream tasks. Beyond the two canonical OOD situations, domain shift and open class, we tackle a more general but difficult in-the-wild setting where both OOD situations may occur on the unseen test data. We propose CLIPood, a simple fine-tuning method that can adapt CLIP models to all OOD situations. To exploit semantic relations between classes from the text modality, CLIPood introduces a new training objective, margin metric softmax (MMS), with class adaptive margins for fine-tuning. Moreover, to incorporate both the pre-trained zero-shot model and the fine-tuned task-adaptive model, CLIPood proposes a new Beta moving average (BMA) to maintain a temporal ensemble according to Beta distribution. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
    MTP-GO: Graph-Based Probabilistic Multi-Agent Trajectory Prediction with Neural ODEs. (arXiv:2302.00735v1 [cs.RO])
    Enabling resilient autonomous motion planning requires robust predictions of surrounding road users' future behavior. In response to this need and the associated challenges, we introduce our model, titled MTP-GO. The model encodes the scene using temporal graph neural networks to produce the inputs to an underlying motion model. The motion model is implemented using neural ordinary differential equations where the state-transition functions are learned with the rest of the model. Multi-modal probabilistic predictions are provided by combining the concept of mixture density networks and Kalman filtering. The results illustrate the predictive capabilities of the proposed model across various data sets, outperforming several state-of-the-art methods on a number of metrics.  ( 2 min )
    LMC: Fast Training of GNNs via Subgraph Sampling with Provable Convergence. (arXiv:2302.00924v1 [cs.LG])
    The message passing-based graph neural networks (GNNs) have achieved great success in many real-world applications. However, training GNNs on large-scale graphs suffers from the well-known neighbor explosion problem, i.e., the exponentially increasing dependencies of nodes with the number of message passing layers. Subgraph-wise sampling methods -- a promising class of mini-batch training techniques -- discard messages outside the mini-batches in backward passes to avoid the neighbor explosion problem at the expense of gradient estimation accuracy. This poses significant challenges to their convergence analysis and convergence speeds, which seriously limits their reliable real-world applications. To address this challenge, we propose a novel subgraph-wise sampling method with a convergence guarantee, namely Local Message Compensation (LMC). To the best of our knowledge, LMC is the first subgraph-wise sampling method with provable convergence. The key idea of LMC is to retrieve the discarded messages in backward passes based on a message passing formulation of backward passes. By efficient and effective compensations for the discarded messages in both forward and backward passes, LMC computes accurate mini-batch gradients and thus accelerates convergence. We further show that LMC converges to first-order stationary points of GNNs. Experiments on large-scale benchmark tasks demonstrate that LMC significantly outperforms state-of-the-art subgraph-wise sampling methods in terms of efficiency.  ( 2 min )
    Implicit regularization in Heavy-ball momentum accelerated stochastic gradient descent. (arXiv:2302.00849v1 [cs.LG])
    It is well known that the finite step-size ($h$) in Gradient Descent (GD) implicitly regularizes solutions to flatter minima. A natural question to ask is "Does the momentum parameter $\beta$ play a role in implicit regularization in Heavy-ball (H.B) momentum accelerated gradient descent (GD+M)?". To answer this question, first, we show that the discrete H.B momentum update (GD+M) follows a continuous trajectory induced by a modified loss, which consists of an original loss and an implicit regularizer. Then, we show that this implicit regularizer for (GD+M) is stronger than that of (GD) by factor of $(\frac{1+\beta}{1-\beta})$, thus explaining why (GD+M) shows better generalization performance and higher test accuracy than (GD). Furthermore, we extend our analysis to the stochastic version of gradient descent with momentum (SGD+M) and characterize the continuous trajectory of the update of (SGD+M) in a pointwise sense. We explore the implicit regularization in (SGD+M) and (GD+M) through a series of experiments validating our theory.
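The scaling factor quoted above is easy to tabulate. The following is a minimal sketch that pairs a textbook heavy-ball update with the ratio $(1+\beta)/(1-\beta)$ stated in the abstract; the exact form of the implicit regularizer is not given here, so only the ratio is computed. The function names and the quadratic test problem are illustrative assumptions.

```python
def regularizer_ratio(beta):
    """Ratio (1 + beta) / (1 - beta) by which, per the abstract, the implicit
    regularizer of GD+M is stronger than that of plain GD."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 + beta) / (1.0 - beta)


def heavy_ball(grad, x0, h, beta, steps):
    """Textbook heavy-ball iteration: v <- beta * v - h * grad(x); x <- x + v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - h * grad(x)
        x = x + v
    return x


if __name__ == "__main__":
    for beta in (0.0, 0.5, 0.9, 0.99):
        print(beta, regularizer_ratio(beta))
    # Minimize f(x) = x^2 / 2, whose gradient is x (assumed toy problem).
    print(heavy_ball(lambda x: x, x0=1.0, h=0.1, beta=0.9, steps=500))
```

With $\beta = 0$ the ratio is 1, recovering plain GD, and it grows without bound as $\beta$ approaches 1.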
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.
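To make the idea of a projection onto an $\ell_1$-constrained set concrete, here is a minimal sketch of the classical sort-based Euclidean projection of a coefficient vector onto an $\ell_1$ ball. This is an assumption chosen for illustration: the abstract does not specify how its projection layer is computed, and the actual layer may differ. The function name and the `radius` parameter are hypothetical.

```python
import math


def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : sum(|w_i|) <= radius}."""
    if radius <= 0:
        raise ValueError("radius must be positive")
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= radius:
        return list(v)  # already feasible, nothing to do
    # Find the soft-threshold theta from the sorted magnitudes.
    u = sorted(abs_v, reverse=True)
    cumsum = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - radius) / j
        if uj - t > 0:
            theta = t  # the last j satisfying the condition sets the threshold
    # Soft-threshold each coordinate, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


if __name__ == "__main__":
    print(project_l1_ball([3.0, -1.0, 0.5], radius=2.0))
```

The projection sets small coefficients exactly to zero, which is why a constraint of this kind yields sparse coefficient vectors.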
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.
    Synthesizing Physical Character-Scene Interactions. (arXiv:2302.00883v1 [cs.GR])
    Movement is how people interact with and affect their environment. For realistic character animation, it is necessary to synthesize such interactions between virtual characters and their surroundings. Despite recent progress in character animation using machine learning, most systems focus on controlling an agent's movements in fairly simple and homogeneous environments, with limited interactions with other objects. Furthermore, many previous approaches that synthesize human-scene interactions require significant manual labeling of the training data. In contrast, we present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Our method learns scene interaction behaviors from large unstructured motion datasets, without manual annotation of the motion data. These scene interactions are learned using an adversarial discriminator that evaluates the realism of a motion within the context of a scene. The key novelty involves conditioning both the discriminator and the policy networks on scene context. We demonstrate the effectiveness of our approach through three challenging scene interaction tasks: carrying, sitting, and lying down, which require coordination of a character's movements in relation to objects in the environment. Our policies learn to seamlessly transition between different behaviors like idling, walking, and sitting. By randomizing the properties of the objects and their placements during training, our method is able to generalize beyond the objects and scenarios depicted in the training dataset, producing natural character-scene interactions for a wide variety of object shapes and placements. The approach takes physics-based character motion generation a step closer to broad applicability.
    STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition. (arXiv:2302.01172v1 [cs.LG])
    Recent innovations in hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, while incurring a non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate that this gap originates from a poorly estimated second moment (i.e. variance) in the Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point between the two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.  ( 2 min )
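For readers unfamiliar with the N:M format, the following minimal sketch builds a magnitude-based N:M mask: within every group of M consecutive weights, only the N largest in absolute value are kept. This illustrates the sparsity pattern only; it is a simple magnitude heuristic, not the STEP recipe or any learned mask. The function name and defaults (2:4) are illustrative assumptions.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude weights in each
    consecutive group of m weights (magnitude heuristic, assumed here)."""
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest magnitudes; ties keep the earlier index.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


if __name__ == "__main__":
    w = [0.1, -0.9, 0.5, 0.05, 2.0, 0.0, -3.0, 1.0]
    print(nm_mask(w, n=2, m=4))
```

Each group of four weights keeps exactly two non-zeros, which is the 2:4 pattern that the abstract's hardware example accelerates.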
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    Recurrent Graph Convolutional Networks for Spatiotemporal Prediction of Snow Accumulation Using Airborne Radar. (arXiv:2302.00817v1 [cs.LG])
    The accurate prediction and estimation of annual snow accumulation has grown in importance as we deal with the effects of climate change and the increase of global atmospheric temperatures. Airborne radar sensors, such as the Snow Radar, are able to measure accumulation rate patterns at a large-scale and monitor the effects of ongoing climate change on Greenland's precipitation and run-off. The Snow Radar's use of an ultra-wide bandwidth enables a fine vertical resolution that helps in capturing internal ice layers. Given the amount of snow accumulation in previous years using the radar data, in this paper, we propose a machine learning model based on recurrent graph convolutional networks to predict the snow accumulation in recent consecutive years at a certain location. We found that the model performs better and with more consistency than equivalent nongeometric and nontemporal models.
    Disentanglement of Latent Representations via Sparse Causal Interventions. (arXiv:2302.00869v1 [cs.LG])
    The process of generating data such as images is controlled by independent and unknown factors of variation. The retrieval of these variables has been studied extensively in the disentanglement, causal representation learning, and independent component analysis fields. Recently, approaches merging these domains together have shown great success. Instead of directly representing the factors of variation, the problem of disentanglement can be seen as finding the interventions on one image that yield a change to a single factor. Following this assumption, we introduce a new method for disentanglement inspired by causal dynamics that combines causality theory with vector-quantized variational autoencoders. Our model considers the quantized vectors as causal variables and links them in a causal graph. It performs causal interventions on the graph and generates atomic transitions affecting a unique factor of variation in the image. We also introduce a new task of action retrieval that consists of finding the action responsible for the transition between two images. We test our method on standard synthetic and real-world disentanglement datasets. We show that it can effectively disentangle the factors of variation and perform precise interventions on high-level semantic attributes of an image without affecting its quality, even with imbalanced data distributions.
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolutional neural network (CNN) cascaded with a long short-term memory (LSTM) network to adapt to the characteristics of the collected time-series signals. To make full use of their advantages, a multilayered LSTM model is first used to enrich features, which are then concatenated with the raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with far fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), LightGBM (LGB) and CNN-based methods.
    Meta Learning in Decentralized Neural Networks: Towards More General AI. (arXiv:2302.01020v1 [cs.LG])
    Meta-learning usually refers to a learning algorithm that learns from other learning algorithms. The problem of uncertainty in the predictions of neural networks shows that the world is only partially predictable and a learned neural network cannot generalize to its ever-changing surrounding environments. Therefore, the question is how a predictive model can represent multiple predictions simultaneously. We aim to provide a fundamental understanding of learning to learn in the context of Decentralized Neural Networks (Decentralized NNs) and we believe this is one of the most important questions and prerequisites to building an autonomous intelligent machine. To this end, we shall demonstrate several pieces of evidence for tackling the problems above with Meta Learning in Decentralized NNs. In particular, we will present three different approaches to building such a decentralized learning system: (1) learning from many replica neural networks, (2) building the hierarchy of neural networks for different functions, and (3) leveraging different modality experts to learn cross-modal representations.
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    Privacy Risk for anisotropic Langevin dynamics using relative entropy bounds. (arXiv:2302.00766v1 [cs.LG])
    The privacy preserving properties of Langevin dynamics with additive isotropic noise have been extensively studied. However, the isotropic noise assumption is very restrictive: (a) when adding noise to existing learning algorithms to preserve privacy and maintain the best possible accuracy one should take into account the relative magnitude of the outputs and their correlations; (b) popular algorithms such as stochastic gradient descent (and their continuous time limits) appear to possess anisotropic covariance properties. To study the privacy risks for the anisotropic noise case, one requires general results on the relative entropy between the laws of two Stochastic Differential Equations with different drifts and diffusion coefficients. Our main contribution is to establish such a bound using stability estimates for solutions to the Fokker-Planck equations via functional inequalities. With additional assumptions, the relative entropy bound implies an $(\epsilon,\delta)$-differential privacy bound. We discuss the practical implications of our bound related to privacy risk in different contexts. Finally, the benefits of anisotropic noise are illustrated using numerical results on optimising a quadratic loss or calibrating a neural network.
    A Survey on Compositional Generalization in Applications. (arXiv:2302.01067v1 [cs.AI])
    The field of compositional generalization is currently experiencing a renaissance in AI, as novel problem settings and algorithms motivated by various practical applications are being introduced, building on top of the classical compositional generalization problem. This article aims to provide a comprehensive review of top recent developments in multiple real-life applications of the compositional generalization. Specifically, we introduce a taxonomy of common applications and summarize the state-of-the-art for each of those domains. Furthermore, we identify important current trends and provide new perspectives pertaining to the future of this burgeoning field.  ( 2 min )
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.
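The geometric argument in this abstract can be illustrated with a small numeric sketch. It is not the authors' experiment: the three-class probability vectors below are made-up values, and `ensemble_mean` is a hypothetical helper that simply averages member predictions.

```python
def ensemble_mean(predictions):
    """Average a list of per-member class-probability vectors."""
    n_members = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n_members for k in range(n_classes)]

# Three confident members that agree: each sits near the same simplex vertex.
agree = [[0.98, 0.01, 0.01]] * 3

# Three equally confident members that disagree (maximally "diverse").
diverse = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]

# Confidence of the ensemble = largest averaged class probability.
conf_agree = max(ensemble_mean(agree))
conf_diverse = max(ensemble_mean(diverse))
print(conf_agree, conf_diverse)
```

In this toy case the agreeing ensemble stays near a vertex, while the diverse one is pulled to the centre of the simplex, which is the loss of confidence the abstract describes.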
    Generative Modeling with Quantum Neurons. (arXiv:2302.00788v1 [quant-ph])
    The recently proposed Quantum Neuron Born Machine (QNBM) has demonstrated quality initial performance as the first quantum generative machine learning (ML) model proposed with non-linear activations. However, previous investigations have been limited in scope with regards to the model's learnability and simulatability. In this work, we make a considerable leap forward by providing an extensive deep dive into the QNBM's potential as a generative model. We first demonstrate that the QNBM's network representation makes it non-trivial to be classically efficiently simulated. Following this result, we showcase the model's ability to learn (express and train on) a wider set of probability distributions, and benchmark the performance against a classical Restricted Boltzmann Machine (RBM). The QNBM is able to outperform this classical model on all distributions, even for the most optimally trained RBM among our simulations. Specifically, the QNBM outperforms the RBM with an improvement factor of 75.3x, 6.4x, and 3.5x for the discrete Gaussian, cardinality-constrained, and Bars and Stripes distributions respectively. Lastly, we conduct an initial investigation into the model's generalization capabilities and use a KL test to show that the model is able to approximate the ground truth probability distribution more closely than the training distribution when given access to a limited amount of data. Overall, we put forth a stronger case in support of using the QNBM for larger-scale generative tasks.
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not well handle large-scale and high-dimensional heterogeneous data. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighing) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, it also comes with a lot of new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.
    The Weisfeiler-Lehman Distance: Reinterpretation and Connection with GNNs. (arXiv:2302.00713v1 [cs.LG])
    In this paper, we present a novel interpretation of the so-called Weisfeiler-Lehman (WL) distance, introduced by Chen et al. (2022), using concepts from stochastic processes. The WL distance aims at comparing graphs with node features, has the same discriminative power as the classic Weisfeiler-Lehman graph isomorphism test and has deep connections to the Gromov-Wasserstein distance. This new interpretation connects the WL distance to the literature on distances for stochastic processes, which also makes the interpretation of the distance more accessible and intuitive. We further explore the connections between the WL distance and certain Message Passing Neural Networks, and discuss the implications of the WL distance for understanding the Lipschitz property and the universal approximation results for these networks.
    ImageNomer: developing an fMRI and omics visualization tool to detect racial bias in functional connectivity. (arXiv:2302.00767v1 [q-bio.PE])
    It can be difficult to identify trends and perform quality control in large, high-dimensional fMRI or omics datasets. To remedy this, we develop ImageNomer, a data visualization and analysis tool that allows inspection of both subject-level and cohort-level features. The tool allows visualization of phenotype correlation with functional connectivity (FC), partial connectivity (PC), dictionary components (PCA and our own method), and genomic data (single-nucleotide polymorphisms, SNPs). In addition, it allows visualization of weights from arbitrary ML models. ImageNomer is built with a Python backend and a Vue frontend. We validate ImageNomer using the Philadelphia Neurodevelopmental Cohort (PNC) dataset, which contains multitask fMRI and SNP data of healthy adolescents. Using correlation, greedy selection, or model weights, we find that a set of 10 FC features can explain 15% of variation in age, compared to 35% for the full 34,716 feature model. The four most significant FCs are either between bilateral default mode network (DMN) regions or spatially proximal subcortical areas. Additionally, we show that whereas both FC (fMRI) and SNPs (genomic) features can account for 10-15% of intelligence variation, this predictive ability disappears when controlling for race. We find that FC features can be used to predict race with 85% accuracy, compared to 78% accuracy for sex prediction. Using ImageNomer, this work casts doubt on the possibility of finding unbiased intelligence-related features in fMRI and SNPs of healthy adolescents.
    Riemannian Stochastic Approximation for Minimizing Tame Nonsmooth Objective Functions. (arXiv:2302.00709v1 [cs.LG])
    In many learning applications, the parameters in a model are structurally constrained in a way that can be modeled as them lying on a Riemannian manifold. Riemannian optimization, wherein procedures to enforce an iterative minimizing sequence to be constrained to the manifold, is used to train such models. At the same time, tame geometry has become a significant topological description of nonsmooth functions that appear in the landscapes of training neural networks and other important models with structural compositions of continuous nonlinear functions with nonsmooth maps. In this paper, we study the properties of such stratifiable functions on a manifold and the behavior of retracted stochastic gradient descent, with diminishing stepsizes, for minimizing such functions.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Approximating the Shapley Value without Marginal Contributions. (arXiv:2302.00736v1 [cs.LG])
    The Shapley value is arguably the most popular approach for assigning a meaningful contribution value to players in a cooperative game, which has recently been used intensively in various areas of machine learning, most notably in explainable artificial intelligence. The meaningfulness is due to axiomatic properties that only the Shapley value satisfies, which, however, comes at the expense of an exact computation growing exponentially with the number of agents. Accordingly, a number of works are devoted to the efficient approximation of the Shapley values, all of which revolve around the notion of an agent's marginal contribution. In this paper, we propose with SVARM and Stratified SVARM two parameter-free and domain-independent approximation algorithms based on a representation of the Shapley value detached from the notion of marginal contributions. We prove unmatched theoretical guarantees regarding their approximation quality and provide satisfying empirical results.
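For context, the classical marginal-contribution definition that the abstract contrasts itself with can be sketched as follows. This is not SVARM or Stratified SVARM; it is the textbook exact computation, shown only to make the exponential cost concrete. The game `toy_value` and the player names are hypothetical.

```python
from itertools import permutations
from math import factorial

def exact_shapley(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over all n! orderings (tractable only for very small games)."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

# Hypothetical game: a coalition is worth 1 only if it contains "a"
# together with at least one other player.
def toy_value(coalition):
    return 1.0 if "a" in coalition and len(coalition) >= 2 else 0.0

shapley = exact_shapley(["a", "b", "c"], toy_value)
print(shapley)
```

The loop visits every ordering of the players, so the cost grows factorially with the number of agents. That cost is why approximation schemes are needed in practice.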
    Neural Networks for Symbolic Regression. (arXiv:2302.00773v1 [cs.NE])
    Many real-world systems can be described by mathematical formulas that are human-comprehensible, easy to analyze and can be helpful in explaining the system's behaviour. Symbolic regression is a method that generates nonlinear models from data in the form of analytic expressions. Historically, symbolic regression has been predominantly realized using genetic programming, a method that iteratively evolves a population of candidate solutions that are sampled by genetic operators crossover and mutation. This gradient-free evolutionary approach suffers from several deficiencies: it does not scale well with the number of variables and samples in the training data, models tend to grow in size and complexity without an adequate accuracy gain, and it is hard to fine-tune the inner model coefficients using just genetic operators. Recently, neural networks have been applied to learn the whole analytic formula, i.e., its structure as well as the coefficients, by means of gradient-based optimization algorithms. We propose a novel neural network-based symbolic regression method that constructs physically plausible models based on limited training data and prior knowledge about the system. The method employs an adaptive weighting scheme to effectively deal with multiple loss function terms and an epoch-wise learning process to reduce the chance of getting stuck in poor local optima. Furthermore, we propose a parameter-free method for choosing the model with the best interpolation and extrapolation performance out of all models generated through the whole learning process. We experimentally evaluate the approach on the TurtleBot 2 mobile robot, the magnetic manipulation system, the equivalent resistance of two resistors in parallel, and the anti-lock braking system. The results clearly show the potential of the method to find sparse and accurate models that comply with the prior knowledge provided.
    A Survey of Deep Learning: From Activations to Transformers. (arXiv:2302.00722v1 [cs.LG])
    Deep learning has made tremendous progress in the last decade. A key success factor is the large number of architectures, layers, objectives, and optimization techniques that have emerged in recent years. They include a myriad of variants related to attention, normalization, skip connections, transformers, and self-supervised learning schemes, to name a few. We provide a comprehensive overview of the most important recent works in these areas for those who already have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential recent works helps researchers form new connections between diverse areas of deep learning.
    Universal Soldier: Using Universal Adversarial Perturbations for Detecting Backdoor Attacks. (arXiv:2302.00747v1 [cs.LG])
    Deep learning models achieve excellent performance in numerous machine learning tasks. Yet, they suffer from security-related issues such as adversarial examples and poisoning (backdoor) attacks. A deep learning model may be poisoned by training with backdoored data or by modifying inner network parameters. Then, a backdoored model performs as expected when receiving a clean input, but it misclassifies when receiving a backdoored input stamped with a pre-designed pattern called "trigger". Unfortunately, it is difficult to distinguish between clean and backdoored models without prior knowledge of the trigger. This paper proposes a backdoor detection method by utilizing a special type of adversarial attack, universal adversarial perturbation (UAP), and its similarities with a backdoor trigger. We observe an intuitive phenomenon: UAPs generated from backdoored models need fewer perturbations to mislead the model than UAPs from clean models. UAPs of backdoored models tend to exploit the shortcut from all classes to the target class, built by the backdoor trigger. We propose a novel method called Universal Soldier for Backdoor detection (USB) and reverse engineering potential backdoor triggers via UAPs. Experiments on 345 models trained on several datasets show that USB effectively detects the injected backdoor and provides comparable or better results than state-of-the-art methods.
    Domain Generalization Emerges from Dreaming. (arXiv:2302.00980v1 [cs.CV])
    Recent studies have proven that DNNs, unlike human vision, tend to exploit texture information rather than shape. Such texture bias is one of the factors for the poor generalization performance of DNNs. We observe that the texture bias negatively affects not only in-domain generalization but also out-of-distribution generalization, i.e., Domain Generalization. Motivated by the observation, we propose a new framework to reduce the texture bias of a model by a novel optimization-based data augmentation, dubbed Stylized Dream. Our framework utilizes adaptive instance normalization (AdaIN) to augment the style of an original image yet preserve the content. We then adopt a regularization loss to predict consistent outputs between Stylized Dream and original images, which encourages the model to learn shape-based representations. Extensive experiments show that the proposed method achieves state-of-the-art performance in out-of-distribution settings on public benchmark datasets: PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet.  ( 2 min )
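The AdaIN operation mentioned above re-normalizes each content feature channel to the mean and standard deviation of the corresponding style channel. A minimal pure-Python sketch of that per-channel statistic swap follows. It is an illustration of standard AdaIN only, not of the Stylized Dream optimization, and it makes two simplifying assumptions: channels are flat lists of floats, and `eps` is added to the standard deviation rather than to the variance.

```python
from statistics import fmean, pstdev

def adain_channel(content, style, eps=1e-5):
    """Shift and scale one content channel to match the style channel's mean/std."""
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    return [sigma_s * (x - mu_c) / (sigma_c + eps) + mu_s for x in content]

def adain(content_feats, style_feats):
    """Apply AdaIN channel by channel; both inputs are lists of channels."""
    return [adain_channel(c, s) for c, s in zip(content_feats, style_feats)]

# Made-up single-channel features.
content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 20.0, 30.0, 40.0]]
stylized = adain(content, style)
print(stylized[0])
```

The output keeps the relative ordering of the content activations (the "content") while adopting the style channel's first- and second-order statistics (the "style").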
    Real-Time Evaluation in Online Continual Learning: A New Paradigm. (arXiv:2302.01047v1 [cs.LG])
    Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We hypothesize that under this new evaluation paradigm, computationally demanding CL approaches may perform poorly on streams with a varying distribution. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.  ( 2 min )
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are getting more attention for learning a robust policy which is less sensitive to environment changes. There are an increasing number of works analyzing sample-efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, as the alternative form allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, which reduces the memory requirement to $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ without using the oracle. Finally, we validate our theoretical findings via numerical experiments and show that the alternative form of robust MDPs can be solved efficiently.
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Reinforcement learning-based estimation for partial differential equations. (arXiv:2302.01189v1 [cs.LG])
    In systems governed by nonlinear partial differential equations such as fluid flows, the design of state estimators such as Kalman filters relies on a reduced-order model (ROM) that projects the original high-dimensional dynamics onto a computationally tractable low-dimensional space. However, ROMs are prone to large errors, which negatively affects the performance of the estimator. Here, we introduce the reinforcement learning reduced-order estimator (RL-ROE), a ROM-based estimator in which the correction term that takes in the measurements is given by a nonlinear policy trained through reinforcement learning. The nonlinearity of the policy enables the RL-ROE to compensate efficiently for errors of the ROM, while still taking advantage of the imperfect knowledge of the dynamics. Using examples involving the Burgers and Navier-Stokes equations, we show that in the limit of very few sensors, the trained RL-ROE outperforms a Kalman filter designed using the same ROM. Moreover, it yields accurate high-dimensional state estimates for reference trajectories corresponding to various physical parameter values, without direct knowledge of the latter.  ( 2 min )
    Neural Common Neighbor with Completion for Link Prediction. (arXiv:2302.00890v1 [cs.LG])
    Despite its outstanding performance in various graph tasks, vanilla Message Passing Neural Network (MPNN) usually fails in link prediction tasks, as it only uses representations of two individual target nodes and ignores the pairwise relation between them. To capture the pairwise relations, some models add manual features to the input graph and use the output of MPNN to produce pairwise representations. In contrast, others directly use manual features as pairwise representations. Though this simplification avoids applying a GNN to each link individually and thus improves scalability, these models still have much room for performance improvement due to the hand-crafted and unlearnable pairwise features. To upgrade performance while maintaining scalability, we propose Neural Common Neighbor (NCN), which uses learnable pairwise representations. To further boost NCN, we study the unobserved link problem. The incompleteness of the graph is ubiquitous and leads to distribution shifts between the training and test set, loss of common neighbor information, and performance degradation of models. Therefore, we propose two intervention methods: common neighbor completion and target link removal. Combining the two methods with NCN, we propose Neural Common Neighbor with Completion (NCNC). NCN and NCNC outperform recent strong baselines by large margins. NCNC achieves state-of-the-art performance in link prediction tasks.  ( 2 min )
    Energy Efficient Training of SNN using Local Zeroth Order Method. (arXiv:2302.00910v1 [cs.LG])
    Spiking neural networks are becoming increasingly popular for their low energy requirement in real-world tasks with accuracy comparable to the traditional ANNs. SNN training algorithms face the loss of gradient information and non-differentiability due to the Heaviside function in minimizing the model loss over model parameters. To circumvent the problem, the surrogate method uses a differentiable approximation of the Heaviside in the backward pass, while the forward pass uses the Heaviside as the spiking function. We propose to use the zeroth order technique at the neuron level to resolve this dichotomy and use it within the automatic differentiation tool. As a result, we establish a theoretical connection between the proposed local zeroth-order technique and the existing surrogate methods and vice-versa. The proposed method naturally lends itself to energy-efficient training of SNNs on GPUs. Experimental results with neuromorphic datasets show that such implementation requires fewer than 1 percent of neurons to be active in the backward pass, resulting in a 100x speed-up in the backward computation time. Our method offers better generalization compared to the state-of-the-art energy-efficient technique while maintaining similar efficiency.  ( 2 min )
    Vectorized Scenario Description and Motion Prediction for Scenario-Based Testing. (arXiv:2302.01161v1 [cs.LG])
    Automated vehicles (AVs) are tested in diverse scenarios, typically specified by parameters such as velocities, distances, or curve radii. To describe scenarios uniformly independent of such parameters, this paper proposes a vectorized scenario description defined by the road geometry and vehicles' trajectories. Data of this form are generated for three scenarios, merged, and used to train the motion prediction model VectorNet, which can then predict an AV's trajectory for unseen scenarios. Predicting scenario evaluation metrics, VectorNet partially achieves lower errors than regression models that separately process the three scenarios' data. However, for comprehensive generalization, sufficient variance in the training data must be ensured. Thus, contrary to existing methods, our proposed method can merge diverse scenarios' data and exploit spatial and temporal nuances in the vectorized scenario description. As a result, data from specified test scenarios and real-world scenarios can be compared and combined for (predictive) analyses and scenario selection.  ( 2 min )
    Deep COVID-19 Forecasting for Multiple States with Data Augmentation. (arXiv:2302.01155v1 [cs.LG])
    In this work, we propose a deep learning approach to forecasting state-level COVID-19 trends of weekly cumulative death in the United States (US) and incident cases in Germany. This approach includes a transformer model, an ensemble method, and a data augmentation technique for time series. We arrange the inputs of the transformer in such a way that predictions for different states can attend to the trends of the others. To overcome the issue of scarcity of training data for this COVID-19 pandemic, we have developed a novel data augmentation technique to generate useful data for training. More importantly, the generated data can also be used for model validation. As such, it has a two-fold advantage: 1) more actual observations can be used for training, and 2) the model can be validated on data which has distribution closer to the expected situation. Our model has achieved some of the best state-level results on the COVID-19 Forecast Hub for the US and for Germany.  ( 2 min )
    Predicting the Silent Majority on Graphs: Knowledge Transferable Graph Neural Network. (arXiv:2302.00873v1 [cs.LG])
    Graphs consisting of vocal nodes ("the vocal minority") and silent nodes ("the silent majority"), namely VS-Graph, are ubiquitous in the real world. The vocal nodes tend to have abundant features and labels. In contrast, silent nodes only have incomplete features and rare labels, e.g., on Twitter's social network, descriptions and political tendencies are abundant for politicians (vocal) but not for ordinary people (silent). Predicting the silent majority remains a crucial yet challenging problem. However, most existing message-passing based GNNs assume that all nodes belong to the same domain, without considering the missing features and distribution-shift between domains, leading to poor ability to deal with VS-Graph. To combat the above challenges, we propose Knowledge Transferable Graph Neural Network (KT-GNN), which models distribution shifts during message passing and representation learning by transferring knowledge from vocal nodes to silent nodes. Specifically, we design the domain-adapted "feature completion and message passing mechanism" for node representation learning while preserving domain difference. This is followed by a knowledge-transferable classifier based on KL-divergence. Comprehensive experiments on real-world scenarios (i.e., company financial risk assessment and political elections) demonstrate the superior performance of our method. Our source code has been open sourced.
    Site-specific Deep Learning Path Loss Models based on the Method of Moments. (arXiv:2302.01052v1 [cs.LG])
    This paper describes deep learning models based on convolutional neural networks applied to the problem of predicting EM wave propagation over rural terrain. A surface integral equation formulation, solved with the method of moments and accelerated using the Fast Far Field approximation, is used to generate synthetic training data which comprises path loss computed over randomly generated 1D terrain profiles. These are used to train two networks, one based on fractal profiles and one based on profiles generated using a Gaussian process. The models show excellent agreement when applied to test profiles generated using the same statistical process used to create the training data and very good accuracy when applied to real life problems.
    adSformers: Personalization from Short-Term Sequences and Diversity of Representations in Etsy Ads. (arXiv:2302.01255v1 [cs.LG])
    In this article, we present our approach to personalizing Etsy Ads through encoding and learning from short-term (one-hour) sequences of user actions and diverse representations. To this end we introduce a three-component adSformer diversifiable personalization module (ADPM) and illustrate how we use this module to derive a short-term dynamic user representation and personalize the Click-Through Rate (CTR) and Post-Click Conversion Rate (PCCVR) models used in sponsored search (ad) ranking. The first component of the ADPM is a custom transformer encoder that learns the inherent structure from the sequence of actions. ADPM's second component enriches the signal through visual, multimodal and textual pretrained representations. Lastly, the third ADPM component includes a "learned" on the fly average pooled representation. The ADPM-personalized CTR and PCCVR models, henceforth referred to as adSformer CTR and adSformer PCCVR, outperform the CTR and PCCVR production baselines by $+6.65\%$ and $+12.70\%$, respectively, in offline Precision-Recall Area Under the Curve (PR AUC). At the time of this writing, following the online gains in A/B tests, such as $+5.34\%$ in return on ad spend, a seller success metric, we are ramping up the adSformers to $100\%$ traffic in Etsy Ads.
    ReLOAD: Reinforcement Learning with Optimistic Ascent-Descent for Last-Iterate Convergence in Constrained MDPs. (arXiv:2302.01275v1 [cs.LG])
    In recent years, Reinforcement Learning (RL) has been applied to real-world problems with increasing success. Such applications often require placing constraints on the agent's behavior. Existing algorithms for constrained RL (CRL) rely on gradient descent-ascent, but this approach comes with a caveat. While these algorithms are guaranteed to converge on average, they do not guarantee last-iterate convergence, i.e., the current policy of the agent may never converge to the optimal solution. In practice, it is often observed that the policy alternates between satisfying the constraints and maximizing the reward, rarely accomplishing both objectives simultaneously. Here, we address this problem by introducing Reinforcement Learning with Optimistic Ascent-Descent (ReLOAD), a principled CRL method with guaranteed last-iterate convergence. We demonstrate its empirical effectiveness on a wide variety of CRL problems including discrete MDPs and continuous control. In the process we establish a benchmark of challenging CRL problems.
    Physics Constrained Motion Prediction with Uncertainty Quantification. (arXiv:2302.01060v1 [cs.RO])
    Predicting the motion of dynamic agents is a critical task for guaranteeing the safety of autonomous systems. A particular challenge is that motion prediction algorithms should obey dynamics constraints and quantify prediction uncertainty as a measure of confidence. We present a physics-constrained approach for motion prediction which uses a surrogate dynamical model to ensure that predicted trajectories are dynamically feasible. We propose a two-step integration consisting of intent and trajectory prediction subject to dynamics constraints. We also construct prediction regions that quantify uncertainty and are tailored for autonomous driving by using conformal prediction, a popular statistical tool. Physics Constrained Motion Prediction achieves a 41% better ADE, 56% better FDE, and 19% better IoU over a baseline in experiments using an autonomous racing dataset.  ( 2 min )
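    As a rough illustration of how conformal prediction turns held-out errors into a prediction region, the sketch below implements the generic split-conformal quantile rule for scalar errors. It is an assumption-laden simplification, not the construction used in the paper: the function name `conformal_radius`, the use of a single scalar error per calibration example, and the example numbers are all hypothetical.

```python
import math


def conformal_radius(calibration_errors, alpha=0.1):
    """Split-conformal radius for a target miscoverage level alpha.

    calibration_errors: non-negative errors (e.g. distances between predicted
    and true positions) measured on held-out calibration data.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest error, or infinity when
    the calibration set is too small to support the requested level.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_errors)[rank - 1]


if __name__ == "__main__":
    # Hypothetical calibration errors, in metres.
    errors = [0.4, 1.2, 0.7, 0.3, 2.1, 0.9, 0.5, 1.6, 0.8]
    radius = conformal_radius(errors, alpha=0.5)
    print("region: all points within", radius, "m of the predicted position")
```

    Under the usual exchangeability assumption between calibration and test examples, a ball of this radius around a new prediction contains the true value with probability at least 1 - alpha.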
    Randomized Greedy Learning for Non-monotone Stochastic Submodular Maximization Under Full-bandit Feedback. (arXiv:2302.01324v1 [cs.LG])
    We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose the Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.
    Bayesian Inference on Binary Spiking Networks Leveraging Nanoscale Device Stochasticity. (arXiv:2302.01302v1 [cs.NE])
    Bayesian Neural Networks (BNNs) can overcome the problem of overconfidence that plagues traditional frequentist deep neural networks, and are hence considered to be a key enabler for reliable AI systems. However, conventional hardware realizations of BNNs are resource intensive, requiring the implementation of random number generators for synaptic sampling. Owing to their inherent stochasticity during programming and read operations, nanoscale memristive devices can be directly leveraged for sampling, without the need for additional hardware resources. In this paper, we introduce a novel Phase Change Memory (PCM)-based hardware implementation for BNNs with binary synapses. The proposed architecture consists of separate weight and noise planes, in which PCM cells are configured and operated to represent the nominal values of weights and to generate the required noise for sampling, respectively. Using experimentally observed PCM noise characteristics, for the exemplary Breast Cancer Dataset classification problem, we obtain hardware accuracy and expected calibration error matching that of an 8-bit fixed-point (FxP8) implementation, with projected savings of over 9$\times$ in terms of core area transistor count.
    On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance. (arXiv:2302.01029v1 [cs.LG])
    A number of recent adaptive optimizers improve the generalisation performance of Adam by essentially reducing the variance of adaptive stepsizes to get closer to SGD with momentum. Following the above motivation, we suppress the range of the adaptive stepsizes of Adam by exploiting the layerwise gradient statistics. In particular, at each iteration, we propose to perform three consecutive operations on the second momentum v_t before using it to update a DNN model: (1): down-scaling, (2): epsilon-embedding, and (3): down-translating. The resulting algorithm is referred to as SET-Adam, where SET is a brief notation of the three operations. The down-scaling operation on v_t is performed layerwise by making use of the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experimental results show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100 while matching the best performance of the eight adaptive methods when training WGAN-GP models for image generation tasks. Furthermore, SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.  ( 2 min )
    Exposing the CSI: A Systematic Investigation of CSI-based Wi-Fi Sensing Capabilities and Limitations. (arXiv:2302.00992v1 [cs.NI])
    Thanks to the ubiquitous deployment of Wi-Fi hotspots, channel state information (CSI)-based Wi-Fi sensing can unleash game-changing applications in many fields, such as healthcare, security, and entertainment. However, despite one decade of active research on Wi-Fi sensing, most existing work only considers legacy IEEE 802.11n devices, often in particular and strictly-controlled environments. Worse yet, there is a fundamental lack of understanding of the impact on CSI-based sensing of modern Wi-Fi features, such as 160-MHz bandwidth, multiple-input multiple-output (MIMO) transmissions, and increased spectral resolution in IEEE 802.11ax (Wi-Fi 6). This work aims to shed light on the impact of Wi-Fi 6 features on the sensing performance and to create a benchmark for future research on Wi-Fi sensing. To this end, we perform an extensive CSI data collection campaign involving 3 individuals, 3 environments, and 12 activities, using Wi-Fi 6 signals. An anonymized ground truth obtained through video recording accompanies our 80-GB dataset, which contains almost two hours of CSI data from three collectors. We leverage our dataset to dissect the performance of a state-of-the-art sensing framework across different environments and individuals. Our key findings suggest that (i) MIMO transmissions and higher spectral resolution might be more beneficial than larger bandwidth for sensing applications; (ii) there is a pressing need to standardize research on Wi-Fi sensing because the path towards a truly environment-independent framework is still uncertain. To ease the experiments' replicability and address the current lack of Wi-Fi 6 CSI datasets, we release our 80-GB dataset to the community.  ( 2 min )
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.  ( 2 min )
    Combining Tree-Search, Generative Models, and Nash Bargaining Concepts in Game-Theoretic Reinforcement Learning. (arXiv:2302.00797v1 [cs.AI])
    Multiagent reinforcement learning (MARL) has benefited significantly from population-based and game-theoretic training regimes. One approach, Policy-Space Response Oracles (PSRO), employs standard reinforcement learning to compute response policies via approximate best responses and combines them via meta-strategy selection. We augment PSRO by adding a novel search procedure with generative sampling of world states, and introduce two new meta-strategy solvers based on the Nash bargaining solution. We evaluate PSRO's ability to compute approximate Nash equilibrium, and its performance in two negotiation games: Colored Trails, and Deal or No Deal. We conduct behavioral studies where human participants negotiate with our agents ($N = 346$). We find that search with generative modeling finds stronger policies during both training time and test time, enables online Bayesian co-player prediction, and can produce agents that achieve comparable social welfare negotiating with humans as humans trading among themselves.
    Average-Constrained Policy Optimization. (arXiv:2302.00808v1 [cs.LG])
    Reinforcement Learning (RL) with constraints is becoming an increasingly important problem for various applications. Often, the average criterion is more suitable. Yet, RL for average criterion-constrained MDPs remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new (possibly the first) policy optimization algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by the famed PPO-type algorithms based on trust region methods. We develop basic sensitivity theory for average MDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging MuJoCo environments, show the superior performance of the algorithm when compared to other state-of-the-art algorithms adapted for the average CMDP setting.
    RobustNeRF: Ignoring Distractors with Robust Losses. (arXiv:2302.00833v1 [cs.CV])
    Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene. When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public.
    Scale up with Order: Finding Good Data Permutations for Distributed Training. (arXiv:2302.00845v1 [cs.LG])
    Gradient Balancing (GraB) is a recently proposed technique that finds provably better data permutations when training models with multiple epochs over a finite dataset. It converges at a faster rate than the widely adopted Random Reshuffling, by minimizing the discrepancy of the gradients on adjacently selected examples. However, GraB only operates under critical assumptions such as small batch sizes and centralized data, leaving open the question of how to order examples at large scale -- i.e. distributed learning with decentralized data. To alleviate the limitation, in this paper we propose D-GraB that involves two novel designs: (1) $\textsf{PairBalance}$ that eliminates the requirement to use stale gradient mean in GraB which critically relies on small learning rates; (2) an ordering protocol that runs $\textsf{PairBalance}$ in a distributed environment with negligible overhead, which benefits from both data ordering and parallelism. We prove D-GraB enjoys linear speed up at rate $\tilde{O}((mnT)^{-2/3})$ on smooth non-convex objectives and $\tilde{O}((mnT)^{-2})$ under PL condition, where $n$ denotes the number of parallel workers, $m$ denotes the number of examples per worker and $T$ denotes the number of epochs. Empirically, we show on various applications including GLUE, CIFAR10 and WikiText-2 that D-GraB outperforms naive parallel GraB and Distributed Random Reshuffling in terms of both training and validation performance.
    Quantum Graph Learning: Frontiers and Outlook. (arXiv:2302.00892v1 [cs.LG])
    Quantum theory has shown its superiority in enhancing machine learning. However, using quantum theory to enhance graph learning is still in its infancy. This survey investigates the current advances in quantum graph learning (QGL) from three perspectives, i.e., underlying theories, methods, and prospects. We first look at QGL and discuss the mutualism of quantum theory and graph learning, the specificity of graph-structured data, and the bottleneck of graph learning, respectively. A new taxonomy of QGL is presented, i.e., quantum computing on graphs, quantum graph representation, and quantum circuits for graph neural networks. Pitfalls are then highlighted and explained. This survey aims to provide a brief but insightful introduction to this emerging field, along with a detailed discussion of frontiers and outlook yet to be investigated.
    Teaching MLOps in Higher Education through Project-Based Learning. (arXiv:2302.01048v1 [cs.SE])
    Building and maintaining production-grade ML-enabled components is a complex endeavor that goes beyond the current approach of academic education, focused on the optimization of ML model performance in the lab. In this paper, we present a project-based learning approach to teaching MLOps, focused on the demonstration and experience with emerging practices and tools to automatize the construction of ML-enabled components. We examine the design of a course based on this approach, including laboratory sessions that cover the end-to-end ML component life cycle, from model building to production deployment. Moreover, we report on preliminary results from the first edition of the course. During the present year, an updated version of the same course is being delivered in two independent universities; the related learning outcomes will be evaluated to analyze the effectiveness of project-based learning for this specific subject.
    SimMTM: A Simple Pre-Training Framework for Masked Time-Series Modeling. (arXiv:2302.00861v1 [cs.LG])
    Time series analysis is widely used across many areas. Recently, to reduce labeling expenses and benefit various tasks, self-supervised pre-training has attracted immense interest. One mainstream paradigm is masked modeling, which successfully pre-trains deep models by learning to reconstruct the masked content based on the unmasked part. However, since the semantic information of time series is mainly contained in temporal variations, the standard way of randomly masking a portion of time points seriously disrupts these vital temporal variations, making the reconstruction task too difficult to guide representation learning. We thus present SimMTM, a Simple pre-training framework for Masked Time-series Modeling. By relating masked modeling to manifold learning, SimMTM proposes to recover masked time points by the weighted aggregation of multiple neighbors outside the manifold, which eases the reconstruction task by assembling ruined but complementary temporal variations from multiple masked series. SimMTM further learns to uncover the local structure of the manifold helpful for masked modeling. Experimentally, SimMTM achieves state-of-the-art fine-tuning performance in two canonical time series analysis tasks: forecasting and classification, covering both in- and cross-domain settings.
    Resilient Binary Neural Network. (arXiv:2302.00956v1 [cs.LG])
    Binary neural networks (BNNs) have received ever-increasing popularity for their great capability of reducing storage burden as well as quickening inference time. However, there is a severe performance drop compared with real-valued networks, due to their intrinsic frequent weight oscillation during training. In this paper, we introduce a Resilient Binary Neural Network (ReBNN) to mitigate the frequent oscillation for better BNN training. We identify that the weight oscillation mainly stems from the non-parametric scaling factor. To address this issue, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. For the first time, we show that the weight oscillation is controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation to parameterize it in back propagation. Based on this, we learn our ReBNN by calculating the balanced parameter based on its maximum magnitude, which can effectively mitigate the weight oscillation with a resilient training process. Extensive experiments are conducted upon various network models, such as ResNet and Faster-RCNN for computer vision, as well as BERT for natural language processing. The results demonstrate the overwhelming performance of our ReBNN over prior arts. For example, our ReBNN achieves 66.9% Top-1 accuracy with ResNet-18 backbone on the ImageNet dataset, surpassing existing state-of-the-arts by a significant margin. Our code is open-sourced at https://github.com/SteveTsui/ReBNN.
    Algorithm Design for Online Meta-Learning with Task Boundary Detection. (arXiv:2302.00857v1 [cs.LG])
    Online meta-learning has recently emerged as a marriage between batch meta-learning and online learning, for achieving the capability of quick adaptation on new tasks in a lifelong manner. However, most existing approaches focus on the restrictive setting where the distribution of the online tasks remains fixed with known task boundaries. In this work, we relax these assumptions and propose a novel algorithm for task-agnostic online meta-learning in non-stationary environments. More specifically, we first propose two simple but effective detection mechanisms of task switches and distribution shift based on empirical observations, which serve as a key building block for more elegant online model updates in our algorithm: the task switch detection mechanism allows reuse of the best model available for the current task at hand, and the distribution shift detection mechanism differentiates the meta model update in order to preserve the knowledge for in-distribution tasks and quickly learn the new knowledge for out-of-distribution tasks. In particular, our online meta model updates are based only on the current data, which eliminates the need to store previous data as required in most existing methods. We further show that a sublinear task-averaged regret can be achieved for our algorithm under mild conditions. Empirical studies on three different benchmarks clearly demonstrate the significant advantage of our algorithm over related baseline approaches.
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.
    Deep-Learning Tool for Early Identifying Non-Traumatic Intracranial Hemorrhage Etiology based on CT Scan. (arXiv:2302.00953v1 [eess.IV])
    Background: To develop an artificial intelligence system that can accurately identify acute non-traumatic intracranial hemorrhage (ICH) etiology based on non-contrast CT (NCCT) scans and investigate whether clinicians can benefit from it in a diagnostic setting. Materials and Methods: The deep learning model was developed with 1868 eligible NCCT scans with non-traumatic ICH collected between January 2011 and April 2018. We tested the model on two independent datasets (TT200 and SD98) collected after April 2018. The model's diagnostic performance was compared with the clinicians' performance. We further designed a simulated study to compare the clinicians' performance with and without the deep learning system augmentation. Results: The proposed deep learning system achieved an area under the receiver operating characteristic curve of 0.986 (95% CI 0.967-1.000) on aneurysms, 0.952 (0.917-0.987) on hypertensive hemorrhage, 0.950 (0.860-1.000) on arteriovenous malformation (AVM), 0.749 (0.586-0.912) on Moyamoya disease (MMD), 0.837 (0.704-0.969) on cavernous malformation (CM), and 0.839 (0.722-0.959) on other causes in the TT200 dataset. Given a 90% specificity level, the sensitivities of our model were 97.1% and 90.9% for aneurysm and AVM diagnosis, respectively. The model also shows impressive generalizability on the independent dataset SD98. The clinicians achieve significant improvements in the sensitivity, specificity, and accuracy of diagnoses of certain hemorrhage etiologies with the proposed system augmentation. Conclusions: The proposed deep learning algorithms can be an effective tool for early identification of hemorrhage etiologies based on NCCT scans. They may also provide more information for clinicians for triage and further imaging examination selection.
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.
  • Open

    Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. (arXiv:2202.12183v5 [cs.LG] UPDATED)
    NDCG, namely Normalized Discounted Cumulative Gain, is a widely used ranking metric in information retrieval and machine learning. However, efficient and provable stochastic methods for maximizing NDCG are still lacking, especially for deep models. In this paper, we propose a principled approach to optimize NDCG and its top-$K$ variant. First, we formulate a novel compositional optimization problem for optimizing the NDCG surrogate, and a novel bilevel compositional optimization problem for optimizing the top-$K$ NDCG surrogate. Then, we develop efficient stochastic algorithms with provable convergence guarantees for the non-convex objectives. Different from existing NDCG optimization methods, the per-iteration complexity of our algorithms scales with the mini-batch size instead of the number of total items. To improve the effectiveness for deep learning, we further propose practical strategies by using initial warm-up and stop gradient operator. Experimental results on multiple datasets demonstrate that our methods outperform prior ranking approaches in terms of NDCG. To the best of our knowledge, this is the first time that stochastic algorithms are proposed to optimize NDCG with a provable convergence guarantee. Our proposed methods are implemented in the LibAUC library at https://libauc.org/.
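    For readers unfamiliar with the metric itself, the sketch below computes NDCG and its top-K variant for a single ranked list. It assumes the common exponential-gain, log2-discount convention and is only the evaluation metric, not the stochastic surrogate optimization proposed in the paper; the function names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances_in_ranked_order, k=None):
    """NDCG of a ranking; if k is given, only the top-k positions count."""
    ranked = relevances_in_ranked_order[:k]
    # The ideal ranking places the most relevant items first.
    ideal = sorted(relevances_in_ranked_order, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Relevance labels of the items, in the order a model ranked them.
    print(ndcg([3, 2, 1]))       # already in the ideal order
    print(ndcg([1, 2, 3]))       # reversed order scores lower
    print(ndcg([1, 2, 3], k=2))  # top-2 variant
```

    Computing the metric requires sorting all items of a query, which is one reason the per-iteration cost of naive NDCG optimization scales with the total number of items rather than the mini-batch size.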
    What Language Reveals about Perception: Distilling Psychophysical Knowledge from Large Language Models. (arXiv:2302.01308v1 [cs.CL])
    Understanding the extent to which the perceptual world can be recovered from language is a fundamental problem in cognitive science. We reformulate this problem as that of distilling psychophysical information from text and show how this can be done by combining large language models (LLMs) with a classic psychophysical method based on similarity judgments. Specifically, we use the prompt auto-completion functionality of GPT3, a state-of-the-art LLM, to elicit similarity scores between stimuli and then apply multidimensional scaling to uncover their underlying psychological space. We test our approach on six perceptual domains and show that the elicited judgments strongly correlate with human data and successfully recover well-known psychophysical structures such as the color wheel and pitch spiral. We also explore meaningful divergences between LLM and human representations. Our work showcases how combining state-of-the-art machine models with well-known cognitive paradigms can shed new light on fundamental questions in perception and language research.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG] CROSS LISTED)
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on a certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    A Machine Learning Approach to Measuring Climate Adaptation. (arXiv:2302.01236v1 [stat.AP])
    I measure adaptation to climate change by comparing elasticities from short-run and long-run changes in damaging weather. I propose a debiased machine learning approach to flexibly measure these elasticities in panel settings. In a simulation exercise, I show that debiased machine learning has considerable benefits relative to standard machine learning or ordinary least squares, particularly in high-dimensional settings. I then measure adaptation to damaging heat exposure in United States corn and soy production. Using rich sets of temperature and precipitation variation, I find evidence that short-run impacts from damaging heat are significantly offset in the long run. I show that this is because the impacts of long-run changes in heat exposure do not follow the same functional form as short-run shocks to heat exposure.
    Interventional and Counterfactual Inference with Diffusion Models. (arXiv:2302.00860v1 [stat.ML])
    We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms that generate unique latent encodings to allow for direct sampling under interventions as well as abduction for counterfactuals. We utilize DCM to model structural equations, seeing that diffusion models serve as a natural candidate here since they encode each node to a latent representation, a proxy for the exogenous noise, and offer flexible and accurate modeling to provide reliable causal statements and estimates. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Our theoretical results provide a methodology for analyzing the counterfactual error for general encoder/decoder models which could be of independent interest.
    Correlated Initialization for Correlated Data. (arXiv:2003.04422v2 [cs.LG] UPDATED)
    Spatial data exhibits the property that nearby points are correlated. This also holds for learnt representations across layers, but not for commonly used weight initialization methods. Our theoretical analysis quantifies the learning behavior of weights of a single spatial filter. It is thus in contrast to a large body of work that discusses statistical properties of weights. It shows that uncorrelated initialization (i) might lead to poor convergence behavior and (ii) training of (some) parameters is likely subject to slow convergence. Empirical analysis shows that these findings for a single spatial filter extend to networks with many spatial filters. The impact of (correlated) initialization depends strongly on learning rates and l2-regularization.
    Analysis of Knowledge Transfer in Kernel Regime. (arXiv:2003.13438v3 [cs.LG] UPDATED)
    Knowledge transfer is shown to be a very successful technique for training neural classifiers: together with the ground truth data, it uses the "privileged information" (PI) obtained by a "teacher" network to train a "student" network. It has been observed that classifiers learn much faster and more reliably via knowledge transfer. However, there has been little or no theoretical analysis of this phenomenon. To bridge this gap, we propose to approach the problem of knowledge transfer by regularizing the fit between the teacher and the student with PI provided by the teacher. Using tools from dynamical systems theory, we show that when the student is an extremely wide two-layer network, we can analyze it in the kernel regime and show that it is able to interpolate between PI and the given data. This characterization sheds new light on the relation between the training error and capacity of the student relative to the teacher. Another contribution of the paper is a quantitative statement on the convergence of the student network. We prove that the teacher reduces the number of required iterations for a student to learn, and consequently improves the generalization power of the student. We give a corresponding experimental analysis that validates the theoretical results and yields additional insights.
    An Exponentially Increasing Step-size for Parameter Estimation in Statistical Models. (arXiv:2205.07999v2 [stat.ML] UPDATED)
    Using gradient descent (GD) with fixed or decaying step-size is a standard practice in unconstrained optimization problems. However, when the loss function is only locally convex, such a step-size schedule artificially slows GD down as it cannot explore the flat curvature of the loss function. To overcome that issue, we propose to exponentially increase the step-size of the GD algorithm. Under homogeneous assumptions on the loss function, we demonstrate that the iterates of the proposed \emph{exponential step size gradient descent} (EGD) algorithm converge linearly to the optimal solution. Leveraging that optimization insight, we then consider using the EGD algorithm for solving parameter estimation under both regular and non-regular statistical models whose loss function becomes locally convex when the sample size goes to infinity. We demonstrate that the EGD iterates reach the final statistical radius within the true parameter after a logarithmic number of iterations, which is in stark contrast to a \emph{polynomial} number of iterations of the GD algorithm in non-regular statistical models. Therefore, the total computational complexity of the EGD algorithm is \emph{optimal} and exponentially cheaper than that of the GD for solving parameter estimation in non-regular statistical models while being comparable to that of the GD in regular statistical settings. To the best of our knowledge, it resolves a long-standing gap between statistical and algorithmic computational complexities of parameter estimation in non-regular statistical models. Finally, we provide targeted applications of the general theory to several classes of statistical models, including generalized linear models with polynomial link functions and location Gaussian mixture models.
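    The contrast between a fixed and an exponentially increasing step-size can be illustrated on a toy problem. The sketch below is not the paper's EGD algorithm or its analysis: the quartic loss $f(x) = x^4/4$ (a homogeneous loss that is flat near its minimum), the initial step-size 0.1, the growth factor 1.1, and the helper names are all assumptions chosen for illustration.

```python
def gradient_descent(grad, x0, step_size, num_steps, growth=1.0):
    """Plain gradient descent; the step-size is multiplied by `growth` after each step.

    growth == 1.0 gives the usual fixed step-size; growth > 1.0 gives an
    exponentially increasing schedule (an illustrative stand-in for EGD).
    """
    x = x0
    eta = step_size
    for _ in range(num_steps):
        x = x - eta * grad(x)
        eta *= growth
    return x


def grad_quartic(x):
    # Gradient of the toy loss f(x) = x**4 / 4, whose curvature vanishes at the optimum x = 0.
    return x ** 3


# Same start, same initial step-size, same number of iterations.
x_fixed = gradient_descent(grad_quartic, x0=1.0, step_size=0.1, num_steps=100)
x_egd = gradient_descent(grad_quartic, x0=1.0, step_size=0.1, num_steps=100, growth=1.1)

print(f"fixed step-size:      |x| = {abs(x_fixed):.4f}")
print(f"increasing step-size: |x| = {abs(x_egd):.4f}")
```

    On this toy loss the fixed step-size iterate shrinks only polynomially in the number of iterations, while the increasing schedule ends up much closer to the optimum. The growth factor must stay moderate: on this particular loss, values that are too large make the iteration unstable.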
    Efficient Privacy-Preserving Stochastic Nonconvex Optimization. (arXiv:1910.13659v3 [cs.LG] UPDATED)
    While many solutions for privacy-preserving convex empirical risk minimization (ERM) have been developed, privacy-preserving nonconvex ERM remains a challenge. We study nonconvex ERM, which takes the form of minimizing a finite sum of nonconvex loss functions over a training set. We propose a new differentially private stochastic gradient descent algorithm for nonconvex ERM that achieves strong privacy guarantees efficiently, and provide a tight analysis of its privacy and utility guarantees, as well as its gradient complexity. Our algorithm reduces gradient complexity while improving the best previous utility guarantee given by Wang et al. (NeurIPS 2017). Our experiments on benchmark nonconvex ERM problems demonstrate superior performance in terms of both training cost and utility gains compared with previous differentially private methods using the same privacy budgets.
    Hierarchical shrinkage Gaussian processes: applications to computer code emulation and dynamical system recovery. (arXiv:2302.00755v1 [stat.ML])
    In many areas of science and engineering, computer simulations are widely used as proxies for physical experiments, which can be infeasible or unethical. Such simulations can often be computationally expensive, and an emulator can be trained to efficiently predict the desired response surface. A widely-used emulator is the Gaussian process (GP), which provides a flexible framework for efficient prediction and uncertainty quantification. Standard GPs, however, do not capture structured sparsity on the underlying response surface, which is present in many applications, particularly in the physical sciences. We thus propose a new hierarchical shrinkage GP (HierGP), which incorporates such structure via cumulative shrinkage priors within a GP framework. We show that the HierGP implicitly embeds the well-known principles of effect sparsity, heredity and hierarchy for analysis of experiments, which allows our model to identify structured sparse features from the response surface with limited data. We propose efficient posterior sampling algorithms for model training and prediction, and prove desirable consistency properties for the HierGP. Finally, we demonstrate the improved performance of HierGP over existing models, in a suite of numerical experiments and an application to dynamical system recovery.
    On the Efficacy of Differentially Private Few-shot Image Classification. (arXiv:2302.01190v1 [stat.ML])
    There has been significant recent progress in training differentially private (DP) models which achieve accuracy that approaches the best non-private models. These DP models are typically pretrained on large public datasets and then fine-tuned on downstream datasets that are (i) relatively large, and (ii) similar in distribution to the pretraining data. However, in many applications including personalization, it is crucial to perform well in the few-shot setting, as obtaining large amounts of labeled data may be problematic; and on images from a wide variety of domains for use in various specialist settings. To understand under which conditions few-shot DP can be effective, we perform an exhaustive set of experiments that reveals how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, dataset, and subset of learnable parameters in the model vary. We show that to achieve DP accuracy on par with non-private models, the shots per class must be increased as the privacy level increases by as much as 32$\times$ for CIFAR-100 at $\epsilon=1$. We also find that few-shot non-private models are highly susceptible to membership inference attacks. DP provides clear mitigation against the attacks, but a small $\epsilon$ is required to effectively prevent them. Finally, we evaluate DP federated learning systems and establish state-of-the-art performance on the challenging FLAIR federated learning benchmark.
    Learning polytopes with fixed facet directions. (arXiv:2201.03419v4 [math.MG] UPDATED)
    We consider the task of reconstructing polytopes with fixed facet directions from finitely many support function evaluations. We show that for a fixed simplicial normal fan the least-squares estimate is given by a convex quadratic program. We study the geometry of the solution set and give a combinatorial characterization for the uniqueness of the reconstruction in this case. We provide an algorithm that, under mild assumptions, converges to the unknown input shape as the number of noisy support function evaluations increases. We also discuss limitations of our results if the restriction on the normal fan is removed.
    Stochastic Contextual Bandits with Long Horizon Rewards. (arXiv:2302.00814v1 [cs.LG])
    The growing interest in complex decision-making and language modeling problems highlights the importance of sample-efficient learning over very long horizons. This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions and contexts (not necessarily consecutive), up to a time horizon of $h$. In order to avoid polynomial dependence on $h$, we propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly. We consider both the data-poor ($T<h$) and data-rich ($T\ge h$) regimes, and derive respective regret upper bounds $\tilde O(d\sqrt{sT} +\min\{ q, T\})$ and $\tilde O(\sqrt{sdT})$, with sparsity $s$, feature dimension $d$, total time horizon $T$, and $q$ that is adaptive to the reward dependence pattern. Complementing upper bounds, we also show that learning over a single trajectory brings inherent challenges: While the dependence pattern and arm parameters form a rank-1 matrix, circulant matrices are not isometric over rank-1 manifolds and sample complexity indeed benefits from the sparse reward dependence structure. Our results necessitate a new analysis to address long-range temporal dependencies across data and avoid polynomial dependence on the reward horizon $h$. Specifically, we utilize connections to the restricted isometry property of circulant matrices formed by dependent sub-Gaussian vectors and establish new guarantees that are also of independent interest.
    The Value of Out-of-Distribution Data. (arXiv:2208.10967v3 [cs.LG] UPDATED)
    We expect the generalization error to improve with more samples from a similar task, and to deteriorate with more samples from an out-of-distribution (OOD) task. In this work, we show a counter-intuitive phenomenon: the generalization error of a task can be a non-monotonic function of the number of OOD samples. As the number of OOD samples increases, the generalization error on the target task improves before deteriorating beyond a threshold. In other words, there is value in training on small amounts of OOD data. We use Fisher's Linear Discriminant on synthetic datasets and deep networks on computer vision benchmarks such as MNIST, CIFAR-10, CINIC-10, PACS and DomainNet to demonstrate and analyze this phenomenon. In the idealistic setting where we know which samples are OOD, we show that these non-monotonic trends can be exploited using an appropriately weighted objective of the target and OOD empirical risk. While its practical utility is limited, this does suggest that if we can detect OOD samples, then there may be ways to benefit from them. When we do not know which samples are OOD, we show how a number of go-to strategies such as data-augmentation, hyper-parameter optimization, and pre-training are not enough to ensure that the target generalization error does not deteriorate with the number of OOD samples in the dataset.
    Sample Complexity of Kernel-Based Q-Learning. (arXiv:2302.00727v1 [cs.LG])
    Modern reinforcement learning (RL) often faces an enormous state-action space. Existing analytical results are typically for settings with a small number of state-actions, or simple models such as linearly modeled Q-functions. To derive statistically efficient RL policies handling large state-action spaces, with more general Q-functions, some recent works have considered nonlinear function approximation using kernel ridge regression. In this work, we derive sample complexities for kernel based Q-learning when a generative model exists. We propose a nonparametric Q-learning algorithm which finds an $\epsilon$-optimal policy in an arbitrarily large scale discounted MDP. The sample complexity of the proposed algorithm is order optimal with respect to $\epsilon$ and the complexity of the kernel (in terms of its information gain). To the best of our knowledge, this is the first result showing a finite sample complexity under such a general model.
    Unsupervised Learning of Sampling Distributions for Particle Filters. (arXiv:2302.01174v1 [eess.SP])
    Accurate estimation of the states of a nonlinear dynamical system is crucial for its design, synthesis, and analysis. Particle filters are estimators constructed by simulating trajectories from a sampling distribution and averaging them based on their importance weight. For particle filters to be computationally tractable, it must be feasible to simulate the trajectories by drawing from the sampling distribution. Simultaneously, these trajectories need to reflect the reality of the nonlinear dynamical system so that the resulting estimators are accurate. Thus, the crux of particle filters lies in designing sampling distributions that are both easy to sample from and lead to accurate estimators. In this work, we propose to learn the sampling distributions. We put forward four methods for learning sampling distributions from observed measurements. Three of the methods are parametric methods in which we learn the mean and covariance matrix of a multivariate Gaussian distribution; each method exploits a different aspect of the data (generic, time structure, graph structure). The fourth method is a nonparametric alternative in which we directly learn a transform of a uniform random variable. All four methods are trained in an unsupervised manner by maximizing the likelihood that the states may have produced the observed measurements. Our computational experiments demonstrate that learned sampling distributions exhibit better performance than designed, minimum-degeneracy sampling distributions.
    Avoiding Model Estimation in Robust Markov Decision Processes with a Generative Model. (arXiv:2302.01248v1 [stat.ML])
    Robust Markov Decision Processes (MDPs) are receiving more attention as a way to learn a robust policy that is less sensitive to environment changes. An increasing number of works analyze the sample efficiency of robust MDPs. However, most works study robust MDPs in a model-based regime, where the transition probability needs to be estimated and requires $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$ storage in memory. A common way to solve robust MDPs is to formulate them as a distributionally robust optimization (DRO) problem. However, solving a DRO problem is non-trivial, so prior works typically assume a strong oracle to obtain the optimal solution of the DRO problem easily. To remove the need for an oracle, we first transform the original robust MDPs into an alternative form, which allows us to use stochastic gradient methods to solve the robust MDPs. Moreover, we prove the alternative form still preserves the role of robustness. With this new formulation, we devise a sample-efficient algorithm to solve the robust MDPs in a model-free regime, which requires only $\mathcal{O}(|\mathcal{S}||\mathcal{A}|)$ memory and no oracle. Finally, we validate our theoretical findings via numerical experiments and show the efficiency of solving the alternative form of robust MDPs.  ( 2 min )
    Fast Online Value-Maximizing Prediction Sets with Conformal Cost Control. (arXiv:2302.00839v1 [cs.LG])
    Many real-world multi-label prediction problems involve set-valued predictions that must satisfy specific requirements dictated by downstream usage. We focus on a typical scenario where such requirements, separately encoding \textit{value} and \textit{cost}, compete with each other. For instance, a hospital might expect a smart diagnosis system to capture as many severe, often co-morbid, diseases as possible (the value), while maintaining strict control over incorrect predictions (the cost). We present a general pipeline, dubbed as FavMac, to maximize the value while controlling the cost in such scenarios. FavMac can be combined with almost any multi-label classifier, affording distribution-free theoretical guarantees on cost control. Moreover, unlike prior works, FavMac can handle real-world large-scale applications via a carefully designed online update mechanism, which is of independent interest. Our methodological and theoretical contributions are supported by experiments on several healthcare tasks and synthetic datasets - FavMac furnishes higher value compared with several variants and baselines while maintaining strict cost control.  ( 2 min )
    Causal Lifting and Link Prediction. (arXiv:2302.01198v1 [cs.LG])
    Current state-of-the-art causal models for link prediction assume an underlying set of inherent node factors -- an innate characteristic defined at the node's birth -- that governs the causal evolution of links in the graph. In some causal tasks, however, link formation is path-dependent, i.e., the outcome of link interventions depends on existing links. For instance, in the customer-product graph of an online retailer, the effect of an 85-inch TV ad (treatment) likely depends on whether the customer already has an 85-inch TV. Unfortunately, existing causal methods are impractical in these scenarios. The cascading functional dependencies between links (due to path dependence) are either unidentifiable or require an impractical number of control variables. In order to remedy this shortcoming, this work develops the first causal model capable of dealing with path dependencies in link prediction. It introduces the concept of causal lifting, an invariance in causal models that, when satisfied, allows the identification of causal link prediction queries using limited interventional data. On the estimation side, we show how structural pairwise embeddings -- a type of symmetry-based joint representation of node pairs in a graph -- exhibit lower bias and correctly represent the causal structure of the task, as opposed to existing node embedding methods, e.g., GNNs and matrix factorization. Finally, we validate our theoretical findings on four datasets under three different scenarios for causal link prediction tasks: knowledge base completion, covariance matrix estimation and consumer-product recommendations.  ( 2 min )
    Neural Estimation of the Rate-Distortion Function With Applications to Operational Source Coding. (arXiv:2204.01612v2 [cs.IT] UPDATED)
    A fundamental question in designing lossy data compression schemes is how well one can do in comparison with the rate-distortion function, which describes the known theoretical limits of lossy compression. Motivated by the empirical success of deep neural network (DNN) compressors on large, real-world data, we investigate methods to estimate the rate-distortion function on such data, which would allow comparison of DNN compressors with optimality. While one could use the empirical distribution of the data and apply the Blahut-Arimoto algorithm, this approach presents several computational challenges and inaccuracies when the datasets are large and high-dimensional, as is the case for modern image datasets. Instead, we re-formulate the rate-distortion objective, and solve the resulting functional optimization problem using neural networks. We apply the resulting rate-distortion estimator, called NERD, on popular image datasets, and provide evidence that NERD can accurately estimate the rate-distortion function. Using our estimate, we show that the rate-distortion achievable by DNN compressors is within several bits of the rate-distortion function for real-world datasets. Additionally, NERD provides access to the rate-distortion achieving channel, as well as samples from its output marginal. Therefore, using recent results in reverse channel coding, we describe how NERD can be used to construct an operational one-shot lossy compression scheme with guarantees on the achievable rate and distortion. Experimental results demonstrate competitive performance with DNN compressors.  ( 2 min )
    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias. (arXiv:2205.09072v2 [cs.LG] UPDATED)
    We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a network achieving perfect training accuracy and having at most $\mathcal{O}(r)$ linear regions, implying a generalization bound. Unlike many other results in the literature, under an additional assumption on the distribution of the data, our result holds even for mild over-parameterization, where the width is $\tilde{\mathcal{O}}(r)$ and independent of the sample size.  ( 2 min )
    An Instrumental Variable Approach to Confounded Off-Policy Evaluation. (arXiv:2212.14468v2 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) is a method for estimating the return of a target policy using some pre-collected observational data generated by a potentially different behavior policy. In some cases, there may be unmeasured variables that can confound the action-reward or action-next-state relationships, rendering many existing OPE approaches ineffective. This paper develops an instrumental variable (IV)-based method for consistent OPE in confounded Markov decision processes (MDPs). Similar to single-stage decision making, we show that IV enables us to correctly identify the target policy's value in infinite horizon settings as well. Furthermore, we propose an efficient and robust value estimator and illustrate its effectiveness through extensive simulations and analysis of real data from a world-leading short-video platform.  ( 2 min )
    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data. (arXiv:2302.00834v1 [cs.LG])
    We study the interpolation, or memorization, power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints.  ( 2 min )
    Safe Optimization of an Industrial Refrigeration Process Using an Adaptive and Explorative Framework. (arXiv:2211.13019v2 [math.OC] UPDATED)
    Many industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown process characteristics, real-time optimization becomes challenging, particularly for the satisfaction of safety constraints. In this paper, we demonstrate the application of an adaptive and explorative real-time optimization framework to an industrial refrigeration process, where we learn the process characteristics through changes in process control targets and through exploration to satisfy safety constraints. We quantify the uncertainty in unknown compressor characteristics of the refrigeration plant by using Gaussian processes and incorporate this uncertainty into the objective function of the real-time optimization problem as a weighted cost term. We adaptively control the weight of this term to drive exploration. The results of our simulation experiments indicate the proposed approach can help to increase the energy efficiency of the considered refrigeration process, closely approximating the performance of a solution that has complete information about the compressor performance characteristics.  ( 2 min )
    FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features. (arXiv:2302.00787v1 [cs.LG])
    The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in important applications ranging from kernel methods to efficient Transformers. We propose parameterized, positive, non-trigonometric RFs which approximate Gaussian and softmax-kernels. In contrast to traditional RF approximations, parameters of these new methods can be optimized to reduce the variance of the approximation, and the optimum can be expressed in closed form. We show that our methods lead to variance reduction in practice ($e^{10}$-times smaller variance and beyond) and outperform previous methods in a kernel regression task. Using our proposed mechanism, we also present FAVOR#, a method for self-attention approximation in Transformers. We show that FAVOR# outperforms other random feature methods in speech modelling and natural language processing.
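    For context, the sketch below shows the basic positive random feature map for the softmax kernel, of the kind used in earlier random feature attention methods. It is not the parameterized, variance-optimized construction proposed in the paper: the feature map $\phi(x) = m^{-1/2} \exp(w_i^\top x - \|x\|^2/2)$ with $w_i \sim \mathcal{N}(0, I)$ is the standard baseline, and the function names, dimensions, and test vectors are assumptions made for illustration.

```python
import math
import random


def positive_random_features(x, projections):
    """Map x to m strictly positive features whose inner products estimate exp(x . y)."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2.0)
        for w in projections
    ]


def softmax_kernel(x, y):
    """Exact softmax kernel exp(x . y)."""
    return math.exp(sum(a * b for a, b in zip(x, y)))


rng = random.Random(0)  # fixed seed so the illustration is reproducible
dim, num_features = 3, 20000

# Each row is one Gaussian projection vector w_i ~ N(0, I).
projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]

x = [0.3, -0.2, 0.1]
y = [0.1, 0.4, -0.3]

phi_x = positive_random_features(x, projections)
phi_y = positive_random_features(y, projections)

estimate = sum(a * b for a, b in zip(phi_x, phi_y))
exact = softmax_kernel(x, y)
print(f"random feature estimate: {estimate:.4f}, exact kernel: {exact:.4f}")
```

    The estimate is unbiased because $\mathbb{E}[\exp(w^\top (x + y))] = \exp(\|x + y\|^2 / 2)$ for standard Gaussian $w$, and every feature is positive, which avoids the sign cancellations of trigonometric features. Its variance grows quickly with the norms of $x$ and $y$, which is the quantity that variance-reduced constructions aim to shrink.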
    Oracle-Preserving Latent Flows. (arXiv:2302.00806v1 [cs.LG])
    We develop a deep learning methodology for the simultaneous discovery of multiple nontrivial continuous symmetries across an entire labelled dataset. The symmetry transformations and the corresponding generators are modeled with fully connected neural networks trained with a specially constructed loss function ensuring the desired symmetry properties. The two new elements in this work are the use of a reduced-dimensionality latent space and the generalization to transformations invariant with respect to high-dimensional oracles. The method is demonstrated with several examples on the MNIST digit dataset.
    Post-hoc Concept Bottleneck Models. (arXiv:2205.15480v2 [cs.LG] UPDATED)
    Concept Bottleneck Models (CBMs) map the inputs onto a set of interpretable concepts (``the bottleneck'') and use the concepts to make predictions. A concept bottleneck enhances interpretability since it can be investigated to understand what concepts the model "sees" in an input and which of these concepts are deemed important. However, CBMs are restrictive in practice as they require dense concept annotations in the training data to learn the bottleneck. Moreover, CBMs often do not match the accuracy of an unrestricted neural network, reducing the incentive to deploy them in practice. In this work, we address these limitations of CBMs by introducing Post-hoc Concept Bottleneck models (PCBMs). We show that we can turn any neural network into a PCBM without sacrificing model performance while still retaining the interpretability benefits. When concept annotations are not available on the training data, we show that PCBM can transfer concepts from other datasets or from natural language descriptions of concepts via multimodal models. A key benefit of PCBM is that it enables users to quickly debug and update the model to reduce spurious correlations and improve generalization to new distributions. PCBM allows for global model edits, which can be more efficient than previous works on local interventions that fix a specific prediction. Through a model-editing user study, we show that editing PCBMs via concept-level feedback can provide significant performance gains without using data from the target domain or model retraining.  ( 2 min )
    Timewarp: Transferable Acceleration of Molecular Dynamics by Learning Time-Coarsened Dynamics. (arXiv:2302.01170v1 [stat.ML])
    Molecular dynamics (MD) simulation is a widely used technique to simulate molecular systems, most commonly at the all-atom resolution where the equations of motion are integrated with timesteps on the order of femtoseconds ($1\textrm{fs}=10^{-15}\textrm{s}$). MD is often used to compute equilibrium properties, which requires sampling from an equilibrium distribution such as the Boltzmann distribution. However, many important processes, such as binding and folding, occur over timescales of milliseconds or beyond, and cannot be efficiently sampled with conventional MD. Furthermore, new MD simulations need to be performed from scratch for each molecular system studied. We present Timewarp, an enhanced sampling method which uses a normalising flow as a proposal distribution in a Markov chain Monte Carlo method targeting the Boltzmann distribution. The flow is trained offline on MD trajectories and learns to make large steps in time, simulating the molecular dynamics of $10^{5} - 10^{6}\:\textrm{fs}$. Crucially, Timewarp is transferable between molecular systems: once trained, we show that it generalises to unseen small peptides (2-4 amino acids), exploring their metastable states and providing wall-clock acceleration when sampling compared to standard MD. Our method constitutes an important step towards developing general, transferable algorithms for accelerating MD.  ( 2 min )
    Sketched Ridgeless Linear Regression: The Role of Downsampling. (arXiv:2302.01088v1 [math.ST])
    Overparametrization often helps improve the generalization performance. This paper proposes a dual view of overparametrization suggesting that downsampling may also help generalize. Motivated by this dual view, we characterize two out-of-sample prediction risks of the sketched ridgeless least square estimator in the proportional regime $m\asymp n \asymp p$, where $m$ is the sketching size, $n$ the sample size, and $p$ the feature dimensionality. Our results reveal the statistical role of downsampling. Specifically, downsampling does not always hurt the generalization performance, and may actually help improve it in some cases. We identify the optimal sketching sizes that minimize the out-of-sample prediction risks, and find that the optimally sketched estimator has more stable risk curves, eliminating the peaks seen in those of the full-sample estimator. We then propose a practical procedure to empirically identify the optimal sketching size. Finally, we extend our results to cover central limit theorems and misspecified models. Numerical studies strongly support our theory.  ( 2 min )
    MonoFlow: Rethinking Divergence GANs via the Perspective of Differential Equations. (arXiv:2302.01075v1 [stat.ML])
    The conventional understanding of adversarial training in generative adversarial networks (GANs) is that the discriminator is trained to estimate a divergence, and the generator learns to minimize this divergence. We argue that despite the fact that many variants of GANs were developed following this paradigm, the current theoretical understanding of GANs and their practical algorithms are inconsistent. In this paper, we leverage Wasserstein gradient flows which characterize the evolution of particles in the sample space, to gain theoretical insights and algorithmic inspiration of GANs. We introduce a unified generative modeling framework - MonoFlow: the particle evolution is rescaled via a monotonically increasing mapping of the log density ratio. Under our framework, adversarial training can be viewed as a procedure first obtaining MonoFlow's vector field via training the discriminator and the generator learns to draw the particle flow defined by the corresponding vector field. We also reveal the fundamental difference between variational divergence minimization and adversarial training. This analysis helps us to identify what types of generator loss functions can lead to the successful training of GANs and suggest that GANs may have more loss designs beyond the literature (e.g., non-saturated loss), as long as they realize MonoFlow. Consistent empirical studies are included to validate the effectiveness of our framework.  ( 2 min )
    The Contextual Lasso: Sparse Linear Models via Deep Neural Networks. (arXiv:2302.00878v1 [stat.ML])
    Sparse linear models are a gold standard tool for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which we wish to explain the model's predictions, and contextual features, which we wish to determine the model's explanations. This dichotomy leads us to propose the contextual lasso, a new statistical estimator that fits a sparse linear model whose sparsity pattern and coefficients can vary with the contextual features. The fitting process involves learning a nonparametric map, realized via a deep neural network, from contextual feature vector to sparse coefficient vector. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. Extensive experiments on real and synthetic data suggest that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.  ( 2 min )
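    One way to see how a projection step can produce exactly sparse coefficients is the Euclidean projection onto an $\ell_1$ ball. The sketch below uses the well-known sort-based projection algorithm; whether the paper's projection layer takes exactly this form is an assumption, and the function name, radius, and input vector are hypothetical.

```python
import math


def project_onto_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto the set {w : sum(|w_i|) <= radius}.

    Sort-based algorithm: find a threshold theta and soft-threshold every
    coordinate by it, which sets the small coordinates exactly to zero.
    """
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already feasible, nothing to do

    magnitudes = sorted((abs(x) for x in v), reverse=True)
    theta = 0.0
    running_sum = 0.0
    for k, u_k in enumerate(magnitudes, start=1):
        running_sum += u_k
        candidate = (running_sum - radius) / k
        if u_k - candidate > 0:
            theta = candidate  # the last k that satisfies the condition wins

    # Soft-thresholding keeps the sign and shrinks each magnitude by theta.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


# Stand-in for a dense coefficient vector produced by a network.
dense_coefficients = [3.0, -1.0, 0.5]
sparse_coefficients = project_onto_l1_ball(dense_coefficients, radius=2.0)
print(sparse_coefficients)
```

    In this example the two smaller coefficients are driven exactly to zero and the remaining one absorbs the whole $\ell_1$ budget, which is the mechanism by which an $\ell_1$ constraint yields sparsity.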
    High-dimensional variable clustering based on sub-asymptotic maxima of a weakly dependent random process. (arXiv:2302.00934v1 [math.ST])
    We propose a new class of models for variable clustering called Asymptotic Independent block (AI-block) models, which define population-level clusters based on the independence of the maxima of a multivariate stationary mixing random process among clusters. This class of models is identifiable, meaning that there exists a maximal element with a partial order between partitions, allowing for statistical inference. We also present an algorithm for recovering the clusters of variables without specifying the number of clusters a priori. Our work provides some theoretical insights into the consistency of our algorithm, demonstrating that under certain conditions it can effectively identify clusters in the data with a computational complexity that is polynomial in the dimension. This implies that groups can be learned nonparametrically in which block maxima of a dependent process are only sub-asymptotic.  ( 2 min )
    Epistemic Neural Networks. (arXiv:2107.08924v7 [cs.LG] UPDATED)
    Intelligence relies on an agent's knowledge of what it does not know. This capability can be assessed based on the quality of joint predictions of labels across multiple inputs. In principle, ensemble-based approaches produce effective joint predictions, but the computational costs of training large ensembles can become prohibitive. We introduce the epinet: an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty. With an epinet, conventional neural networks outperform very large ensembles, consisting of hundreds or more particles, with orders of magnitude less computation. The epinet does not fit the traditional framework of Bayesian neural networks. To accommodate development of approaches beyond BNNs, such as the epinet, we introduce the epistemic neural network (ENN) as an interface for models that produce joint predictions.  ( 2 min )
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v2 [cs.LG] UPDATED)
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These changes result in distribution shifts across environments. Attributing model performance changes to specific shifts is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to distribution shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game where the players are distributions. We define the value of a set of distributions to be the change in model performance when only this set of distributions has changed between environments, and derive an importance weighting method for computing the value of an arbitrary set of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on synthetic, semi-synthetic, and real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.  ( 2 min )
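    The Shapley-value step described in the abstract above can be illustrated with a toy cooperative game. This is only a sketch under stated assumptions: the player names ("covariate", "label") and the numbers in TOY_VALUE are hypothetical and not taken from the paper, and the paper's importance-weighting estimator for the value of a set of distributions is not reproduced here; the value function is simply given as a table.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the players can join the coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical toy game (made-up numbers): the performance change observed
# when only the listed set of distributions shifts between environments.
TOY_VALUE = {
    frozenset(): 0.0,
    frozenset({"covariate"}): 0.04,
    frozenset({"label"}): 0.10,
    frozenset({"covariate", "label"}): 0.20,
}

phi = shapley_values(["covariate", "label"], TOY_VALUE.__getitem__)
```

    By construction the attributions add up to the value of the full set of shifts, which is what makes them usable as a decomposition of the total performance change. Exact enumeration grows factorially with the number of players, so it is only practical for small games like this one.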
    Rare Feature Selection in High Dimensions. (arXiv:1803.06675v3 [stat.ME] UPDATED)
    It is common in modern prediction problems for many predictor variables to be counts of rarely occurring events. This leads to design matrices in which many columns are highly sparse. The challenge posed by such "rare features" has received little attention despite its prevalence in diverse areas, ranging from natural language processing (e.g., rare words) to biology (e.g., rare species). We show, both theoretically and empirically, that not explicitly accounting for the rareness of features can greatly reduce the effectiveness of an analysis. We next propose a framework for aggregating rare features into denser features in a flexible manner that creates better predictors of the response. Our strategy leverages side information in the form of a tree that encodes feature similarity. We apply our method to data from TripAdvisor, in which we predict the numerical rating of a hotel based on the text of the associated review. Our method achieves high accuracy by making effective use of rare words; by contrast, the lasso is unable to identify highly predictive words if they are too rare. A companion R package, called rare, implements our new estimator, using the alternating direction method of multipliers.  ( 2 min )
    Do Kernel and Neural Embeddings Help in Training and Generalization? (arXiv:1905.05095v3 [cs.LG] UPDATED)
    Recent results on optimization and generalization properties of neural networks showed that in a simple two-layer network, the alignment of the labels to the eigenvectors of the corresponding Gram matrix determines the convergence of the optimization during training. Such analyses also provide upper bounds on the generalization error. We experimentally investigate the implications of these results to deeper networks via embeddings. We regard the layers preceding the final hidden layer as producing different representations of the input data which are then fed to the two-layer model. We show that these representations improve both optimization and generalization. In particular, we investigate three kernel representations when fed to the final hidden layer: the Gaussian kernel and its approximation by random Fourier features, kernels designed to imitate representations produced by neural networks and finally an optimal kernel designed to align the data with target labels. The approximated representations induced by these kernels are fed to the neural network and the optimization and generalization properties of the final model are evaluated and compared.  ( 2 min )
    Optimal Stopping via Randomized Neural Networks. (arXiv:2104.13669v3 [stat.ML] UPDATED)
    This paper presents new machine learning approaches to approximate the solutions of optimal stopping problems. The key idea of these methods is to use neural networks, where the parameters of the hidden layers are generated randomly and only the last layer is trained, in order to approximate the continuation value. Our approaches are applicable to high dimensional problems where the existing approaches become increasingly impractical. In addition, since our approaches can be optimized using simple linear regression, they are easy to implement and theoretical guarantees are provided. Our randomized reinforcement learning approach and randomized recurrent neural network approach outperform the state-of-the-art and other relevant machine learning approaches in Markovian and non-Markovian examples, respectively. In particular, we test our approaches on Black-Scholes, Heston, rough Heston and fractional Brownian motion. Moreover, we show that they can also be used to efficiently compute Greeks of American options.  ( 2 min )
    A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models. (arXiv:2302.01217v1 [stat.ML])
    We provide a theoretical justification for sample recovery using diffusion based image inpainting in a linear model setting. While most inpainting algorithms require retraining with each new mask, we prove that diffusion based inpainting generalizes well to unseen masks without retraining. We analyze a recently proposed popular diffusion based inpainting algorithm called RePaint (Lugmayr et al., 2022), and show that it has a bias due to misalignment that hampers sample recovery even in a two-state diffusion process. Motivated by our analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that provably recovers the underlying true sample and enjoys a linear rate of convergence. It achieves this by rectifying the misalignment error present in drift and dispersion of the reverse process. To the best of our knowledge, this is the first linear convergence result for a diffusion based image inpainting algorithm.  ( 2 min )
    Bayesian Optimization of Multiple Objectives with Different Latencies. (arXiv:2302.01310v1 [stat.ML])
    Multi-objective Bayesian optimization aims to find the Pareto front of optimal trade-offs between a set of expensive objectives while collecting as few samples as possible. In some cases, it is possible to evaluate the objectives separately, and a different latency or evaluation cost can be associated with each objective. This presents an opportunity to learn the Pareto front faster by evaluating the cheaper objectives more frequently. We propose a scalarization based knowledge gradient acquisition function which accounts for the different evaluation costs of the objectives. We prove consistency of the algorithm and show empirically that it significantly outperforms a benchmark algorithm which always evaluates both objectives.  ( 2 min )
    Causal Effect Estimation: Recent Advances, Challenges, and Opportunities. (arXiv:2302.00848v1 [cs.LG])
    Causal inference has numerous real-world applications in many domains, such as health care, marketing, political science, and online advertising. Treatment effect estimation, a fundamental problem in causal inference, has been extensively studied in statistics for decades. However, traditional treatment effect estimation methods may not handle large-scale and high-dimensional heterogeneous data well. In recent years, an emerging research direction has attracted increasing attention in the broad artificial intelligence field, which combines the advantages of traditional treatment effect estimation approaches (e.g., propensity score, matching, and reweighting) and advanced machine learning approaches (e.g., representation learning, adversarial learning, and graph neural networks). Although the advanced machine learning approaches have shown extraordinary performance in treatment effect estimation, they also come with many new topics and new research questions. In view of the latest research efforts in the causal inference field, we provide a comprehensive discussion of challenges and opportunities for the three core components of the treatment effect estimation task, i.e., treatment, covariates, and outcome. In addition, we showcase the promising research directions of this topic from multiple perspectives.  ( 2 min )
    Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning. (arXiv:2302.01002v1 [stat.ML])
    We consider the optimisation of large and shallow neural networks via gradient flow, where the output of each hidden node is scaled by some positive parameter. We focus on the case where the node scalings are non-identical, differing from the classical Neural Tangent Kernel (NTK) parameterisation. We prove that, for large neural networks, with high probability, gradient flow converges to a global minimum AND can learn features, unlike in the NTK regime. We also provide experiments on synthetic and real-world datasets illustrating our theoretical results and showing the benefit of such scaling in terms of pruning and transfer learning.  ( 2 min )
    Robust Estimation under the Wasserstein Distance. (arXiv:2302.01237v1 [stat.ML])
    We study the problem of robust distribution estimation under the Wasserstein metric, a popular discrepancy measure between probability distributions rooted in optimal transport (OT) theory. We introduce a new outlier-robust Wasserstein distance $\mathsf{W}_p^\varepsilon$ which allows for $\varepsilon$ outlier mass to be removed from its input distributions, and show that minimum distance estimation under $\mathsf{W}_p^\varepsilon$ achieves minimax optimal robust estimation risk. Our analysis is rooted in several new results for partial OT, including an approximate triangle inequality, which may be of independent interest. To address computational tractability, we derive a dual formulation for $\mathsf{W}_p^\varepsilon$ that adds a simple penalty term to the classic Kantorovich dual objective. As such, $\mathsf{W}_p^\varepsilon$ can be implemented via an elementary modification to standard, duality-based OT solvers. Our results are extended to sliced OT, where distributions are projected onto low-dimensional subspaces, and applications to homogeneity and independence testing are explored. We illustrate the virtues of our framework via applications to generative modeling with contaminated datasets.  ( 2 min )
    Randomized prior wavelet neural operator for uncertainty quantification. (arXiv:2302.01051v1 [stat.ML])
    In this paper, we propose a novel data-driven operator learning framework referred to as the Randomized Prior Wavelet Neural Operator (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with an inherent uncertainty quantification module and hence is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and are easier to implement for large, complex deep-learning architectures than their Bayesian counterparts. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework.  ( 2 min )
    Unpaired Multi-Domain Causal Representation Learning. (arXiv:2302.00993v1 [stat.ML])
    The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our identifiability results into a practical method to recover the shared latent causal graph. Moreover, we study how multiple domains reduce errors in falsely detecting shared causal variables in the finite data setting.  ( 2 min )
    The Power of Preconditioning in Overparameterized Low-Rank Matrix Sensing. (arXiv:2302.01186v1 [cs.LG])
    We propose $\textsf{ScaledGD($\lambda$)}$, a preconditioned gradient descent method to tackle the low-rank matrix sensing problem when the true rank is unknown, and when the matrix is possibly ill-conditioned. Using overparameterized factor representations, $\textsf{ScaledGD($\lambda$)}$ starts from a small random initialization, and proceeds by gradient descent with a specific form of damped preconditioning to combat bad curvatures induced by overparameterization and ill-conditioning. At the expense of light computational overhead incurred by preconditioners, $\textsf{ScaledGD($\lambda$)}$ is remarkably robust to ill-conditioning compared to vanilla gradient descent ($\textsf{GD}$) even with overparameterization. Specifically, we show that, under the Gaussian design, $\textsf{ScaledGD($\lambda$)}$ converges to the true low-rank matrix at a constant linear rate after a small number of iterations that scales only logarithmically with respect to the condition number and the problem dimension. This significantly improves over the convergence rate of vanilla $\textsf{GD}$ which suffers from a polynomial dependency on the condition number. Our work provides evidence on the power of preconditioning in accelerating the convergence without hurting generalization in overparameterized learning.  ( 2 min )
    Robust multi-item auction design using statistical learning: Overcoming uncertainty in bidders' types distributions. (arXiv:2302.00941v1 [cs.GT])
    This paper presents a novel mechanism design for multi-item auction settings with uncertain bidders' type distributions. Our proposed approach utilizes nonparametric density estimation to accurately estimate bidders' types from historical bids, and is built upon the Vickrey-Clarke-Groves (VCG) mechanism, ensuring satisfaction of Bayesian incentive compatibility (BIC) and $\delta$-individual rationality (IR). To further enhance the efficiency of our mechanism, we introduce two novel strategies for query reduction: a filtering method that screens potential winners' value regions within the confidence intervals generated by our estimated distribution, and a classification strategy that designates the lower bound of an interval as the estimated type when the length is below a threshold value. Simulation experiments conducted on both small-scale and large-scale data demonstrate that our mechanism consistently outperforms existing methods in terms of revenue maximization and query reduction, particularly in large-scale scenarios. This makes our proposed mechanism a highly desirable and effective option for sellers in the realm of multi-item auctions.  ( 2 min )
    Lower Bounds for Learning in Revealing POMDPs. (arXiv:2302.01333v1 [cs.LG])
    This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well-established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition} -- A natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood, with existing lower bounds being rather preliminary and having substantial gaps from the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion, and achieve significantly smaller gaps against the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully-observable MDPs; (2) Any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting its fundamental difference from the \emph{single-step} case where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques in \emph{distribution testing}, which is new to the RL literature and may be of independent interest.  ( 2 min )
    Stochastic optimal transport in Banach Spaces for regularized estimation of multivariate quantiles. (arXiv:2302.00982v1 [math.PR])
    We introduce a new stochastic algorithm for solving entropic optimal transport (EOT) between two absolutely continuous probability measures $\mu$ and $\nu$. Our work is motivated by the specific setting of Monge-Kantorovich quantiles where the source measure $\mu$ is either the uniform distribution on the unit hypercube or the spherical uniform distribution. Using the knowledge of the source measure, we propose to parametrize a Kantorovich dual potential by its Fourier coefficients. In this way, each iteration of our stochastic algorithm reduces to two Fourier transforms that enables us to make use of the Fast Fourier Transform (FFT) in order to implement a fast numerical method to solve EOT. We study the almost sure convergence of our stochastic algorithm that takes its values in an infinite-dimensional Banach space. Then, using numerical experiments, we illustrate the performances of our approach on the computation of regularized Monge-Kantorovich quantiles. In particular, we investigate the potential benefits of entropic regularization for the smooth estimation of multivariate quantiles using data sampled from the target measure $\nu$.  ( 2 min )
    Brazilian tailing dam collapse, retrospective precursory monitoring of InSAR data using spectral analysis of time series. (arXiv:2302.00781v1 [stat.ME])
    Slope failures possess destructive power that can cause significant damage to both life and infrastructure. Monitoring slopes prone to instabilities is therefore critical in mitigating the risk posed by their failure. The purpose of slope monitoring is to detect precursory signs of stability issues, such as changes in the rate of displacement with which a slope is deforming. This information can then be used to predict the timing or probability of an imminent failure in order to provide an early warning. In this study, a more objective, statistical-learning algorithm is proposed to detect and characterise the risk of a slope failure, based on spectral analysis of serially correlated displacement time series data. The algorithm is applied to satellite-based interferometric synthetic aperture radar (InSAR) displacement time series data to retrospectively analyse the risk of the 2019 Brumadinho tailings dam collapse in Brazil. Two potential risk milestones are identified and signs of a definitive but emergent risk (27 February 2018 to 26 August 2018) and imminent risk of collapse of the tailings dam (27 June 2018 to 24 December 2018) are detected by the algorithm. Importantly, this precursory indication of risk of failure is detected at least five months prior to the dam collapse on 25 January 2019. The results of this study demonstrate that the combination of spectral methods and second order statistical properties of InSAR displacement time series data can reveal signs of a transition into an unstable deformation regime, and that this algorithm can provide sufficient early warning that could help mitigate catastrophic slope failures.  ( 2 min )
    A Light-weight CNN Model for Efficient Parkinson's Disease Diagnostics. (arXiv:2302.00973v1 [stat.ML])
    In recent years, deep learning methods have achieved great success in various fields due to their strong performance in practical applications. In this paper, we present a light-weight neural network for Parkinson's disease diagnostics, in which a series of hand-drawn data are collected to distinguish Parkinson's disease patients from healthy control subjects. The proposed model consists of a convolutional neural network (CNN) cascaded with long short-term memory (LSTM) to adapt to the characteristics of the collected time-series signals. To make full use of their advantages, a multilayered LSTM model is first used to enrich features, which are then concatenated with raw data and fed into a shallow one-dimensional (1D) CNN model for efficient classification. Experimental results show that the proposed model achieves a high-quality diagnostic result over multiple evaluation metrics with far fewer parameters and operations, outperforming conventional methods such as support vector machine (SVM), random forest (RF), LightGBM (LGB) and CNN-based methods.  ( 2 min )
    High-precision regressors for particle physics. (arXiv:2302.00753v1 [physics.comp-ph])
    Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of $10^9$ - $10^{12}$, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors of less than $1\%$ or even $0.1\%$ over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperform fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks while at higher dimensions neural networks perform significantly better. We show that these regressors can speed up simulations by a factor of $10^3$ - $10^6$ over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.  ( 2 min )
    Conditional expectation for missing data imputation. (arXiv:2302.00911v1 [stat.ML])
    Missing data is common in datasets retrieved in various areas, such as medicine, sports, and finance. In many cases, to enable proper and reliable analyses of such data, the missing values are often imputed, and it is necessary that the method used has a low root mean square error (RMSE) between the imputed and the true values. In addition, for some critical applications, it is also often a requirement that the logic behind the imputation is explainable, which is especially difficult for complex methods that are for example, based on deep learning. This motivates us to introduce a conditional Distribution based Imputation of Missing Values (DIMV) algorithm. This approach works based on finding the conditional distribution of a feature with missing entries based on the fully observed features. As will be illustrated in the paper, DIMV (i) gives a low RMSE for the imputed values compared to state-of-the-art methods under comparison; (ii) is explainable; (iii) can provide an approximated confidence region for the missing values in a given sample; (iv) works for both small and large scale data; (v) in many scenarios, does not require a huge number of parameters as deep learning approaches and therefore can be used for mobile devices or web browsers; and (vi) is robust to the normally distributed assumption that its theoretical grounds rely on. In addition to DIMV, we also introduce the DPER* algorithm improving the speed of DPER for estimating the mean and covariance matrix from the data, and we confirm the speed-up via experiments.  ( 2 min )
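    The idea of imputing a missing feature from the conditional distribution given the observed features can be written out for the jointly Gaussian case, which the abstract names as the assumption behind the theory. The block below is the standard conditional-Gaussian identity, not necessarily DIMV's exact estimator; the subscripts $m$ (missing) and $o$ (observed) are notation introduced here for illustration.

```latex
% Partition a jointly Gaussian vector into missing (m) and observed (o) parts:
\begin{pmatrix} x_m \\ x_o \end{pmatrix}
\sim \mathcal{N}\!\left(
  \begin{pmatrix} \mu_m \\ \mu_o \end{pmatrix},
  \begin{pmatrix} \Sigma_{mm} & \Sigma_{mo} \\ \Sigma_{om} & \Sigma_{oo} \end{pmatrix}
\right)

% Conditional mean (the imputed value) and conditional covariance
% (which yields a confidence region for the missing entries):
\mathbb{E}[x_m \mid x_o] = \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o),
\qquad
\operatorname{Cov}[x_m \mid x_o] = \Sigma_{mm} - \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\Sigma_{om}
```

    Under this reading, the conditional mean minimizes the expected squared error among all functions of the observed features, and each imputed value is a linear function of the observed entries, which is one way such an imputation can be called explainable.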
    Pathologies of Predictive Diversity in Deep Ensembles. (arXiv:2302.00704v1 [cs.LG])
    Classical results establish that ensembles of small models benefit when predictive diversity is encouraged, through bagging, boosting, and similar. Here we demonstrate that this intuition does not carry over to ensembles of deep neural networks used for classification, and in fact the opposite can be true. Unlike regression models or small (unconfident) classifiers, predictions from large (confident) neural networks concentrate in vertices of the probability simplex. Thus, decorrelating these points necessarily moves the ensemble prediction away from vertices, harming confidence and moving points across decision boundaries. Through large scale experiments, we demonstrate that diversity-encouraging regularizers hurt the performance of high-capacity deep ensembles used for classification. Even more surprisingly, discouraging predictive diversity can be beneficial. Together this work strongly suggests that the best strategy for deep ensembles is utilizing more accurate, but likely less diverse, component models.  ( 2 min )
    Versatile Energy-Based Models for High Energy Physics. (arXiv:2302.00695v1 [cs.LG])
    Energy-based models have the natural advantage of flexibility in the form of the energy function. Recently, energy-based models have achieved great success in modeling high-dimensional data in computer vision and natural language processing. In accordance with these signs of progress, we build a versatile energy-based model for High Energy Physics events at the Large Hadron Collider. This framework builds on a powerful generative model and describes higher-order inter-particle interactions. It suits different encoding architectures and builds on implicit generation. As for applicational aspects, it can serve as a powerful parameterized event generator, a generic anomalous signal detector, and an augmented event classifier.  ( 2 min )

  • Open

    Is the purpose of gamma in Q-learning just to help the q-values converge?
    This might be a bit of a dumb question; so I understand the concept of the discount factor with calculating the sum of expected rewards: the closer gamma is to 1.0, the more emphasis the agent places on future rewards as opposed to its current reward. Generally though this seems to happen when a return is calculated for an entire episode. In Q-learning different state action pairs are getting updates to their q-values at each step, so there's no sort of Monte Carlo return that's being calculated back through an entire episode. Therefore is the purpose of gamma here just to make sure the Q-values converge in an infinite horizon case? submitted by /u/1cedrake [link] [comments]  ( 42 min )
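    One way to look at the question above is to write out the one-step Q-learning update and see where gamma enters. The sketch below is a minimal tabular version on a hypothetical two-state chain (the state names, the single "go" action, and the rewards are all made up for illustration). Gamma multiplies the bootstrapped value of the next state, so the discounted return is still there, just folded into the target one step at a time.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9, done=False):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next  # gamma discounts the bootstrapped future value
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
    return q[(state, action)]


# Hypothetical chain: s0 --go, r=0--> s1 --go, r=1--> terminal.
q = {}
actions = ["go"]
for _ in range(60):
    q_learning_update(q, "s0", "go", 0.0, "s1", actions)
    q_learning_update(q, "s1", "go", 1.0, None, actions, done=True)
```

    After the sweeps, Q(s1, go) approaches 1 and Q(s0, go) approaches gamma * 1 = 0.9, i.e. the discounted return from s0, even though no full-episode return was ever computed. So gamma does help with boundedness in the infinite-horizon case, but it also still defines how much the agent values future reward.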
    Why is PPO classified as a policy-based method?
    Hello, I'm fairly new to RL and I'm trying to understand the concepts. I saw Spinning Up's classification of RL algorithms and noticed that PPO is classified as a policy-based method. However, I read that PPO has both actor and critic networks, and I considered it to be a hybrid method. I was wondering: if PPO trains both policy and value networks, why is it considered to be a policy-based method? What is the difference between SAC and PPO such that SAC is hybrid and PPO is not? Thanks in advance. https://preview.redd.it/r05j39rx69ga1.png?width=987&format=png&auto=webp&s=e230bf62944d9d2572e5fe6fbe204ef5eba250b0 submitted by /u/ahmadreza_hadi [link] [comments]  ( 42 min )
    Question on Q-Learning paper
    I've been reading this paper (Financial Trading as a Game: A Deep Reinforcement Learning Approach) and have been wondering about something they try in there. I'm still quite new to Q Learning so maybe I've yet to fully understand things... However, they propose a scheme where at each time step they can calculate the reward for the step that was taken and also the rewards for the other possible actions at that step. Intuitively it makes a lot of sense to me - we can learn more from each step without having to do more random exploration. But I immediately thought that there are probably quite a few areas of RL where we could benefit from the same thing (outside of financial trading). So my question is, given how intuitively smart this approach seems, why isn't it more broadly adopted already? What did I miss? submitted by /u/jarym [link] [comments]  ( 43 min )
    "Autonomous navigation of stratospheric balloons using reinforcement learning", Bellemare et al 2020 [Repost]
    submitted by /u/goolulusaurs [link] [comments]  ( 41 min )
    Why does this PPO implementation calculate the Advantage only once per rollout?
    I am looking at this PPO implementation, which follows the pseudocode given in Spinning Up. This implementation has been really easy to follow and I understand almost everything. However, I am lost at line 103, where the author computes the normalized advantage before the rollout:
        A_k = (A_k - A_k.mean()) / (A_k.std() + 1e-10)
    Moreover, within the rollout loop, the author goes ahead to recalculate the value, but uses the original advantage while computing the surrogate losses:
        for _ in range(self.n_updates_per_iteration):  # ALG STEP 6 & 7
            # Calculate V_phi and pi_theta(a_t | s_t)
            V, curr_log_probs = self.evaluate(batch_obs, batch_acts)
            # Calculate surrogate losses.
            surr1 = ratios * A_k
            surr2 = torch.clamp(ratios, 1 - self.clip, 1 + self.clip) * A_k
    The author also wrote a Medium article about this implementation and wrote the following: https://preview.redd.it/195482c373ga1.png?width=845&format=png&auto=webp&s=0624c10056311ce9d31b98fd4563a8f7acff39f8 But in the rollouts, the author updates V (the value) without updating A (the advantage). submitted by /u/Academic-Rent7800 [link] [comments]  ( 44 min )
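    A common reading of the pattern in the question above is that, in the clipped PPO objective, the advantage is an estimate made under the policy that collected the data, so it is treated as a constant during the update epochs and only the probability ratio changes. The sketch below illustrates that in plain Python rather than PyTorch; the function names and the toy numbers are hypothetical and are not taken from the implementation being discussed.

```python
from statistics import mean, pstdev


def normalized_advantages(returns, values, eps=1e-10):
    """Advantage = return - value estimate, computed once per batch of rollouts,
    then normalized. Uses the population std; torch's Tensor.std() defaults to
    the sample std, so the numbers differ slightly from the snippet above."""
    adv = [r - v for r, v in zip(returns, values)]
    m, s = mean(adv), pstdev(adv)
    return [(a - m) / (s + eps) for a in adv]


def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO clipped objective for one sample: min(r * A, clamp(r, 1-c, 1+c) * A)."""
    clipped = min(max(ratio, 1 - clip), 1 + clip)
    return min(ratio * advantage, clipped * advantage)


# Toy batch (made-up numbers): advantages are fixed before the update epochs.
A_k = normalized_advantages(returns=[1.0, 2.0, 3.0], values=[0.5, 2.0, 2.5])

# Across epochs the policy changes, so the ratios change, but A_k is reused as-is.
ratios_per_epoch = [[1.0, 1.0, 1.0], [1.1, 0.9, 1.3]]
objectives = [
    mean(clipped_surrogate(r, a) for r, a in zip(ratios, A_k))
    for ratios in ratios_per_epoch
]
```

    In this reading, the value network is re-evaluated inside the loop only so that its own regression loss can be computed against the fixed returns; feeding the freshly updated values back into the advantage would change the quantity the ratio is being weighted by in the middle of the update.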
    Minimax with neural network evaluation function
    Is this a thing? To combine game tree search like minimax (or alpha-beta pruning) with neural networks that model the value function of a state? I think AlphaGo did something similar, but with Monte Carlo Tree Search, and it also had a policy network. How would I go about training such a neural network? I am thinking of first treating it as a supervised task where the target values come from heuristic evaluation functions, and then fine-tuning with some kind of RL, but I don't know which. submitted by /u/SupremeChampionOfDi [link] [comments]  ( 44 min )
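    The combination asked about here can be sketched as a depth-limited negamax search whose leaf values come from a pluggable evaluation function. This is a minimal sketch on a hypothetical toy game (take 1 to 3 stones, whoever takes the last stone wins); the `evaluate` callable is a stand-in for a trained value network and is an assumption, not something from the post.

```python
def negamax(stones, depth, evaluate):
    """Value of the position for the side to move, in [-1, 1]."""
    if stones == 0:
        # The opponent just took the last stone, so the side to move has lost.
        return -1.0
    if depth == 0:
        # Search horizon reached: fall back to the learned (here: dummy) evaluator.
        return evaluate(stones)
    return max(
        -negamax(stones - take, depth - 1, evaluate)
        for take in (1, 2, 3)
        if take <= stones
    )

def dummy_value_net(stones):
    """Placeholder for a neural network that maps a state to a value estimate."""
    return 0.0

deep_value_4 = negamax(4, depth=10, evaluate=dummy_value_net)
deep_value_5 = negamax(5, depth=10, evaluate=dummy_value_net)
```

    With enough depth the search resolves the toy game exactly and the evaluator is never consulted; with a shallow depth the result depends entirely on how good the evaluator is, which is where a trained network would matter.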
  • Open

    The Chinese room argument holds that a digital computer executing a program cannot have a "mind", "understanding", or "consciousness", regardless of how intelligently or human-like the program may make the computer behave
    submitted by /u/insaneintheblain [link] [comments]  ( 41 min )
    What do you think are the hard limitations of AI?
    I recently noticed that a lot of the roadblocks we thought AI would struggle with (like making art) have been crossed fairly easily, even with the narrow AI (ML) we have today. I feel that many of the limits we thought AI might have, like thinking outside the box, understanding concepts, self-awareness, or lacking a 'soul', are all kind of subjective and could be overcome with the invention of AGI and ASI in the coming decades. Then they will grow beyond human comprehension. So are there actual hard limitations (if any) that AIs will encounter that are very hard, or maybe impossible, to overcome? submitted by /u/uswhole [link] [comments]  ( 41 min )
    AI Dream 126 - AI MindStorm (2/6) - EPIC journey
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Can Apple, Amazon or Google create a chatbot like ChatGPT?
    ChatGPT is an amazing tool, no doubt, even in its prototype version, and they are continuously improving it at an incredible rate. But the thing is, all three of these giants (Apple, Amazon and Google) have their own incredible voice assistants. They do a similar job of finding requested results on the web, etc. Apple is already adding new health support features. Google Maps might make it more helpful for searches related to geographical regions. And again, they probably have more user data from millions of users. What makes ChatGPT so different from them, and how difficult would it be for them to create their own version, given that they have a lot more resources and data? submitted by /u/JaurasiD [link] [comments]  ( 41 min )
    Found another directory for AI tools
    submitted by /u/simplir [link] [comments]  ( 40 min )
    Midjourney AI has a new "/blend" feature! pretty cool
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    My best animation so far, everything I've learned is in this one!
    submitted by /u/LincolnOsiris_ [link] [comments]  ( 40 min )
    Amazon is Adding 1000 Robots a Day to Its Workforce
    submitted by /u/Flaky_Preparation_50 [link] [comments]  ( 40 min )
    CMU Researchers Introduce FROMAGe: An AI Model That Efficiently Bootstraps Frozen Large Language Models (LLMs) To Generate Free-Form Text Interleaved With Images
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    How to reproduce any human voice
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Easy guide for DreamBooth training and prompts quick on your mobile device with iSee app
    submitted by /u/Wonderful_Neat_1549 [link] [comments]  ( 44 min )
    AskReddit: Looking for an open-source text2music or text2audio model for a web-app project. Early stage in AI discovery, any help much appreciated!
    I'm working on an AI project and looking for an open-source text2music or text2audio model that I can incorporate into a website to experiment with. I'm aware of models like MusicLM and VALL-E but those haven't been released as APIs yet. I've also come across AudioGen and Mousai but same issues there. Does anyone happen to have a suggestion for a text2music model that is OSS and fairly accessible to incorporate into a web app? Looking for the best OSS model out there, but also open to the best text2audio that anyone would recommend - just looking to get text2audio working. Any suggestions from prior experience would be hugely helpful. Thanks very much in advance! submitted by /u/dmalikmusic [link] [comments]  ( 41 min )
    3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 42 min )
    ChatGPT’s Explosive Popularity Makes It the Fastest-Growing App in Human History
    submitted by /u/Tao_Dragon [link] [comments]  ( 41 min )
    My course on creating a ChatGPT Chrome Extension for GMail, would love your feedback!
    https://www.udemy.com/course/chatgpt-bot/?couponCode=5-DAYS-FREE Hey everyone, I recently made a course about ChatGPT as a fun passion project. This is for anyone who wants to learn how to create automated workflows (using Chrome extensions) with ChatGPT. Specifically, you will create a ChatGPT bot that automatically answers your emails. It is beginner friendly and includes getting some good practice with JavaScript. I hope you enjoy it and I'm looking forward to your feedback/questions :) submitted by /u/neuromodel [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    With apps like Lensa AI leading many to doubt AI's creativity, this TED talk is more relevant than ever
    https://www.youtube.com/watch?v=8TOgN-U0ask&t=1s The Lensa AI controversy led many people to question whether AI is really creative or is just "remixing" other artists' copyrighted work used without permission, and to wonder whether AI trained on copyrighted images should be illegal. This talk makes some interesting comparisons which might just mean the answer is no. submitted by /u/BearNo21 [link] [comments]  ( 41 min )
    OpenAI Is Reportedly Launching A ChatGPT App For Android And iOS. Here’s What We Know So Far.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    1000+ AI tools catalog - any feedback?
    I'm creating https://domore.ai/ - a catalog of 1000+ AI tools. The goal is to provide individuals and organizations with the latest information on AI tools. I'd love to hear any feedback you have for me, so feel free to share your thoughts :) submitted by /u/bart_so [link] [comments]  ( 41 min )
    Rasa entity detection
    I've been trying Rasa for some time and ran into a problem: the AI detects the intents perfectly, but it doesn't pick up the entities. Can anyone help me with this? submitted by /u/skychi_ [link] [comments]  ( 40 min )
    OpenAI to Launch ChatGPT Mobile App
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Researchers at Stanford Introduce Parsel: An Artificial Intelligence AI Framework That Enables Automatic Implementation And Validation of Complex Algorithms With Code Large Language Models LLMs
    submitted by /u/madskills42001 [link] [comments]  ( 41 min )
  • Open

    Validation Accuracy Fluctuation
    Hello everyone, I am trying to implement DenseNet from scratch with some improvements for my project. When I fit my model, I see that my validation accuracy fluctuates, but my test accuracy is almost 96%. Is this fluctuation a sign of overfitting, or how should I interpret it? I would appreciate any help. Thank you in advance. https://preview.redd.it/jtybvrt3z8ga1.png?width=778&format=png&auto=webp&s=75ec1a2ab9e2b60854bc92f361e203d825e9ca62 submitted by /u/Hungry-Engineer-5696 [link] [comments]  ( 41 min )
    BoyWithUke AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    AI Learns the Numbers
    submitted by /u/keghn [link] [comments]  ( 40 min )
  • Open

    [D] Mixing metadata and text in embedding for KNN search?
    Say I wanted to do a KNN similarity search, using a text embedding from a block of text from a PDF. But I also want to find these documents written at a similar time, and with similar title and author. Would it make sense to prepend a written form of this metadata ahead of the document text, and sending that to the embedding? Like “this document titled ABC was written by XYZ on January 1, 2020”. Or would it be better to create a separate embedding for the metadata, and merge the embeddings afterwards? submitted by /u/DeadPukka [link] [comments]  ( 43 min )
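    The second option in the question (separate embeddings merged afterwards) can be sketched as a weighted concatenation. This is a minimal sketch under stated assumptions: the vectors below are made-up placeholders for real text and metadata embeddings, and `text_weight` is a hypothetical tuning knob. Scaling each L2-normalized part by the square root of its weight makes the dot product of two combined vectors equal to the weighted sum of the two cosine similarities.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def combine(text_vec, meta_vec, text_weight=0.7):
    """Concatenate normalized parts, scaled so similarities add with weights."""
    w_text = math.sqrt(text_weight)
    w_meta = math.sqrt(1.0 - text_weight)
    return ([w_text * x for x in l2_normalize(text_vec)]
            + [w_meta * x for x in l2_normalize(meta_vec)])

# Placeholder embeddings for two documents (text part, metadata part).
doc_a = combine([0.2, 0.9, 0.1], [1.0, 0.0])
doc_b = combine([0.3, 0.8, 0.0], [0.6, 0.8])
similarity = dot(doc_a, doc_b)
```

    The combined vectors have unit length, so they can be indexed with an ordinary cosine or inner-product KNN search, and the weight can be adjusted without re-embedding the text.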
    [Project] ideas NLP
    Looking for ideas to start an NLP project, I'd like to explore something not too mainstream or novel to some extent, any ideas or datasets I should check out? submitted by /u/mems_m [link] [comments]  ( 42 min )
    [N] [R] Google announces Dreamix: a model that generates videos when given a prompt and an input image/video.
    submitted by /u/radi-cho [link] [comments]  ( 44 min )
    [P] NLP Q&A Bot Project Guidance
    I have performed the steps below and need guidance on how to proceed. So far I have: extracted and preprocessed the text from PDFs; performed NER on the extracted text and created a data frame of entities; created a function to preprocess the query and identify the entities in the question. Now I need guidance or a reference for the following steps: match the entities from the question with the entities in the PDF text and retrieve the paragraph; calculate the similarity score for each paragraph and display the relevant paragraph; generate an answer from the identified paragraph. Please also let me know whether this approach is correct. submitted by /u/sasi_0212 [link] [comments]  ( 43 min )
    [D] Could you use SVD for supervised learning?
    It seems like Singular Value Decomposition is only used for unsupervised learning when trying to reduce the number of features in a high dimensional dataset, but I was wondering why I don't see any articles or literature on using SVD for supervised learning. I know that using a regularization function like Lasso (L1) can get rid of irrelevant features, but I don't see why SVD wouldn't be helpful too. submitted by /u/TemperatureOk6810 [link] [comments]  ( 43 min )
    [R] Coinductive guide to inductive transformer heads
    submitted by /u/adamnemecek [link] [comments]  ( 42 min )
    [R] Grounding Language Models to Images for Multimodal Generation
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] 3D aware image synthesis with a spherical background — BALLGAN
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 43 min )
    Please help a bunch of students (with a pre-annotated data set). We were assigned this task with no prior knowledge of ML and I don't know where to begin. We tried a couple of methods, which ultimately failed. I'd be thankful to anyone who could tell me, step by step, what to do with this data. [D]
    submitted by /u/errorr_unknown [link] [comments]  ( 42 min )
    [P] What tools are available for labelling data for LayoutLMv3?
    I have been working on information extraction from documents, but what I got to know is there are not enough free tools available for labelling data for these kind of tasks. Are there any free tools available for labelling data for LayoutLM models? submitted by /u/TensorDudee [link] [comments]  ( 42 min )
    15 years old and bad at math [D]
    Hi, I've done some programming type things before and I'm interested in learning about ML, to be able to make some basic projects with ML, how good does my math need to be. I get As at school in math but I know that what I'm learning now is pretty basic. I'm just wondering whether I should try learn about ML or wait a few more years for my math skills to improve submitted by /u/Daniel_C_____ [link] [comments]  ( 45 min )
    [N] GitHub CEO on why open source developers should be exempt from the EU’s AI Act
    submitted by /u/EmbarrassedHelp [link] [comments]  ( 43 min )
    [R] Bilingual (or Multilingual) Large Language models are the key to human parity on machine translations even for difficult language pairs and domains (e.g literature). An English-Chinese comparison.
    submitted by /u/MysteryInc152 [link] [comments]  ( 42 min )
    [R] Chinchilla data-optimal scaling laws: In plain English
    submitted by /u/adt [link] [comments]  ( 42 min )
    [D] Purchasing Google Colab Pro
    Hi everyone, I'm currently knee-deep in an ML project with a friend (~4 months of development) and my free compute units on Colab finally ran out. After searching for alternatives and finding none that work as smoothly as Colab, we've considered buying a Pro subscription. My question is: how can I share the compute units I'll get from Colab Pro with said friend? I don't want to make the purchase and later realize that I'm the only person with access to those compute units. submitted by /u/RaphDaPingu [link] [comments]  ( 43 min )
    Information Retrieval book recommendations? [D]
    Maybe not a Machine Learning question, but I'm searching for good books about information retrieval. The two primary ones I can find are: - Introduction to Information Retrieval (2008) - Information Retrieval - Implementing and Evaluating Search Engines (2016) ​ They seem a bit old for 2023, but they may still be useful? Do you have any good book recommendations? submitted by /u/Ggronne [link] [comments]  ( 44 min )
    [R] What’s your suggestion for offline RL?
    Hi guys! I read a lot of offline RL papers last fall semester and chose the topic for my course project. Offline RL seems to have been a very hot topic in recent years. I believe the major challenges for offline RL are (i) distribution shift and (ii) overestimation. The second is caused by the first: the learner is never allowed to interact with the true environment, so it tends to be too optimistic about unseen state-action pairs. Many papers address these challenges, e.g., CQL and MOPO. However, can these methods handle misleading datasets? Consider the following example. Suppose we have only one state (a multi-armed bandit, MAB) and two arms. The first arm returns reward 2/3 with probability 1, and the reward of the second arm follows a Bernoulli distribution with p=1/2. Clearly, choosing the first arm is the best choice. Now suppose that, unfortunately, every sample of the second arm in the dataset received reward 1. Because the agent can only access this misleading dataset, a Bayesian method will produce a posterior that gives the second arm a high score. If we use a Lower Confidence Bound, we need to count the occurrences of each arm, which is very hard to extend to MDPs with arbitrarily large state and action spaces. So, does anyone know of a function that can capture this uncertainty (caused by the dataset), or of any method that can tell the learner it is in a very misleading situation? submitted by /u/AndyMeowMeow [link] [comments]  ( 44 min )
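    The two-arm example in the post can be worked through numerically. This is a minimal sketch, assuming a Hoeffding-style bonus for rewards in [0, 1] and hypothetical sample counts (50 logged pulls of the first arm, 3 of the second); neither the counts nor the confidence level `delta` come from the post.

```python
import math

def lower_confidence_bound(rewards, delta=0.05):
    """Empirical mean minus a Hoeffding bonus for rewards in [0, 1]."""
    n = len(rewards)
    mean = sum(rewards) / n
    return mean - math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Misleading offline dataset: every logged pull of arm_2 happened to pay 1.
dataset = {
    "arm_1": [2.0 / 3.0] * 50,  # deterministic reward 2/3
    "arm_2": [1.0] * 3,         # true mean is 1/2, but the log shows only 1s
}

def empirical_mean(arm):
    return sum(dataset[arm]) / len(dataset[arm])

greedy_choice = max(dataset, key=empirical_mean)
pessimistic_choice = max(
    dataset, key=lambda arm: lower_confidence_bound(dataset[arm])
)
```

    With only three samples the count-based penalty is large enough to prefer the first arm, while the greedy estimate is fooled. The protection is limited, though: if the dataset held 50 all-ones samples of the second arm, this bound would prefer it as well, which is the situation the question is really about.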

  • Open

    [N] FT: Google invests $300mn in artificial intelligence start-up Anthropic
    From the Financial Times: https://www.ft.com/content/583ead66-467c-4bd5-84d0-ed5df7b5bf9c Unpaywalled: https://archive.is/ciZPV I guess I'm a little surprised, this feels like Google backing a competitor to 1) their own Google Brain teams, and 2) Deepmind. The cynical take might be that they're trying to lock in Anthropic; the same way Microsoft locked in OpenAI. submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [R] Topologically evolving new self-modifying multi-task learning algorithms
    I’ve been developing this idea since I first thought of it in mid December last year. Here’s the elevator pitch (skip to how for technical details): Why? Existing models and learning algorithms are extremely static and unable to generalize across tasks as well as humans or to adapt well to new / changing business requirements. This even applies to the final solutions in recent AutoML (see An Empirical Review of Automated Machine Learning, AutoML: A survey of the state-of-the-art). Beyond being static, most suffer from a need for high-performance systems with large amounts of compute and/or memory. This static and bloated nature not only limits the reusability of code, pipelines and all the computations that went into previous versions of a model architecture upon finding a better one. It…  ( 47 min )
    [R] Multimodal Chain-of-Thought Reasoning in Language Models - Amazon Web Services Zhuosheng Zhang et al - Outperforms GPT-3.5 by 16% (75%->91%) and surpasses human performance on ScienceQA while having less than 1B params!
    Paper: https://arxiv.org/abs/2302.00923 Github: https://github.com/amazon-science/mm-cot Twitter: https://paperswithcode.com/top-social Abstract: Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies are mostly isolated in the language modality with LLMs, where LLMs are hard to deploy. To elicit CoT reasoning in multimodality, a possible solution is to fine-tune small language models by fusing the vision and language features to perform CoT reasoning. The key challenge is that those language models tend to generate hallucinated reasoning chains that mislead the answer inference. To mitigate the effect of such mistakes, we propose Multimodal-CoT that incorporates vision features in a decoupled training framework. The framework separates the rationale generation and answer inference into two stages. By incorporating the vision features in both stages, the model is able to generate effective rationales that contribute to answer inference. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16% (75.17%->91.68%) on the ScienceQA benchmark and even surpasses human performance. https://preview.redd.it/g9eo0f94k1ga1.jpg?width=1331&format=pjpg&auto=webp&s=9b5fc84b424aff7160b69ff7c7a5fad071cbb7d2 https://preview.redd.it/fgboci94k1ga1.jpg?width=1323&format=pjpg&auto=webp&s=35215544d9e0a74881c42503d04b62ab09081af1 https://preview.redd.it/2ojfym94k1ga1.jpg?width=1660&format=pjpg&auto=webp&s=cf040c4f422f6c323e8c4d75474a5881f45a41d1 https://preview.redd.it/k7huem94k1ga1.jpg?width=1326&format=pjpg&auto=webp&s=f4326a5088744d3856e5c5c23311be6348fab924 https://preview.redd.it/05m8rf94k1ga1.jpg?width=658&format=pjpg&auto=webp&s=ac4110e57a49fcea6f8c03571edd391ff71bd13d submitted by /u/Singularian2501 [link] [comments]  ( 47 min )
    [P] I trained an AI model on 120M+ songs from iTunes
    Hey ML Reddit! I just shipped a project I’ve been working on called Maroofy: https://maroofy.com You can search for any song, and it’ll use the song’s audio to find other similar-sounding music. Demo: https://twitter.com/subby_tech/status/1621293770779287554 How does it work? I’ve indexed ~120M+ songs from the iTunes catalog with a custom AI audio model that I built for understanding music. My model analyzes raw music audio as input and produces embedding vectors as output. I then store the embedding vectors for all songs into a vector database, and use semantic search to find similar music! Here are some examples you can try: Fetish (Selena Gomez feat. Gucci Mane) — https://maroofy.com/songs/1563859943 The Medallion Calls (Pirates of the Caribbean) — https://maroofy.com/songs/1440649752 Hope you like it! This is an early work in progress, so would love to hear any questions/feedback/comments! :D submitted by /u/BullyMaguireJr [link] [comments]  ( 52 min )
    [N] Google Open Sources Vizier, Hyperparameter + Blackbox Optimization Service at Scale
    Github: https://github.com/google/vizier Google AI Blog: https://ai.googleblog.com/2023/02/open-source-vizier-towards-reliable-and.html Tweet from Zoubin Ghahramani: https://twitter.com/ZoubinGhahrama1/status/1621321675936768000?s=20&t=ZEuz9oSc_GWYxixtXDskqA submitted by /u/enderlayer [link] [comments]  ( 43 min )
    [D] Understanding Vision Transformer (ViT) - What are the prerequisites?
    Hello everyone, I'm interested in diving into the field of computer vision and I recently came across the concept of Vision Transformer (ViT). I want to understand this concept in depth but I'm not sure what prerequisites I need to have in order to grasp the concept fully. Do I need to have a strong background in Recurrent Neural Networks (RNNs) and Transformer (Attention Is All You Need) to understand ViT, or can I get by just knowing the basics of deep learning and Convolutional Neural Networks (CNNs)? I would really appreciate if someone could shed some light on this and provide some guidance. Thank you in advance! submitted by /u/SAbdusSamad [link] [comments]  ( 7 min )
  • Open

    Created an AI research assistant where you can ask questions about any file (i.e. technical paper, report, etc) in English and automatically get the answer. It's like ChatGPT for your files.
    submitted by /u/HamletsLastLine [link] [comments]  ( 46 min )
    Ilya Sutskever says 40 papers explain 90% of modern AI
    In this article (https://dallasinnovates.com/exclusive-qa-john-carmacks-different-path-to-artificial-general-intelligence/) there is a quote from John Carmack that read: "I asked Ilya Sutskever, OpenAI’s chief scientist, for a reading list. He gave me a list of like 40 research papers and said, ‘If you really learn all of these, you’ll know 90% of what matters today. " My question is, what are these 40 papers? submitted by /u/Gryphx [link] [comments]  ( 42 min )
    Chat with your favorite characters from movies, TV shows, books, history, and more.
    I built ChatFAI about a month ago. It's a simple web app that allows you to interact with your favorite characters from movies, TV shows, books, history, and beyond. People are having fun talking to whomever they want to talk to. There is a public characters library and you can also create custom characters based on anyone (or even your imagination). I have been actively improving it and have made it much better recently, so I wanted to share it here to get feedback from you all. Let me know if there is anything else I should add or change. Here it is: https://chatfai.com submitted by /u/usamaejazch [link] [comments]  ( 42 min )
  • Open

    Augmented Lagrangian method for constrained MDP or constrained RL?
    Is there any work on applying the Augmented Lagrangian method to constrained MDP problems that guarantees constraint satisfaction as the iterations proceed? I tried to find some but haven't found much yet. Thanks for sharing any hints! submitted by /u/Sad-Dragonfruit-274 [link] [comments]  ( 41 min )
    reward function
    Hi, my agent is not working well, and I feel like my reward function is not efficient. I'm trying to solve a control problem using reinforcement learning, so I built my reward function from the state cost function, for example reward = previous cost - current cost. If the agent gets closer to the destination or otherwise improves its control, it receives a positive reward; otherwise, it gets a negative reward. But I don't think this is efficient. Can anyone give me advice? Thanks all https://preview.redd.it/rzf78ykzs0ga1.png?width=559&format=png&auto=webp&s=3f4cadb96f16eec59d221ab2b7f50f6839cf8ab6 submitted by /u/sonlightinn [link] [comments]  ( 42 min )
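    One property of the reward described here, reward = previous cost - current cost, is easy to check: without discounting, the rewards telescope, so the episode return depends only on the first and last cost. This is a minimal sketch with made-up cost trajectories; the helper names are hypothetical.

```python
def difference_rewards(costs):
    """reward_t = cost_{t-1} - cost_t along one trajectory of state costs."""
    return [prev - curr for prev, curr in zip(costs, costs[1:])]

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Two hypothetical trajectories with the same start and end cost.
direct = [10.0, 7.0, 8.0, 3.0, 0.0]
detour = [10.0, 2.0, 9.0, 0.0]

direct_rewards = difference_rewards(direct)
detour_rewards = difference_rewards(detour)

undiscounted_direct = discounted_return(direct_rewards, gamma=1.0)
undiscounted_detour = discounted_return(detour_rewards, gamma=1.0)
discounted_direct = discounted_return(direct_rewards, gamma=0.9)
discounted_detour = discounted_return(detour_rewards, gamma=0.9)
```

    Both trajectories earn the same undiscounted return (start cost minus end cost), so with this reward alone the agent has no preference between paths that end in the same place; a discount factor below 1 or a per-step penalty changes that.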
    Does anyone know of any model-based algorithms that deal with imperfect information and stochasticity and don't require a simulator?
    submitted by /u/atomicburn125 [link] [comments]  ( 42 min )
    Why does Advantage Learning help function approximators?
    Can someone please help with this question - https://ai.stackexchange.com/questions/39029/why-does-advantage-learning-help-function-approximators submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
  • Open

    Real-time tracking of wildfire boundaries using satellite imagery
    Posted by Zvika Ben-Haim and Omer Nevo, Software Engineers, Google Research As global temperatures rise, wildfires around the world are becoming more frequent and more dangerous. Their effects are felt by many communities as people evacuate their homes or suffer harm even from proximity to the fire and smoke. As part of Google’s mission to help people access trusted information in critical moments, we use satellite imagery and machine learning (ML) to track wildfires and inform affected communities. Our wildfire tracker was recently expanded. It provides updated fire boundary information every 10–15 minutes, is more accurate than similar satellite products, and improves on our previous work. These boundaries are shown for large fires in the continental US, Mexico, and most of Cana…  ( 92 min )
  • Open

    IoT Project: Why Is .NET The Best Choice?
    As the Internet of Things (IoT) continues to gain more traction at a rapid pace, there is growing demand and need for the development of apps driven by this technology. However, this leaves businesses with a challenging question: which development tool to use for creating such apps? The simple answer is .NET. It is a… The post IoT Project: Why Is .NET The Best Choice? appeared first on Data Science Central.  ( 19 min )
  • Open

    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.  ( 2 min )
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations are missing interactions in these models, which are not included for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows to include all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our methods scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.  ( 2 min )
    Real Estate Property Valuation using Self-Supervised Vision Transformers. (arXiv:2302.00117v1 [cs.CV])
    The use of Artificial Intelligence (AI) in the real estate market has been growing in recent years. In this paper, we propose a new method for property valuation that utilizes self-supervised vision transformers, a recent breakthrough in computer vision and deep learning. Our proposed algorithm uses a combination of machine learning, computer vision and hedonic pricing models trained on real estate data to estimate the value of a given property. We collected and pre-processed a data set of real estate properties in the city of Boulder, Colorado and used it to train, validate and test our algorithm. Our data set consisted of qualitative images (including house interiors, exteriors, and street views) as well as quantitative features such as the number of bedrooms, bathrooms, square footage, lot square footage, property age, crime rates, and proximity to amenities. We evaluated the performance of our model using metrics such as Root Mean Squared Error (RMSE). Our findings indicate that these techniques are able to accurately predict the value of properties, with a low RMSE. The proposed algorithm outperforms traditional appraisal methods that do not leverage property images and has the potential to be used in real-world applications.  ( 2 min )
    Examining Policy Entropy of Reinforcement Learning Agents for Personalization Tasks. (arXiv:2211.11869v2 [cs.LG] UPDATED)
    This effort is focused on examining the behavior of reinforcement learning systems in personalization environments and detailing the differences in policy entropy associated with the type of learning algorithm utilized. We demonstrate that Policy Optimization agents often possess low-entropy policies during training, which in practice results in agents prioritizing certain actions and avoiding others. Conversely, we also show that Q-Learning agents are far less susceptible to such behavior and generally maintain high-entropy policies throughout training, which is often preferable in real-world applications. We provide a wide range of numerical experiments as well as theoretical justification to show that these differences in entropy are due to the type of learning being employed.  ( 2 min )
    Expanding the Deployment Envelope of Behavior Prediction via Adaptive Meta-Learning. (arXiv:2209.11820v3 [cs.LG] UPDATED)
    Learning-based behavior prediction methods are increasingly being deployed in real-world autonomous systems, e.g., in fleets of self-driving vehicles, which are beginning to commercially operate in major cities across the world. Despite their advancements, however, the vast majority of prediction systems are specialized to a set of well-explored geographic regions or operational design domains, complicating deployment to additional cities, countries, or continents. Towards this end, we present a novel method for efficiently adapting behavior prediction models to new environments. Our approach leverages recent advances in meta-learning, specifically Bayesian regression, to augment existing behavior prediction models with an adaptive layer that enables efficient domain transfer via offline fine-tuning, online adaptation, or both. Experiments across multiple real-world datasets demonstrate that our method can efficiently adapt to a variety of unseen environments.  ( 2 min )
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.  ( 2 min )
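    The slicing idea the abstract builds on can be illustrated for ordinary (balanced) optimal transport: in one dimension, the optimal plan between two equal-size point sets matches sorted samples, and a sliced distance averages that 1D cost over random projection directions. This is a minimal sketch of the standard sliced Wasserstein distance in the plane, not the paper's partial-transport (OPT) algorithm; the function names, slice count, and seed are assumptions.

```python
import math
import random

def wasserstein_1d(xs, ys, p=2):
    """W_p^p between two equal-size 1D empirical measures, via sorting."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def sliced_wasserstein(points_a, points_b, n_slices=64, p=2, seed=0):
    """Monte Carlo average of 1D costs over random directions in the plane."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_slices):
        theta = rng.uniform(0.0, math.pi)  # random direction on the half circle
        ux, uy = math.cos(theta), math.sin(theta)
        proj_a = [x * ux + y * uy for x, y in points_a]
        proj_b = [x * ux + y * uy for x, y in points_b]
        total += wasserstein_1d(proj_a, proj_b, p)
    return (total / n_slices) ** (1.0 / p)

cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 2.0, y) for x, y in cloud]
distance_same = sliced_wasserstein(cloud, cloud)
distance_shifted = sliced_wasserstein(cloud, shifted)
```

    Each slice costs only a sort, which is where the computational appeal of sliced distances comes from compared with solving a linear program in the original dimension.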
    PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets. (arXiv:2302.00586v1 [q-fin.TR])
    The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) has emerged as a promising direction to train agents for making profitable investment decisions. However, the evaluation of most FinRL methods focuses only on profit-related measures, which are far from satisfactory for practitioners to deploy these methods into real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for a systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages Mixture-of-Experts (MoE) and risk-sensitive approaches to make diversified risk-aware investment decisions, ii) we evaluate 8 widely used FinRL methods in 4 long-term real-world datasets of influential financial markets to demonstrate the usage of our PRUDEX-Compass, iii) PRUDEX-Compass together with 4 real-world datasets, standard implementation of 8 FinRL methods and a portfolio management RL environment is released as public resources to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can shed light on future FinRL research to prevent untrustworthy results from stalling FinRL's path to successful industry deployment.  ( 2 min )
    Statistical Inference After Adaptive Sampling for Longitudinal Data. (arXiv:2202.07098v2 [cs.LG] UPDATED)
    Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by a large class of adaptive sampling algorithms that are designed to optimize treatment decisions online using accruing data from multiple users. Combining or "pooling" data across users allows adaptive sampling algorithms to potentially learn faster. However, by pooling, these algorithms induce dependence between the collected user data trajectories; we show that this can cause standard variance estimators for i.i.d. data to underestimate the true variance of common estimators on this data type. We develop novel methods to perform a variety of statistical analyses on such adaptively collected data via Z-estimation. Specifically, we introduce the adaptive sandwich variance estimator, a corrected sandwich estimator that leads to consistent variance estimates under adaptive sampling. Additionally, to prove our results we develop significant theory for empirical processes on non-i.i.d., adaptively collected, longitudinal data. This work is motivated by our efforts in designing experiments in which online reinforcement learning algorithms pool data across users to learn to optimize treatment decisions, yet reliable statistical inference is essential for conducting a variety of statistical analyses after the experiment is over.  ( 2 min )
    Thermal Heating in ReRAM Crossbar Arrays: Challenges and Solutions. (arXiv:2212.13707v2 [cs.AR] UPDATED)
    The higher speed, scalability and parallelism offered by ReRAM crossbar arrays foster development of ReRAM-based next generation AI accelerators. At the same time, sensitivity of ReRAM to temperature variations decreases the $R_{on}/R_{off}$ ratio and negatively affects the achieved accuracy and reliability of the hardware. Various works on temperature-aware optimization and remapping in ReRAM crossbar arrays reported up to 58\% improvement in accuracy and 2.39$\times$ ReRAM lifetime enhancement. This paper classifies the challenges caused by thermal heat, starting from constraints in ReRAM cells' dimensions and characteristics to their placement in the architecture. In addition, it reviews available solutions designed to mitigate the impact of these challenges, including emerging temperature-resilient DNN training methods. Our work also provides a summary of the techniques and their advantages and limitations.  ( 2 min )
    Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees. (arXiv:2302.00037v1 [cs.LG])
    Hierarchical Clustering is a popular unsupervised machine learning method with decades of history and numerous applications. We initiate the study of differentially private approximation algorithms for hierarchical clustering under the rigorous framework introduced by (Dasgupta, 2016). We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $\Omega(|V|^2/ \epsilon)$-additive error for an input dataset $V$. Then, we exhibit a polynomial-time approximation algorithm with $O(|V|^{2.5}/ \epsilon)$-additive error, and an exponential-time algorithm that meets the lower bound. To overcome the lower bound, we focus on the stochastic block model, a popular model of graphs, and, with a separation assumption on the blocks, propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly. Finally, we perform an empirical study of our algorithms and validate their performance.  ( 2 min )
    ImpressLearn: Continual Learning via Combined Task Impressions. (arXiv:2210.01987v2 [cs.CV] UPDATED)
    This work proposes a new method to sequentially train deep neural networks on multiple tasks without suffering catastrophic forgetting, while endowing them with the capability to quickly adapt to unseen tasks. Starting from existing work on network masking (Wortsman et al., 2020), we show that simply learning a linear combination of a small number of task-specific supermasks (impressions) on a randomly initialized backbone network is sufficient both to retain accuracy on previously learned tasks and to achieve high accuracy on unseen tasks. In contrast to previous methods, we do not need to generate dedicated masks or contexts for each new task, instead leveraging transfer learning to keep per-task parameter overhead small. Our work illustrates the power of linearly combining individual impressions, each of which fares poorly in isolation, to achieve performance comparable to a dedicated mask. Moreover, even repeated impressions from the same task (homogeneous masks), when combined, can approach the performance of heterogeneous combinations if sufficiently many impressions are used. Our approach scales more efficiently than existing methods, often requiring orders of magnitude fewer parameters and can function without modification even when task identity is missing. In addition, in the setting where task labels are not given at inference, our algorithm gives an often favorable alternative to the one-shot procedure used by Wortsman et al., 2020. We evaluate our method on a number of well-known image classification datasets and network architectures.  ( 2 min )
    Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data. (arXiv:2302.00674v1 [cs.LG])
    Few-shot learning involves learning an effective model from only a few labeled datapoints. The use of a small training set makes it difficult to avoid overfitting but also makes few-shot learning applicable to many important real-world settings. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization. Introducing auxiliary data during few-shot learning leads to essential design choices where hand-designed heuristics can lead to sub-optimal performance. In this work, we focus on automated sampling strategies for FLAD and relate them to the explore-exploit dilemma that is central in multi-armed bandit settings. Based on this connection we propose two algorithms -- EXP3-FLAD and UCB1-FLAD -- and compare them with methods that either explore or exploit, finding that the combination of exploration and exploitation is crucial. Using our proposed algorithms to train T5 yields a 9% absolute improvement over the explicitly multi-task pre-trained T0 model across 11 datasets.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
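    For the linear case, conditioning a Gaussian prior on a sum constraint has a closed form. The sketch below applies standard Gaussian conditioning to the outputs at a single input location; it is an illustration under that assumption rather than the paper's construction (which also covers nonlinear constraints), and the function name and example numbers are made up.

```python
import numpy as np

def condition_on_sum(mean, cov, weights, total):
    """Condition f ~ N(mean, cov) on the noise-free constraint weights @ f == total."""
    weights = np.asarray(weights, dtype=float)
    k_a = cov @ weights              # covariance between f and the constrained sum
    s = weights @ k_a                # prior variance of the constrained sum
    gain = k_a / s
    new_mean = mean + gain * (total - weights @ mean)
    new_cov = cov - np.outer(gain, k_a)
    return new_mean, new_cov

# Illustrative prior over three task outputs at one input.
mean = np.array([0.5, -0.2, 1.0])
cov = np.array([[1.0, 0.3, 0.1],
                [0.3, 2.0, 0.4],
                [0.1, 0.4, 1.5]])
weights = np.ones(3)
post_mean, post_cov = condition_on_sum(mean, cov, weights, total=3.0)
```

    After conditioning, the posterior mean satisfies the constraint exactly and the posterior variance of the weighted sum is zero, so every sample from the conditioned distribution respects the constraint.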
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
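    The memory gap between the full $NO \times NO$ kernel and an $N \times N$ approximation is easy to quantify. The sketch below is plain arithmetic on the sizes quoted in the abstract; the choices N = 10,000, O = 10 and 4-byte entries are illustrative assumptions, not figures from the paper.

```python
def kernel_bytes(n_inputs, n_outputs=1, bytes_per_entry=4):
    """Bytes needed to store a dense (n_inputs * n_outputs)-square kernel matrix."""
    side = n_inputs * n_outputs
    return side * side * bytes_per_entry

full_entk = kernel_bytes(10_000, n_outputs=10)   # NO x NO matrix
approx = kernel_bytes(10_000)                    # N x N matrix
ratio = full_entk // approx                      # memory saving factor, equal to O**2
```

    With these numbers the full eNTK needs 40 GB against 0.4 GB for the approximation, a factor of O squared.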
    The Replicator Dynamic, Chain Components and the Response Graph. (arXiv:2209.15230v2 [cs.GT] UPDATED)
    In this paper we examine the relationship between the flow of the replicator dynamic, the continuum limit of Multiplicative Weights Update, and a game's response graph. We settle an open problem establishing that under the replicator, sink chain components -- a topological notion of long-run outcome of a dynamical system -- always exist and are approximated by the sink connected components of the game's response graph. More specifically, each sink chain component contains a sink connected component of the response graph, as well as all mixed strategy profiles whose support consists of pure profiles in the same connected component, a set we call the content of the connected component. As a corollary, all profiles are chain recurrent in games with strongly connected response graphs. In any two-player game sharing a response graph with a zero-sum game, the sink chain component is unique. In two-player zero-sum and potential games the sink chain components and sink connected components are in a one-to-one correspondence, and we conjecture that this holds in all games.
    Development of deep biological ages aware of morbidity and mortality based on unsupervised and semi-supervised deep learning approaches. (arXiv:2302.00319v1 [cs.LG])
    Background: While deep learning technology, which has the capability of obtaining latent representations based on large-scale data, can be a potential solution for the discovery of a novel aging biomarker, existing deep learning methods for biological age estimation usually depend on chronological ages and lack consideration of mortality and morbidity, which are the most significant outcomes of aging. Methods: This paper proposes a novel deep learning model to learn latent representations of biological aging with regard to subjects' morbidity and mortality. The model utilizes health check-up data in addition to morbidity and mortality information to learn the complex relationships between aging and measured clinical attributes. Findings: The proposed model is evaluated on a large dataset of general populations compared with KDM and other learning-based models. Results demonstrate that biological ages obtained by the proposed model have superior discriminability of subjects' morbidity and mortality.
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Off-the-Grid MARL: a Framework for Dataset Generation with Baselines for Cooperative Offline Multi-Agent Reinforcement Learning. (arXiv:2302.00521v1 [cs.LG])
    Being able to harness the power of large, static datasets for developing autonomous multi-agent systems could unlock enormous value for real-world applications. Many important industrial systems are multi-agent in nature and are difficult to model using bespoke simulators. However, in industry, distributed system processes can often be recorded during operation, and large quantities of demonstrative data can be stored. Offline multi-agent reinforcement learning (MARL) provides a promising paradigm for building effective online controllers from static datasets. However, offline MARL is still in its infancy, and, therefore, lacks standardised benchmarks, baselines and evaluation protocols typically found in more mature subfields of RL. This deficiency makes it difficult for the community to sensibly measure progress. In this work, we aim to fill this gap by releasing off-the-grid MARL (OG-MARL): a framework for generating offline MARL datasets and algorithms. We release an initial set of datasets and baselines for cooperative offline MARL, created using the framework, along with a standardised evaluation protocol. Our datasets provide settings that are characteristic of real-world systems, including complex dynamics, non-stationarity, partial observability, suboptimality and sparse rewards, and are generated from popular online MARL benchmarks. We hope that OG-MARL will serve the community and help steer progress in offline MARL, while also providing an easy entry point for researchers new to the field.
    Uniswap Liquidity Provision: An Online Learning Approach. (arXiv:2302.00610v1 [cs.GT])
    Decentralized Exchanges (DEXs) are new types of marketplaces leveraging Blockchain technology. They allow users to trade assets with Automatic Market Makers (AMM), using funds provided by liquidity providers, removing the need for order books. One such DEX, Uniswap v3, allows liquidity providers to allocate funds more efficiently by specifying an active price interval for their funds. This introduces the problem of finding an optimal strategy for choosing price intervals. We formalize this problem as an online learning problem with non-stochastic rewards. We use regret-minimization methods to show a liquidity provision strategy that guarantees a lower bound on the reward. This is true even for non-stochastic changes to asset pricing, and we express this bound in terms of the trading volume.
    Code2Snapshot: Using Code Snapshots for Learning Representations of Source Code. (arXiv:2111.01097v3 [cs.SE] UPDATED)
    There are several approaches for encoding source code in the input vectors of neural models. These approaches attempt to include various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of the source code that is based on the snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs have comparable performance to state-of-the-art representations. Interestingly, obscuring input programs has an insignificant impact on Code2Snapshot's performance, suggesting that, for some tasks, neural models may provide high performance by relying merely on the structure of input programs.
    Cross-client Label Propagation for Transductive Federated Learning. (arXiv:2210.06434v2 [cs.LG] UPDATED)
    We present Cross-Client Label Propagation (XCLP), a new method for transductive federated learning. XCLP estimates a data graph jointly from the data of multiple clients and computes labels for the unlabeled data by propagating label information across the graph. To avoid clients having to share their data with anyone, XCLP employs two cryptographically secure protocols: secure Hamming distance computation and secure summation. We demonstrate two distinct applications of XCLP within federated learning. In the first, we use it in a one-shot way to predict labels for unseen test points. In the second, we use it to repeatedly pseudo-label unlabeled training data in a federated semi-supervised setting. Experiments on both real federated and standard benchmark datasets show that in both applications XCLP achieves higher classification accuracy than alternative approaches.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
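    The resampling rule described above (replace the model with probability $1-\gamma$ at each time step) implies that a sampled model is kept for $1/(1-\gamma)$ steps on average. The sketch below simulates only that schedule; the posterior sampling and planning steps are omitted, and the function name, horizon and seed are illustrative assumptions.

```python
import random

def resampling_times(horizon, gamma, seed=0):
    """Time steps at which the model would be redrawn from the posterior.

    At every step the current model is replaced with probability 1 - gamma,
    independently of everything else.
    """
    rng = random.Random(seed)
    return [t for t in range(horizon) if rng.random() < 1.0 - gamma]

horizon, gamma = 100_000, 0.99
times = resampling_times(horizon, gamma)
mean_gap = horizon / len(times)   # empirical average number of steps per model
```

    With $\gamma = 0.99$ the average gap should come out close to 100 steps, so the discount factor directly sets how long the agent commits to each sampled model.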
    Short-term Prediction and Filtering of Solar Power Using State-Space Gaussian Processes. (arXiv:2302.00388v1 [cs.LG])
    Short-term forecasting of solar photovoltaic energy (PV) production is important for powerplant management. Ideally these forecasts are equipped with error bars, so that downstream decisions can account for uncertainty. To produce predictions with error bars in this setting, we consider Gaussian processes (GPs) for modelling and predicting solar photovoltaic energy production in the UK. A standard application of GP regression on the PV timeseries data is infeasible due to the large data size and non-Gaussianity of PV readings. However, this is made possible by leveraging recent advances in scalable GP inference, in particular, by using the state-space form of GPs, combined with modern variational inference techniques. The resulting model is not only scalable to large datasets but can also handle continuous data streams via Kalman filtering.
    WISE: Wavelet Transformation for Boosting Transformers' Long Sequence Learning Ability. (arXiv:2210.01989v2 [cs.CL] UPDATED)
    Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform should be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between wavelet transform and Transformers. Specifically, we focus on a new paradigm, WISE, which replaces the attention in Transformers by (1) applying forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting non-linear transformations in the wavelet coefficient space, and (3) reconstructing the representation in input space via backward wavelet transform. Extensive experiments on the Long Range Arena benchmark demonstrate that learning attention in the wavelet space using either fixed or adaptive wavelets can consistently improve Transformer's performance and also significantly outperform Fourier-based methods.
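    The three steps (forward transform, non-linear map on coefficients, backward transform) can be sketched with a single-level Haar wavelet. This is a simplified illustration, not the paper's layer: it assumes one decomposition level instead of multi-resolution bases, a fixed Haar basis, an even sequence length, and an arbitrary tanh non-linearity with hypothetical weight matrices `w_approx` and `w_detail`.

```python
import numpy as np

def haar_forward(x):
    """One-level orthonormal Haar transform along axis 0 (sequence length must be even)."""
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-frequency coefficients
    detail = (even - odd) / np.sqrt(2.0)   # high-frequency coefficients
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    out = np.empty((2 * approx.shape[0],) + approx.shape[1:], dtype=approx.dtype)
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def wavelet_mixing_layer(x, w_approx, w_detail):
    """(1) forward transform, (2) non-linear map per coefficient band, (3) reconstruction."""
    approx, detail = haar_forward(x)
    approx = np.tanh(approx @ w_approx)
    detail = np.tanh(detail @ w_detail)
    return haar_inverse(approx, detail)

rng = np.random.default_rng(0)
seq = rng.normal(size=(8, 4))              # (sequence length, feature dimension)
w_a = rng.normal(size=(4, 4))
w_d = rng.normal(size=(4, 4))
mixed = wavelet_mixing_layer(seq, w_a, w_d)
roundtrip = haar_inverse(*haar_forward(seq))
```

    Both transforms touch each sequence element a constant number of times, which is the linear-time property the abstract contrasts with attention.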
    Graph Neural Network Based Surrogate Model of Physics Simulations for Geometry Design. (arXiv:2302.00557v1 [cs.LG])
    Computational Intelligence (CI) techniques have shown great potential as surrogate models of expensive physics simulations, with demonstrated ability to make fast predictions, albeit at the expense of accuracy in some cases. For many scientific and engineering problems involving geometrical design, it is desirable for the surrogate models to precisely describe the change in geometry and predict the consequences. In that context, we develop graph neural networks (GNNs) as fast surrogate models for physics simulation, which allow us to directly train the models on 2/3D geometry designs that are represented by an unstructured mesh or point cloud, without the need for any explicit or hand-crafted parameterization. We utilize an encoder-processor-decoder-type architecture which can flexibly make predictions at both node level and graph level. The performance of our proposed GNN-based surrogate model is demonstrated on 2 example applications: feature designs in the domain of additive engineering and airfoil design in the domain of aerodynamics. The models show good accuracy in their predictions on a separate set of test geometries after training, with almost instant prediction speeds, as compared to O(hour) for the high-fidelity simulations required otherwise.
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we use deep neural networks to learn $\psi$-weakly dependent processes. This weak-dependence property includes a class of weak dependence conditions such as mixing and association, and the setting considered here covers many commonly used situations such as regression estimation, time series prediction, and time series classification. The consistency of the empirical risk minimization algorithm in the class of deep neural network predictors is established. We achieve the generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2$. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.
    Fast and realistic large-scale structure from machine-learning-augmented random field simulations. (arXiv:2205.07898v2 [astro-ph.CO] UPDATED)
    Producing thousands of simulations of the dark matter distribution in the Universe with increasing precision is a challenging but critical task to facilitate the exploitation of current and forthcoming cosmological surveys. Many inexpensive substitutes to full $N$-body simulations have been proposed, even though they often fail to reproduce the statistics of the smaller, non-linear scales. Among these alternatives, a common approximation is represented by the lognormal distribution, which comes with its own limitations as well, while being extremely fast to compute even for high-resolution density fields. In this work, we train a generative deep learning model, mainly made of convolutional layers, to transform projected lognormal dark matter density fields to more realistic dark matter maps, as obtained from full $N$-body simulations. We detail the procedure that we follow to generate highly correlated pairs of lognormal and simulated maps, which we use as our training data, exploiting the information of the Fourier phases. We demonstrate the performance of our model comparing various statistical tests with different field resolutions, redshifts and cosmological parameters, proving its robustness and explaining its current limitations. When evaluated on 100 test maps, the augmented lognormal random fields reproduce the power spectrum up to wavenumbers of $1 \ h \ \rm{Mpc}^{-1}$, and the bispectrum within 10%, and always within the error bars, of the fiducial target simulations. Finally, we describe how we plan to integrate our proposed model with existing tools to yield more accurate spherical random fields for weak lensing analysis.
    A Comprehensive Survey of Continual Learning: Theory, Method and Application. (arXiv:2302.00487v1 [cs.LG])
    To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance drop of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of continual learning in terms of the current trends, cross-directional prospects and interdisciplinary connections with neuroscience, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Improved Exact and Heuristic Algorithms for Maximum Weight Clique. (arXiv:2302.00458v1 [cs.DS])
    We propose improved exact and heuristic algorithms for solving the maximum weight clique problem, a well-known problem in graph theory with many applications. Our algorithms interleave successful techniques from related work with novel data reduction rules that use local graph structure to identify and remove vertices and edges while retaining the optimal solution. We evaluate our algorithms on a range of synthetic and real-world graphs, and find that they outperform the current state of the art on most inputs. Our data reductions always produce smaller reduced graphs than existing data reductions alone. As a result, our exact algorithm, MWCRedu, finds solutions orders of magnitude faster on naturally weighted, medium-sized map labeling graphs and random hyperbolic graphs. Our heuristic algorithm, MWCPeel, outperforms its competitors on these instances, but is slightly less effective on extremely dense or large instances.
    Accordion: A Communication-Aware Machine Learning Framework for Next Generation Networks. (arXiv:2302.00623v1 [cs.NI])
    In this article, we advocate for the design of ad hoc artificial intelligence (AI)/machine learning (ML) models to facilitate their usage in future smart infrastructures based on communication networks. To motivate this, we first review key operations identified by the 3GPP for transferring AI/ML models through 5G networks and the main existing techniques to reduce their communication overheads. We also present a novel communication-aware ML framework, which we refer to as Accordion, that enables an efficient AI/ML model transfer thanks to an overhauled model training and communication protocol. We demonstrate the communication-related benefits of Accordion, analyse key performance trade-offs, and discuss potential research directions within this realm.
    Towards Implementing Energy-aware Data-driven Intelligence for Smart Health Applications on Mobile Platforms. (arXiv:2302.00514v1 [cs.LG])
    Recent breakthrough technological progressions of powerful mobile computing resources such as low-cost mobile GPUs along with cutting-edge, open-source software architectures have enabled high-performance deep learning on mobile platforms. These advancements have revolutionized the capabilities of today's mobile applications in different dimensions to perform data-driven intelligence locally, particularly for smart health applications. Unlike traditional machine learning (ML) architectures, modern on-device deep learning frameworks are proficient in utilizing computing resources in mobile platforms seamlessly, in terms of producing highly accurate results in less inference time. However, on the flip side, energy resources in a mobile device are typically limited. Hence, whenever a complex Deep Neural Network (DNN) architecture is fed into the on-device deep learning framework, while it achieves high prediction accuracy (and performance), it also urges huge energy demands during the runtime. Therefore, managing these resources efficiently within the spectrum of performance and energy efficiency is the newest challenge for any mobile application featuring data-driven intelligence beyond experimental evaluations. In this paper, first, we provide a timely review of recent advancements in on-device deep learning while empirically evaluating the performance metrics of current state-of-the-art ML architectures and conventional ML approaches with the emphasis given on energy characteristics by deploying them on a smart health application. With that, we are introducing a new framework through an energy-aware, adaptive model comprehension and realization (EAMCR) approach that can be utilized to make more robust and efficient inference decisions based on the available computing/energy resources in the mobile device during the runtime.
    Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics. (arXiv:2302.00600v1 [cs.LG])
    Coarse-grained (CG) molecular dynamics enables the study of biological processes at temporal and spatial scales that would be intractable at an atomistic resolution. However, accurately learning a CG force field remains a challenge. In this work, we leverage connections between score-based generative models, force fields and molecular dynamics to learn a CG force field without requiring any force inputs during training. Specifically, we train a diffusion generative model on protein structures from molecular dynamics simulations, and we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.
    MB-DECTNet: A Model-Based Unrolled Network for Accurate 3D DECT Reconstruction. (arXiv:2302.00577v1 [eess.IV])
    Numerous dual-energy CT (DECT) techniques have been developed in the past few decades. DECT statistical iterative reconstruction (SIR) has demonstrated its potential for reducing noise and increasing accuracy. Our lab proposed a joint statistical DECT algorithm for stopping power estimation and showed that it outperforms competing image-based material-decomposition methods. However, due to its slow convergence and the high computational cost of projections, the elapsed time of 3D DECT SIR is often not clinically acceptable. Therefore, to improve its convergence, we have embedded DECT SIR into a deep learning model-based unrolled network for 3D DECT reconstruction (MB-DECTNet) that can be trained in an end-to-end fashion. This deep learning-based method is trained to learn the shortcuts between the initial conditions and the stationary points of iterative algorithms while preserving the unbiased estimation property of model-based algorithms. MB-DECTNet is formed by stacking multiple update blocks, each of which consists of a data consistency (DC) layer and a spatial mixer layer, where the spatial mixer layer is a shrunken U-Net and the DC layer is a one-step update of an arbitrary traditional iterative method. Although the proposed network can be combined with numerous iterative DECT algorithms, we demonstrate its performance with dual-energy alternating minimization (DEAM). The qualitative result shows that MB-DECTNet with DEAM significantly reduces noise while increasing the resolution of the test image. The quantitative result shows that MB-DECTNet has the potential to estimate attenuation coefficients as accurately as traditional statistical algorithms but with a much lower computational cost.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the output of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport. (arXiv:2302.00482v1 [cs.LG])
    Continuous normalizing flows (CNFs) are an attractive generative modeling technique, but they have thus far been held back by limitations in their simulation-based maximum likelihood training. In this paper, we introduce a new technique called conditional flow matching (CFM), a simulation-free training objective for CNFs. CFM features a stable regression objective like that used to train the stochastic flow in diffusion models but enjoys the efficient inference of deterministic flow models. In contrast to both diffusion models and prior CNF training algorithms, our CFM objective does not require the source distribution to be Gaussian or require evaluation of its density. Based on this new objective, we also introduce optimal transport CFM (OT-CFM), which creates simpler flows that are more stable to train and lead to faster inference, as evaluated in our experiments. Training CNFs with CFM improves results on a variety of conditional and unconditional generation tasks such as inferring single cell dynamics, unsupervised image translation, and Schrödinger bridge inference. Code is available at https://github.com/atong01/conditional-flow-matching .
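The "stable regression objective" mentioned above can be illustrated with a small sketch. It assumes the simplest flow-matching setup: a noise-free straight-line path between a source sample `x0` and a target sample `x1`, with the constant velocity `x1 - x0` as the regression target. The function names and the toy vector field are hypothetical, and the paper's actual conditional paths and pairing scheme (for example OT-CFM) may differ.

```python
def cfm_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity target."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # linear interpolation
    ut = [b - a for a, b in zip(x0, x1)]                # constant velocity along the path
    return xt, ut


def cfm_loss(v, x0, x1, t):
    """Squared error between a candidate vector field v(t, x) and the target velocity."""
    xt, ut = cfm_pair(x0, x1, t)
    pred = v(t, xt)
    return sum((p - u) ** 2 for p, u in zip(pred, ut))


# Toy check with a hypothetical vector field that always predicts zero velocity.
zero_field = lambda t, x: [0.0 for _ in x]
loss = cfm_loss(zero_field, [0.0, 0.0], [2.0, 4.0], 0.5)
```

No ODE is simulated during training in this sketch: each loss term needs only a sampled pair and a sampled time, which is the sense in which such objectives are simulation-free.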
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Machine Learning for Visualization Recommendation Systems: Open Challenges and Future Directions. (arXiv:2302.00569v1 [cs.LG])
    Visualization Recommendation Systems (VRS) are a novel and challenging field of study, whose aim is to automatically generate insightful visualizations from data, to support non-expert users in the process of information discovery. Despite its enormous application potential in the era of big data, progress in this area of research is being held back by several obstacles among which are the absence of standardized datasets to train recommendation algorithms, and the difficulty in defining quantitative criteria to assess the effectiveness of the generated plots. In this paper, we aim not only to summarize the state-of-the-art of VRS, but also to outline promising future research directions.
    Graph Neural Operators for Classification of Spatial Transcriptomics Data. (arXiv:2302.00658v1 [cs.LG])
    The inception of spatial transcriptomics has allowed improved comprehension of tissue architectures and the disentanglement of complex underlying biological, physiological, and pathological processes through their positional contexts. Recently, these contexts, and by extension the field, have seen much promise and elucidation with the application of graph learning approaches. In particular, neural operators have risen in regards to learning the mapping between infinite-dimensional function spaces. With basic to deep neural network architectures being data-driven, i.e. dependent on quality data for prediction, neural operators provide robustness by offering generalization among different resolutions despite low quality data. Graph neural operators are a variant that utilize graph networks to learn this mapping between function spaces. The aim of this research is to identify robust machine learning architectures that integrate spatial information to predict tissue types. Under this notion, we propose a study incorporating various graph neural network approaches to validate the efficacy of applying neural operators towards prediction of brain regions in mouse brain tissue samples as a proof of concept towards our purpose. We were able to achieve an F1 score of nearly 72% for the graph neural operator approach which outperformed all baseline and other graph network approaches.
    Experimental observation on a low-rank tensor model for eigenvalue problems. (arXiv:2302.00538v1 [cs.LG])
    Here we utilize a low-rank tensor model (LTM) as a function approximator, combined with the gradient descent method, to solve eigenvalue problems including the Laplacian operator and the harmonic oscillator. Experimental results show the superiority of the polynomial-based low-rank tensor model (PLTM) compared to the tensor neural network (TNN). We also test such low-rank architectures for the classification problem on the MNIST dataset.
    QCRS: Improve Randomized Smoothing using Quasi-Concave Optimization. (arXiv:2302.00209v1 [cs.LG])
    Randomized smoothing is currently the state-of-the-art method that provides certified robustness for deep neural networks. However, it often cannot achieve an adequate certified region on real-world datasets. One way to obtain a larger certified region is to use an input-specific algorithm instead of using a fixed Gaussian filter for all data points. Several methods based on this idea have been proposed, but they either suffer from high computational costs or gain marginal improvement in certified radius. In this work, we show that by exploiting the quasiconvex problem structure, we can find the optimal certified radii for most data points with slight computational overhead. This observation leads to an efficient and effective input-specific randomized smoothing algorithm. We conduct extensive experiments and empirical analysis on Cifar10 and ImageNet. The results show that the proposed method significantly enhances the certified radii with low computational overhead.
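For context, the certified radius that input-specific methods try to enlarge can be computed as in the sketch below. It uses the standard fixed-Gaussian bound R = sigma * Phi^{-1}(p_lower), where `p_lower` is a lower confidence bound on the top-class probability under noise. This is the widely used baseline formula, not the QCRS algorithm itself, and the function name is hypothetical.

```python
from statistics import NormalDist


def certified_radius(p_lower, sigma):
    """L2 radius certified by Gaussian smoothing with noise level sigma.

    p_lower must lie strictly between 0 and 1; values at or below 0.5
    certify nothing, so the radius is reported as 0.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


radius = certified_radius(0.9, 0.5)
```

In practice `p_lower` itself changes with `sigma`, so the radius is not simply increasing in `sigma`. Picking the best `sigma` for each input is therefore an optimization problem, which is the setting the abstract addresses.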
    GFlowNets for AI-Driven Scientific Discovery. (arXiv:2302.00615v1 [cs.LG])
    Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied on trial and error and even serendipity to a large extent, the last few decades have seen a surge of data-driven scientific discoveries. However, in order to truly leverage large-scale data sets and high-throughput experimental setups, machine learning methods will need to be further improved and better integrated in the scientific discovery pipeline. A key challenge for current machine learning methods in this context is the efficient exploration of very large search spaces, which requires techniques for estimating reducible (epistemic) uncertainty and generating sets of diverse and informative experiments to perform. This motivated a new probabilistic machine learning framework called GFlowNets, which can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop. GFlowNets learn to sample from a distribution given indirectly by a reward function corresponding to an unnormalized probability, which enables sampling diverse, high-reward candidates. GFlowNets can also be used to form efficient and amortized Bayesian posterior estimators for causal models conditioned on the already acquired experimental data. Having such posterior models can then provide estimators of epistemic uncertainty and information gain that can drive an experimental design policy. Altogether, here we will argue that GFlowNets can become a valuable tool for AI-driven scientific discovery, especially in scenarios of very large candidate spaces where we have access to cheap but inaccurate measurements or to expensive but accurate measurements. This is a common setting in the context of drug and material discovery, which we use as examples throughout the paper.
    Simple yet Effective Gradient-Free Graph Convolutional Networks. (arXiv:2302.00371v1 [cs.LG])
    Linearized Graph Neural Networks (GNNs) have attracted great attention in recent years for graph representation learning. Compared with nonlinear Graph Neural Network (GNN) models, linearized GNNs are much more time-efficient and can achieve comparable performances on typical downstream tasks such as node classification. Although some linearized GNN variants are purposely crafted to mitigate "over-smoothing", empirical studies demonstrate that they still somehow suffer from this issue. In this paper, we instead relate over-smoothing with the vanishing gradient phenomenon and craft a gradient-free training framework to achieve more efficient and effective linearized GNNs which can significantly overcome over-smoothing and enhance the generalization of the model. The experimental results demonstrate that our methods achieve better and more stable performances on node classification tasks with varying depths and cost much less training time.
    Internally Rewarded Reinforcement Learning. (arXiv:2302.00270v1 [cs.LG])
    We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an untrained policy impedes discriminator learning. We call this learning setting $\textit{Internally Rewarded Reinforcement Learning}$ (IRRL) as the reward is not provided directly by the environment but $\textit{internally}$ by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to IRRL. We theoretically derive and empirically analyze the effect of the reward function in IRRL and based on these analyses propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Deep Learning Approach to Predict Hemorrhage in Moyamoya Disease. (arXiv:2302.00188v1 [cs.LG])
    Objective: Reliable tools to predict moyamoya disease (MMD) patients at risk for hemorrhage could have significant value. The aim of this paper is to develop three machine learning classification algorithms to predict hemorrhage in moyamoya disease. Methods: Clinical data of consecutive MMD patients who were admitted to our hospital between 2009 and 2015 were reviewed. Demographic, clinical, and radiographic data were analyzed to develop artificial neural network (ANN), support vector machine (SVM), and random forest models. Results: We extracted 33 parameters, including 11 demographic and 22 radiographic features as input for model development. Of all compared classification results, ANN achieved the highest overall accuracy of 75.7% (95% CI, 68.6%-82.8%), followed by SVM with 69.2% (95% CI, 56.9%-81.5%) and random forest with 70.0% (95% CI, 57.0%-83.0%). Conclusions: The proposed ANN framework can be a potentially effective tool to predict the possibility of hemorrhage among adult MMD patients based on clinical information and radiographic features.
    Learning from Stochastic Labels. (arXiv:2302.00299v1 [cs.LG])
    Annotating multi-class instances is a crucial task in the field of machine learning. Unfortunately, identifying the correct class label from a long sequence of candidate labels is time-consuming and laborious. To alleviate this problem, we design a novel labeling mechanism called stochastic label. In this setting, stochastic label includes two cases: 1) identify a correct class label from a small number of randomly given labels; 2) annotate the instance with None label when given labels do not contain correct class label. In this paper, we propose a novel suitable approach to learn from these stochastic labels. We obtain an unbiased estimator that utilizes less supervised information in stochastic labels to train a multi-class classifier. Additionally, it is theoretically justifiable by deriving the estimation error bound of the proposed method. Finally, we conduct extensive experiments on widely-used benchmark datasets to validate the superiority of our method by comparing it with existing state-of-the-art methods.
    Learning Choice Functions with Gaussian Processes. (arXiv:2302.00406v1 [cs.LG])
    In consumer theory, ranking available objects by means of preference relations yields the most common description of individual choices. However, preference-based models assume that individuals: (1) give their preferences only between pairs of objects; (2) are always able to pick the best preferred object. In many situations, they may instead be choosing out of a set with more than two elements and, because of lack of information and/or incomparability (objects with contradictory characteristics), they may not be able to select a single most preferred object. To address these situations, we need a choice-model which allows an individual to express a set-valued choice. Choice functions provide such a mathematical framework. We propose a Gaussian Process model to learn choice functions from choice-data. The proposed model assumes a multiple utility representation of a choice function based on the concept of Pareto rationalization, and derives a strategy to learn both the number and the values of these latent multiple utilities. Simulation experiments demonstrate that the proposed model outperforms the state-of-the-art methods.
    Local convexity of the TAP free energy and AMP convergence for Z2-synchronization. (arXiv:2106.11428v2 [math.ST] UPDATED)
    We study mean-field variational Bayesian inference using the TAP approach, for Z2-synchronization as a prototypical example of a high-dimensional Bayesian model. We show that for any signal strength $\lambda > 1$ (the weak-recovery threshold), there exists a unique local minimizer of the TAP free energy functional near the mean of the Bayes posterior law. Furthermore, the TAP free energy in a local neighborhood of this minimizer is strongly convex. Consequently, a natural-gradient/mirror-descent algorithm achieves linear convergence to this minimizer from a local initialization, which may be obtained by a constant number of iterates of Approximate Message Passing (AMP). This provides a rigorous foundation for variational inference in high dimensions via minimization of the TAP free energy. We also analyze the finite-sample convergence of AMP, showing that AMP is asymptotically stable at the TAP minimizer for any $\lambda > 1$, and is linearly convergent to this minimizer from a spectral initialization for sufficiently large $\lambda$. Such a guarantee is stronger than results obtainable by state evolution analyses, which only describe a fixed number of AMP iterations in the infinite-sample limit. Our proofs combine the Kac-Rice formula and Sudakov-Fernique Gaussian comparison inequality to analyze the complexity of critical points that satisfy strong convexity and stability conditions within their local neighborhoods.
    CATFL: Certificateless Authentication-based Trustworthy Federated Learning for 6G Semantic Communications. (arXiv:2302.00271v1 [cs.CR])
    Federated learning (FL) provides an emerging approach for collaboratively training semantic encoder/decoder models of semantic communication systems, without private user data leaving the devices. Most existing studies on trustworthy FL aim to eliminate data poisoning threats that are produced by malicious clients, but in many cases, eliminating model poisoning attacks brought by fake servers is also an important objective. In this paper, a certificateless authentication-based trustworthy federated learning (CATFL) framework is proposed, which mutually authenticates the identity of clients and server. In CATFL, each client verifies the server's signature information before accepting the delivered global model to ensure that the global model is not delivered by false servers. Conversely, the server also verifies the clients' signature information before accepting the delivered model updates to ensure that they are submitted by authorized clients. Compared to PKI-based methods, CATFL avoids excessive certificate management overheads. Meanwhile, the anonymity of clients shields data poisoning attacks, while real-name registration may suffer from user-specific privacy leakage risks. Therefore, a pseudonym generation strategy is also presented in CATFL to achieve a trade-off between identity traceability and user anonymity, which is essential to conditionally prevent user-specific privacy leakage. Theoretical security analysis and evaluation results validate the superiority of CATFL.
    Bandit Convex Optimisation Revisited: FTRL Achieves $\tilde{O}(t^{1/2})$ Regret. (arXiv:2302.00358v1 [cs.LG])
    We show that a kernel estimator using multiple function evaluations can be easily converted into a sampling-based bandit estimator with expectation equal to the original kernel estimate. Plugging such a bandit estimator into the standard FTRL algorithm yields a bandit convex optimisation algorithm that achieves $\tilde{O}(t^{1/2})$ regret against adversarial time-varying convex loss functions.
    Equivariant Message Passing Neural Network for Crystal Material Discovery. (arXiv:2302.00485v1 [cs.LG])
    Automatic material discovery with desired properties is a fundamental challenge for material sciences. Considerable attention has recently been devoted to generating stable crystal structures. While existing work has shown impressive success on supervised tasks such as property prediction, the progress on unsupervised tasks such as material generation is still hampered by the limited extent to which the equivalent geometric representations of the same crystal are considered. To address this challenge, we propose EMPNN, a periodic equivariant message-passing neural network that learns crystal lattice deformation in an unsupervised fashion. Our model equivalently acts on lattice according to the deformation action that must be performed, making it suitable for crystal generation, relaxation and optimisation. We present experimental evaluations that demonstrate the effectiveness of our approach.
    Learning Prototype Classifiers for Long-Tailed Recognition. (arXiv:2302.00491v1 [cs.CV])
    The problem of long-tailed recognition (LTR) has received attention in recent years due to the fundamental power-law distribution of objects in the real world. Most recent works in LTR use softmax classifiers that have a tendency to correlate classifier norm with the amount of training data for a given class. On the other hand, Prototype classifiers do not suffer from this shortcoming and can deliver promising results simply using Nearest-Class-Mean (NCM), a special case where prototypes are empirical centroids. However, the potential of Prototype classifiers as an alternative to softmax in LTR is relatively underexplored. In this work, we propose Prototype classifiers, which jointly learn prototypes that minimize average cross-entropy loss based on probability scores from distances to prototypes. We theoretically analyze the properties of Euclidean-distance-based prototype classifiers that lead to stable gradient-based optimization which is robust to outliers. We further enhance Prototype classifiers by learning channel-dependent temperature parameters to enable independent distance scales along each channel. Our analysis shows that prototypes learned by Prototype classifiers are better separated than empirical centroids. Results on four long-tailed recognition benchmarks show that the Prototype classifier outperforms or is comparable to the state-of-the-art methods.
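A minimal sketch of "probability scores from distances to prototypes" follows. It assumes the scores are a softmax over negative (unsquared) Euclidean distances, with one prototype per class and no temperature parameters. The function names are hypothetical and the paper's exact parameterization may differ.

```python
import math


def prototype_probs(x, prototypes):
    """Softmax over negative Euclidean distances from x to each class prototype."""
    logits = [-math.dist(x, p) for p in prototypes]
    m = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(x, label, prototypes):
    """Loss that a learned-prototype classifier would minimize for one example."""
    return -math.log(prototype_probs(x, prototypes)[label])


# Two hypothetical class prototypes; the query sits exactly on the first one.
protos = [[0.0, 0.0], [3.0, 4.0]]
probs = prototype_probs([0.0, 0.0], protos)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Fixing the prototypes to the per-class feature means and predicting the closest one recovers Nearest-Class-Mean. Treating them as trainable parameters under `cross_entropy` is the learned variant the abstract describes.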
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    A Survey of Methods, Challenges and Perspectives in Causality. (arXiv:2302.00293v1 [cs.LG])
    The Causality field aims to find systematic methods for uncovering cause-effect relationships. Such methods can find applications in many research fields, justifying a great interest in this domain. Machine Learning models have shown success in a large variety of tasks by extracting correlation patterns from high-dimensional data but still struggle when generalizing out of their initial distribution. As causal engines aim to learn mechanisms that are independent from a data distribution, combining Machine Learning with Causality has the potential to bring benefits to the two fields. In our work, we motivate this assumption and provide applications. We first perform an extensive overview of the theories and methods for Causality from different perspectives. We then provide a deeper look at the connections between Causality and Machine Learning and describe the challenges met by the two domains. We show the early attempts to bring the fields together and the possible perspectives for the future. We finish by providing a large variety of applications for techniques from Causality.
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for quickest change detection. This variant is based on Fisher divergence and the Hyvärinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm enables change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.
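    To illustrate why a score-based statistic sidesteps the normalization constant, here is a minimal sketch of a CUSUM-style recursion driven by Hyvärinen scores. It assumes 1-D Gaussian pre- and post-change models, and it assumes the increment is a scaled difference of the two scores; the function names, the `scale` multiplier, and the threshold values are illustrative choices, not taken from the abstract.

```python
import random


def hyvarinen_score_gaussian(x, mean, var):
    """Hyvarinen score of a 1-D Gaussian: 0.5 * (d/dx log p)^2 + d^2/dx^2 log p.

    Only derivatives of log p appear, so the normalizing constant is never needed.
    """
    grad = -(x - mean) / var      # first derivative of log p
    laplacian = -1.0 / var        # second derivative of log p
    return 0.5 * grad * grad + laplacian


def score_cusum(stream, pre, post, threshold, scale=1.0):
    """Return the 1-based index at which the statistic first reaches the threshold, or None.

    pre and post are (mean, variance) pairs for the assumed pre- and post-change models.
    """
    stat = 0.0
    for n, x in enumerate(stream, start=1):
        # Positive on average once the data follow the post-change model.
        increment = scale * (
            hyvarinen_score_gaussian(x, *pre) - hyvarinen_score_gaussian(x, *post)
        )
        stat = max(0.0, stat + increment)  # CUSUM-style reflection at zero
        if stat >= threshold:
            return n
    return None


# Hypothetical stream: 200 samples from N(0, 1), then 200 samples from N(3, 1).
rng = random.Random(0)
stream = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
alarm = score_cusum(stream, pre=(0.0, 1.0), post=(3.0, 1.0), threshold=50.0)
print("alarm raised at sample", alarm)
```

    In this toy setting the increment has negative drift before the change and positive drift after it, so the alarm should fire shortly after sample 200.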
    Learning Cut Selection for Mixed-Integer Linear Programming via Hierarchical Sequence Model. (arXiv:2302.00244v1 [cs.LG])
    Cutting planes (cuts) are important for solving mixed-integer linear programs (MILPs), which formulate a wide range of important real-world applications. Cut selection -- which aims to select a proper subset of the candidate cuts to improve the efficiency of solving MILPs -- heavily depends on (P1) which cuts should be preferred, and (P2) how many cuts should be selected. Although many modern MILP solvers tackle (P1)-(P2) by manually designed heuristics, machine learning offers a promising approach to learn more effective heuristics from MILPs collected from specific applications. However, many existing learning-based methods focus on learning which cuts should be preferred, neglecting the importance of learning the number of cuts that should be selected. Moreover, we observe from extensive empirical results that (P3) what order of selected cuts should be preferred also has a significant impact on the efficiency of solving MILPs. To address this challenge, we propose a novel hierarchical sequence model (HEM) to learn cut selection policies via reinforcement learning. Specifically, HEM consists of a two-level model: (1) a higher-level model to learn the number of cuts that should be selected, and (2) a lower-level model -- that formulates the cut selection task as a sequence-to-sequence learning problem -- to learn policies selecting an ordered subset with the size determined by the higher-level model. To the best of our knowledge, HEM is the first method that can tackle (P1)-(P3) in cut selection simultaneously from a data-driven perspective. Experiments show that HEM significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs, including MIPLIB 2017. Moreover, experiments demonstrate that HEM generalizes well to MILPs that are significantly larger than those seen during training.
    Fourier series weight in quantum machine learning. (arXiv:2302.00105v1 [quant-ph])
    In this work, we aim to confirm the impact of the Fourier series on quantum machine learning models. We propose models, tests, and demonstrations to achieve this objective. We designed a quantum machine learning model based on Hamiltonian encoding. With a subtle change, we performed trigonometric interpolation, binary and multiclass classification, and a quantum signal processing application. We also proposed a block diagram for approximately determining the Fourier coefficient based on quantum machine learning. We implemented and tested all the proposed models using the PennyLane framework.
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    Efficient Scopeformer: Towards Scalable and Rich Feature Extraction for Intracranial Hemorrhage Detection. (arXiv:2302.00220v1 [cs.CV])
    The quality and richness of feature maps extracted by convolutional neural networks (CNNs) and vision Transformers (ViTs) directly relate to robust model performance. In medical computer vision, these information-rich features are crucial for detecting rare cases within large datasets. This work presents the "Scopeformer," a novel multi-CNN-ViT model for intracranial hemorrhage classification in computed tomography (CT) images. The Scopeformer architecture is scalable and modular, which allows utilizing various CNN architectures as the backbone with diversified output features and pre-training strategies. We propose effective feature projection methods to reduce redundancies among CNN-generated features and to control the input size of ViTs. Extensive experiments with various Scopeformer models show that the model performance is proportional to the number of convolutional blocks employed in the feature extractor. Using multiple strategies, including diversifying the pre-training paradigms for CNNs, different pre-training datasets, and style transfer techniques, we demonstrate an overall improvement in the model performance at various computational budgets. Later, we propose smaller compute-efficient Scopeformer versions with three different types of input and output ViT configurations. Efficient Scopeformers use four different pre-trained CNN architectures as feature extractors to increase feature richness. Our best Efficient Scopeformer model achieved an accuracy of 96.94% and a weighted logarithmic loss of 0.083 with an eightfold reduction in the number of trainable parameters compared to the base Scopeformer. Another version of the Efficient Scopeformer model further reduced the parameter space by almost 17 times with negligible performance reduction. Hybrid CNNs and ViTs might provide the desired feature richness for developing accurate medical computer vision models.
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.
    Graph-based Time-Series Anomaly Detection: A Survey. (arXiv:2302.00058v1 [cs.LG])
    With the recent advances in technology, a wide range of systems continue to collect large amounts of data over time, thus generating time series. Detecting anomalies in time series data is an important task in various applications such as e-commerce, cybersecurity, and health care monitoring. However, Time-series Anomaly Detection (TSAD) is very challenging as it requires considering both the temporal dependency and the structural dependency. Recent graph-based approaches have made impressive progress in tackling the challenges of this field. In this survey, we conduct a comprehensive and up-to-date review of Graph-based Time-series Anomaly Detection (G-TSAD). First, we explore the significant potential of graph-based methods in identifying different types of anomalies in time series data. Then, we provide a structured and comprehensive review of the state-of-the-art graph anomaly detection techniques in the context of time series. Finally, we discuss the technical challenges and potential future directions for possible improvements in this research field.
    OrthoReg: Improving Graph-regularized MLPs via Orthogonality Regularization. (arXiv:2302.00109v1 [cs.LG])
    Graph Neural Networks (GNNs) currently dominate the modeling of graph-structured data, while their high reliance on graph structure for inference significantly impedes them from widespread applications. By contrast, Graph-regularized MLPs (GR-MLPs) implicitly inject the graph structure information into model weights, while their performance can hardly match that of GNNs in most tasks. This motivates us to study the causes of the limited performance of GR-MLPs. In this paper, we first demonstrate, through empirical observations and theoretical analysis, that node embeddings learned from conventional GR-MLPs suffer from dimensional collapse, a phenomenon in which a few of the largest eigenvalues dominate the embedding space. As a result, the expressive power of the learned node representations is constrained. We further propose OrthoReg, a novel GR-MLP model to mitigate the dimensional collapse issue. Through a soft regularization loss on the correlation matrix of node embeddings, OrthoReg explicitly encourages orthogonal node representations and thus can naturally avoid dimensionally collapsed representations. Experiments on traditional transductive semi-supervised classification tasks and inductive node classification for cold-start scenarios demonstrate its effectiveness and superiority.
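    A soft penalty on the correlation matrix of node embeddings can be sketched as follows. This is one plausible form (the squared Frobenius distance between the correlation matrix and the identity), written under the assumption that embeddings are standardized per dimension; the exact loss and weighting used by OrthoReg are not given in the abstract, and the function names are hypothetical.

```python
import math


def correlation_matrix(embeddings):
    """Correlation matrix between embedding dimensions (rows are nodes, columns are dimensions)."""
    n = len(embeddings)
    d = len(embeddings[0])
    standardized = []
    for j in range(d):
        column = [row[j] for row in embeddings]
        mean = sum(column) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
        std = std if std > 0 else 1.0  # a constant dimension is left at zero after centering
        standardized.append([(v - mean) / std for v in column])
    return [
        [sum(a * b for a, b in zip(standardized[i], standardized[j])) / n for j in range(d)]
        for i in range(d)
    ]


def orthogonality_penalty(embeddings):
    """Squared Frobenius distance between the correlation matrix and the identity."""
    corr = correlation_matrix(embeddings)
    d = len(corr)
    return sum(
        (corr[i][j] - (1.0 if i == j else 0.0)) ** 2 for i in range(d) for j in range(d)
    )


collapsed = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]                    # second dimension duplicates the first
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]  # uncorrelated dimensions
print(orthogonality_penalty(collapsed), orthogonality_penalty(decorrelated))
```

    The collapsed embeddings are penalized because their two dimensions are perfectly correlated, while the decorrelated embeddings incur no penalty; in training, such a term would be added to the task loss with a tunable weight.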
    TwinExplainer: Explaining Predictions of an Automotive Digital Twin. (arXiv:2302.00152v1 [cs.LG])
    Vehicles are complex Cyber Physical Systems (CPS) that operate in a variety of environments, and the likelihood of failure of one or more subsystems, such as the engine, transmission, brakes, and fuel, can result in unscheduled downtime and incur high maintenance or repair costs. In order to prevent these issues, it is crucial to continuously monitor the health of various subsystems and identify abnormal sensor channel behavior. Data-driven Digital Twin (DT) systems are capable of such a task. Current DT technologies utilize various Deep Learning (DL) techniques that are constrained by the lack of justification or explanation for their predictions. This opacity can influence decision-making and raises concerns about user trust. This paper presents a solution to this issue, where the TwinExplainer system, with its three-layered architectural pipeline, explains the predictions of an automotive DT. Such a system can assist automotive stakeholders in understanding the global scale of the sensor channels and how they contribute towards generic DT predictions. TwinExplainer can also visualize explanations for both normal and abnormal local predictions computed by the DT.
    Neural Control of Parametric Solutions for High-dimensional Evolution PDEs. (arXiv:2302.00045v1 [math.NA])
    We develop a novel computational framework to approximate solution operators of evolution partial differential equations (PDEs). By employing a general nonlinear reduced-order model, such as a deep neural network, to approximate the solution of a given PDE, we realize that the evolution of the model parameter is a control problem in the parameter space. Based on this observation, we propose to approximate the solution operator of the PDE by learning the control vector field in the parameter space. From any initial value, this control field can steer the parameter to generate a trajectory such that the corresponding reduced-order model solves the PDE. This allows for substantially reduced computational cost to solve the evolution PDE with arbitrary initial conditions. We also develop comprehensive error analysis for the proposed method when solving a large class of semilinear parabolic PDEs. Numerical experiments on different high-dimensional evolution PDEs with various initial conditions demonstrate the promising results of the proposed method.
    Detection of Tomato Ripening Stages using Yolov3-tiny. (arXiv:2302.00164v1 [cs.CV])
    One of the most important agricultural products in Mexico is the tomato (Solanum lycopersicum), which ranks fourth among the country's most produced products. It is therefore necessary to improve its production, and building automatic detection systems that detect, classify, and keep track of the fruits is one way to achieve this. In this paper, we address the design of a computer vision system to detect tomatoes at different ripening stages. To solve the problem, we use a neural network-based model for tomato classification and detection. Specifically, we use the YOLOv3-tiny model because it is one of the lightest current deep neural networks. To train it, we perform two grid searches testing several combinations of hyperparameters. Our experiments showed an F1-score of 90.0% in the localization and classification of ripening stages in a custom dataset.
    FI-ODE: Certified and Robust Forward Invariance in Neural ODEs. (arXiv:2210.16940v2 [cs.LG] UPDATED)
    Forward invariance is a long-studied property in control theory that is used to certify that a dynamical system stays within some pre-specified set of states for all time, and also admits robustness guarantees (e.g., the certificate holds under perturbations). We propose a general framework for training and provably certifying robust forward invariance in Neural ODEs. We apply this framework in two settings: certified adversarial robustness for image classification, and certified safety in continuous control. Notably, our method empirically produces superior adversarial robustness guarantees compared to prior work on certifiably robust Neural ODEs (including implicit-depth models).
    Stable Target Field for Reduced Variance Score Estimation in Diffusion Models. (arXiv:2302.00670v1 [cs.LG])
    Diffusion models generate samples by reversing a fixed forward diffusion process. Despite already providing impressive empirical results, these diffusion model algorithms can be further improved by reducing the variance of the training targets in their denoising score-matching objective. We argue that the source of such variance lies in the handling of intermediate noise-variance scales, where multiple modes in the data affect the direction of reverse paths. We propose to remedy the problem by incorporating a reference batch which we use to calculate weighted conditional scores as more stable training targets. We show that the procedure indeed helps in the challenging intermediate regime by reducing (the trace of) the covariance of training targets. The new stable targets can be seen as trading bias for reduced variance, where the bias vanishes with increasing reference batch size. Empirically, we show that the new objective improves the image quality, stability, and training speed of various popular diffusion models across datasets with both general ODE and SDE solvers. When used in combination with EDM, our method yields a current SOTA FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10 generation task. The code is available at https://github.com/Newbeeer/stf
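    The idea of replacing a single-sample target with a weighted average of conditional scores over a reference batch can be sketched as below. The sketch assumes a Gaussian perturbation kernel x_t = x_0 + sigma * noise, so that each conditional score is (x_i - x_t) / sigma^2 and the weights are proportional to the kernel density of x_t given x_i; the abstract does not spell out these details, and `stable_target` is a hypothetical name.

```python
import math


def stable_target(x_t, reference_batch, sigma):
    """Weighted average of conditional scores (x_i - x_t) / sigma^2 over a reference batch.

    Weights are a softmax of the Gaussian log-densities log N(x_t; x_i, sigma^2 I),
    assuming the perturbation x_t = x_0 + sigma * noise.
    """
    log_weights = [
        -sum((a - b) ** 2 for a, b in zip(x_t, x_i)) / (2.0 * sigma ** 2)
        for x_i in reference_batch
    ]
    shift = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * (x_i[k] - x_t[k]) for w, x_i in zip(weights, reference_batch)) / sigma ** 2
        for k in range(len(x_t))
    ]


# With a single reference point this reduces to the usual denoising target.
print(stable_target([0.0, 0.0], [[1.0, 0.0]], sigma=2.0))
# Two symmetric modes pull in opposite directions and cancel at the midpoint.
print(stable_target([0.0], [[1.0], [-1.0]], sigma=1.0))
```

    At intermediate noise levels several reference points receive non-negligible weight, which is where averaging would be expected to reduce the variance of the target.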
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, there has been growing interest in developing learning models based on tensor networks, which offer a principled understanding of expressive power through entanglement properties and serve as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than that of standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.
    How Out-of-Distribution Data Hurts Semi-Supervised Learning. (arXiv:2010.03658v3 [cs.LG] UPDATED)
    Recent semi-supervised learning (SSL) algorithms have achieved higher overall performance thanks to better representations of unlabeled data. Nonetheless, recent research suggests that the performance of SSL algorithms can be degraded when the unlabeled set contains out-of-distribution examples (OODs). This work addresses the following question: How do out-of-distribution (OOD) data adversely affect semi-supervised learning algorithms? To answer this question, we investigate the critical causes of OOD's negative effect on SSL algorithms. In particular, we found that 1) certain kinds of OOD data instances that are close to the decision boundary have a more significant impact on performance than those that are further away, and 2) Batch Normalization (BN), a popular module, may degrade rather than improve performance when the unlabeled set contains OODs. In this context, we developed a unified weighted robust SSL framework that can be easily extended to many existing SSL algorithms and improve their robustness against OODs. More specifically, we developed an efficient bi-level optimization algorithm that could accommodate high-order approximations of the objective and scale to multiple inner optimization steps to learn a massive number of weight parameters while outperforming existing low-order approximations of bi-level optimization. Further, we conduct a theoretical study of the impact of faraway OODs in the BN step and propose a weighted batch normalization (WBN) procedure for improved performance. Finally, we discuss the connection between our approach and low-order approximation techniques. Our experiments on synthetic and real-world datasets demonstrate that our proposed approach significantly enhances the robustness of four representative SSL algorithms against OODs compared to four state-of-the-art robust SSL strategies.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Efficient Meta-Learning via Error-based Context Pruning for Implicit Neural Representations. (arXiv:2302.00617v1 [cs.LG])
    We introduce an efficient optimization-based meta-learning technique for learning large-scale implicit neural representations (INRs). Our main idea is designing an online selection of context points, which can significantly reduce memory requirements for meta-learning in any established setting. By doing so, we expect additional memory savings which allow longer per-signal adaptation horizons (at a given memory budget), leading to better meta-initializations by reducing myopia and, more crucially, enabling learning on high-dimensional signals. To implement such context pruning, our technical novelty is three-fold. First, we propose a selection scheme that adaptively chooses a subset at each adaptation step based on the predictive error, leading to the modeling of the global structure of the signal in early steps and enabling the later steps to capture its high-frequency details. Second, we counteract any possible information loss from context pruning by minimizing the parameter distance to a bootstrapped target model trained on a full context set. Finally, we suggest using the full context set with a gradient scaling scheme at test-time. Our technique is model-agnostic, intuitive, and straightforward to implement, showing significant reconstruction improvements for a wide range of signals. Code is available at https://github.com/jihoontack/ECoP
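    The error-based selection step can be illustrated with a small sketch that keeps the context points on which the current model is most wrong. The squared-error criterion, the fixed `keep` budget, and the function name `prune_context` are assumptions made for illustration; the abstract only states that the subset is chosen from the predictive error at each adaptation step.

```python
def prune_context(coords, targets, predict, keep):
    """Keep the `keep` context points with the largest squared predictive error.

    coords and targets are parallel lists; predict maps a coordinate to a prediction.
    The kept points are returned in their original order.
    """
    errors = [(predict(c) - t) ** 2 for c, t in zip(coords, targets)]
    ranked = sorted(range(len(coords)), key=lambda i: errors[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [coords[i] for i in kept], [targets[i] for i in kept]


# Toy signal t = c^2 and a crude current model f(c) = c.
coords = [0.0, 1.0, 2.0, 3.0]
targets = [0.0, 1.0, 4.0, 9.0]
kept_coords, kept_targets = prune_context(coords, targets, predict=lambda c: c, keep=2)
print(kept_coords, kept_targets)
```

    Only the retained points would then enter the adaptation step, which is where the memory saving would come from.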
    A Fair Empirical Risk Minimization with Generalized Entropy. (arXiv:2202.11966v3 [cs.LG] UPDATED)
    This paper studies a parametric family of algorithmic fairness metrics, called generalized entropy, which was originally used in public welfare and was recently introduced to the machine learning community. For generalized entropy to be a meaningful metric for evaluating algorithmic fairness, it must be able to specify the fairness requirements of a classification problem, and an algorithm must be able to realize those requirements with small deviation. We investigate the role of generalized entropy as a design parameter for fair classification algorithms through a fair empirical risk minimization with a constraint specified in terms of generalized entropy. We theoretically and experimentally study the learnability of the problem.
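    For readers unfamiliar with the metric, the generalized entropy index for a parameter alpha other than 0 or 1 is GE(alpha) = (1 / (n * alpha * (alpha - 1))) * sum_i ((b_i / mu)^alpha - 1), where b_i is an individual benefit and mu is the mean benefit. The sketch below computes it, assuming the benefit convention b_i = prediction_i - label_i + 1 that is common in the fairness literature; the abstract does not say which benefit function the paper uses.

```python
def generalized_entropy(benefits, alpha=2.0):
    """Generalized entropy index for alpha not in {0, 1}; zero when all benefits are equal."""
    if alpha in (0.0, 1.0):
        raise ValueError("this sketch covers only alpha outside {0, 1}")
    n = len(benefits)
    mean = sum(benefits) / n
    return sum((b / mean) ** alpha - 1.0 for b in benefits) / (n * alpha * (alpha - 1.0))


def benefits_from_predictions(y_true, y_pred):
    """Assumed benefit convention: b_i = y_pred_i - y_true_i + 1 for binary labels."""
    return [p - t + 1 for t, p in zip(y_true, y_pred)]


equal = generalized_entropy([1.0, 1.0, 1.0])                                # perfectly even benefits
unequal = generalized_entropy(benefits_from_predictions([1, 0], [0, 1]))    # benefits [0, 2]
print(equal, unequal)
```

    With alpha = 2 the index equals half the squared coefficient of variation of the benefits, which is why the benefits [0, 2] give 0.5.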
    Anderson Acceleration For Bioinformatics-Based Machine Learning. (arXiv:2302.00347v1 [cs.LG])
    Anderson acceleration (AA) is a well-known method for accelerating the convergence of iterative algorithms, with applications in various fields including deep learning and optimization. Despite its popularity in these areas, the effectiveness of AA in classical machine learning classifiers has not been thoroughly studied. Tabular data, in particular, presents a unique challenge for deep learning models, and classical machine learning models are known to perform better in these scenarios. However, the convergence analysis of these models has received limited attention. To address this gap in research, we implement a support vector machine (SVM) classifier variant that incorporates AA to speed up convergence. We evaluate the performance of our SVM with and without Anderson acceleration on several datasets from the biology domain and demonstrate that the use of AA significantly improves convergence and reduces the training loss as the number of iterations increases. Our findings provide a promising perspective on the potential of Anderson acceleration in the training of simple machine learning classifiers and underscore the importance of further research in this area. By showing the effectiveness of AA in this setting, we aim to inspire more studies that explore the applications of AA in classical machine learning.
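    As background, Anderson acceleration with a memory of one previous iterate can be sketched for a generic fixed-point problem x = g(x). This is a generic illustration, not the SVM variant described in the abstract; the function name and the toy map are hypothetical.

```python
def anderson_m1(g, x0, iterations=50, tol=1e-12):
    """Anderson acceleration with memory 1 for the fixed-point problem x = g(x).

    Each update mixes the two most recent values of g so that the mixed residual
    g(x) - x is minimized in the least-squares sense.
    """
    x_prev = list(x0)
    g_prev = g(x_prev)
    f_prev = [gi - xi for gi, xi in zip(g_prev, x_prev)]  # residual at the previous iterate
    x = list(g_prev)                                      # first step is a plain iteration
    for _ in range(iterations):
        g_x = g(x)
        f = [gi - xi for gi, xi in zip(g_x, x)]
        if sum(v * v for v in f) ** 0.5 < tol:
            break
        df = [a - b for a, b in zip(f, f_prev)]
        denom = sum(v * v for v in df)
        # Closed-form least-squares mixing coefficient for memory 1.
        gamma = sum(a * b for a, b in zip(f, df)) / denom if denom > 0 else 0.0
        x_next = [gi - gamma * (gi - gpi) for gi, gpi in zip(g_x, g_prev)]
        x_prev, g_prev, f_prev, x = x, g_x, f, x_next
    return x


# Toy contraction g(x) = 0.5 * x + 1 with fixed point x = 2.
solution = anderson_m1(lambda v: [0.5 * v[0] + 1.0], [0.0])
print(solution)
```

    For a scalar affine map, this update coincides with a secant step and lands on the fixed point after one accelerated iteration, whereas the plain iteration only halves the error each step.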
    Pessimistic Off-Policy Optimization for Learning to Rank. (arXiv:2206.02593v3 [cs.LG] UPDATED)
    Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended and thus logged more frequently than others. This is further perpetuated when recommending a list of items, as the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on parameters of click models and then return the list with the highest pessimistic estimate of its value. This approach is computationally efficient and we analyze it. We study its Bayesian and frequentist variants, and overcome the limitation of an unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines, is robust, and is also general.
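    The key idea can be illustrated with a frequentist sketch: estimate each item's click probability from logged data, subtract a confidence radius, and rank by the resulting lower bound. The Hoeffding-style radius, the per-item click statistics, and the assumption that list value increases with each item's click probability are simplifications chosen for illustration; the paper's click models and bounds may differ.

```python
import math


def lower_confidence_bound(clicks, impressions, delta=0.05):
    """Hoeffding-style lower bound on a click probability, clamped at zero."""
    if impressions == 0:
        return 0.0  # no data: the most pessimistic value
    mean = clicks / impressions
    radius = math.sqrt(math.log(1.0 / delta) / (2.0 * impressions))
    return max(0.0, mean - radius)


def pessimistic_ranking(stats, k, delta=0.05):
    """Return the k items with the highest lower confidence bounds.

    stats maps item -> (clicks, impressions) taken from logged data.
    """
    scores = {
        item: lower_confidence_bound(clicks, impressions, delta)
        for item, (clicks, impressions) in stats.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical log: item "a" was clicked once in a single impression.
logged = {"a": (1, 1), "b": (800, 1000), "c": (50, 1000)}
print(pessimistic_ranking(logged, k=2))
```

    Item "a" has the highest empirical click rate but almost no data, so its lower bound collapses to zero and the well-logged items are preferred; ranking by the empirical mean alone would have placed "a" first.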
    HCR-Net: A deep learning based script independent handwritten character recognition network. (arXiv:2108.06663v3 [cs.CV] UPDATED)
    Despite being studied extensively for a few decades, handwritten character recognition (HCR) is still considered a challenging learning problem in pattern recognition, and there is very limited research on script independent models. This is mainly because of similarity in structure of characters, different handwriting styles, noisy datasets, diversity of scripts, focus of the conventional research on handcrafted feature extraction techniques, and unavailability of public datasets and code-repositories to reproduce the results. On the other hand, deep learning has witnessed huge success in different areas of pattern recognition, including HCR, and provides end-to-end learning. However, deep learning techniques are computationally expensive, need large amounts of data for training and have been developed for specific scripts only. To address the above limitations, we have proposed a novel generic deep learning architecture for script independent handwritten character recognition, called HCR-Net. HCR-Net is based on a novel transfer learning approach for HCR, which partly utilizes feature extraction layers of a pre-trained network. Due to transfer learning and image-augmentation, HCR-Net provides faster and computationally efficient training, better performance and better generalizations, and can work with small datasets. HCR-Net is extensively evaluated on 40 publicly available datasets of Bangla, Punjabi, Hindi, English, Swedish, Urdu, Farsi, Tibetan, Kannada, Malayalam, Telugu, Marathi, Nepali and Arabic languages, and established 26 new benchmark results while performing close to the best results in the remaining cases. HCR-Net showed performance improvements up to 11% against the existing results and achieved a fast convergence rate showing up to 99% of final performance in the very first epoch. HCR-Net significantly outperformed the state-of-the-art transfer learning techniques...
    Diffusion Models for High-Resolution Solar Forecasts. (arXiv:2302.00170v1 [cs.LG])
    Forecasting future weather and climate is inherently difficult. Machine learning offers new approaches to increase the accuracy and computational efficiency of forecasts, but current methods are unable to accurately model uncertainty in high-dimensional predictions. Score-based diffusion models offer a new approach to modeling probability distributions over many dependent variables, and in this work, we demonstrate how they provide probabilistic forecasts of weather and climate variables at unprecedented resolution, speed, and accuracy. We apply the technique to day-ahead solar irradiance forecasts by generating many samples from a diffusion model trained to super-resolve coarse-resolution numerical weather predictions to high-resolution weather satellite observations.
    Deep Power Laws for Hyperparameter Optimization. (arXiv:2302.00441v1 [cs.LG])
    Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of hyperparameter optimization; however, most of them do not exploit the scaling law property of learning curves. In this work, we propose Deep Power Laws (DPL), an ensemble of neural network models conditioned to yield predictions that follow a power-law scaling pattern. Our method dynamically decides which configurations to pause and train incrementally by making use of gray-box evaluations. We compare our method against 7 state-of-the-art competitors on 3 benchmarks related to tabular, image, and NLP datasets covering 57 diverse tasks. Our method achieves the best results across all benchmarks by obtaining the best any-time results compared to all competitors.
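    As a rough illustration of what a power-law scaling pattern for a learning curve can look like, the sketch below assumes the loss after a training budget b has the form alpha + beta * b^(-gamma). This functional form, the helper names, and the sample values are assumptions made for illustration; the sketch does not reproduce DPL's neural ensemble or its gray-box scheduling. It recovers the three coefficients from noiseless losses at budgets b, 2b, and 4b, then extrapolates to a larger budget.

```python
import math


def power_law(budget, alpha, beta, gamma):
    """Assumed curve shape: loss = alpha + beta * budget**(-gamma)."""
    return alpha + beta * budget ** (-gamma)


def fit_power_law(b, y1, y2, y3):
    """Recover (alpha, beta, gamma) from noiseless losses at budgets b, 2b, 4b.

    For a power law, consecutive differences at doubling budgets shrink by
    the constant factor 2**gamma, which gives a closed-form solution.
    """
    ratio = (y1 - y2) / (y2 - y3)
    gamma = math.log2(ratio)
    beta = (y1 - y2) / (b ** (-gamma) * (1.0 - 2.0 ** (-gamma)))
    alpha = y1 - beta * b ** (-gamma)
    return alpha, beta, gamma


# Hypothetical partial learning curve observed at epochs 1, 2, and 4.
observed = [power_law(b, 0.1, 2.0, 0.5) for b in (1, 2, 4)]
alpha, beta, gamma = fit_power_law(1, *observed)

# Extrapolate the fitted curve to epoch 64 without training that far.
extrapolated = power_law(64, alpha, beta, gamma)
```

    With real, noisy curves a least-squares fit would replace the closed form; the point is only that a few early observations determine the whole extrapolated curve once a power-law shape is assumed.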
    Analyzing Leakage of Personally Identifiable Information in Language Models. (arXiv:2302.00539v1 [cs.LG])
    Language Models (LMs) have been shown to leak information about training data through sentence-level membership inference and reconstruction attacks. Understanding the risk of LMs leaking Personally Identifiable Information (PII) has received less attention, which can be attributed to the false assumption that dataset curation techniques such as scrubbing are sufficient to prevent PII leakage. Scrubbing techniques reduce but do not prevent the risk of PII leakage: in practice scrubbing is imperfect and must balance the trade-off between minimizing disclosure and preserving the utility of the dataset. On the other hand, it is unclear to what extent algorithmic defenses such as differential privacy, designed to guarantee sentence- or user-level privacy, prevent PII disclosure. In this work, we propose (i) a taxonomy of PII leakage in LMs, (ii) metrics to quantify PII leakage, and (iii) attacks showing that PII leakage is a threat in practice. Our taxonomy provides rigorous game-based definitions for PII leakage via black-box extraction, inference, and reconstruction attacks with only API access to an LM. We empirically evaluate attacks against GPT-2 models fine-tuned on three domains: case law, health care, and e-mails. Our main contributions are (i) novel attacks that can extract up to 10 times more PII sequences than existing attacks, (ii) showing that sentence-level differential privacy reduces the risk of PII disclosure but still leaks about 3% of PII sequences, and (iii) a subtle connection between record-level membership inference and PII reconstruction.
    Efficient Multi-Task Reinforcement Learning via Selective Behavior Sharing. (arXiv:2302.00671v1 [cs.LG])
    The ability to leverage shared behaviors between tasks is critical for sample-efficient multi-task reinforcement learning (MTRL). While prior methods have primarily explored parameter and data sharing, direct behavior-sharing has been limited to task families requiring similar behaviors. Our goal is to extend the efficacy of behavior-sharing to more general task families that could require a mix of shareable and conflicting behaviors. Our key insight is that an agent's behavior across tasks can be used for mutually beneficial exploration. To this end, we propose a simple MTRL framework for identifying shareable behaviors over tasks and incorporating them to guide exploration. We empirically demonstrate how behavior sharing improves sample efficiency and final performance on manipulation and navigation MTRL tasks and is even complementary to parameter sharing. Result videos are available at https://sites.google.com/view/qmp-mtrl.
    The Power of External Memory in Increasing Predictive Model Capacity. (arXiv:2302.00003v1 [cs.LG])
    One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
    Model-Parallel Fourier Neural Operators as Learned Surrogates for Large-Scale Parametric PDEs. (arXiv:2204.01205v3 [cs.LG] UPDATED)
    Fourier neural operators (FNOs) are a recently introduced neural network architecture for learning solution operators of partial differential equations (PDEs), which have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional numerical PDE solvers. However, due to the high dimensionality of their input data and network weights, FNOs have so far only been applied to two-dimensional or small three-dimensional problems. To remove this limited problem-size barrier, we propose a model-parallel version of FNOs based on domain-decomposition of both the input data and network weights. We demonstrate that our model-parallel FNO is able to predict time-varying PDE solutions of over 2.6 billion variables on Perlmutter using up to 512 A100 GPUs and show an example of training a distributed FNO on the Azure cloud for simulating multiphase CO$_2$ dynamics in the Earth's subsurface.
    Deep Dependency Networks for Multi-Label Classification. (arXiv:2302.00633v1 [cs.LG])
    We propose a simple approach which combines the strengths of probabilistic graphical models and deep learning architectures for solving the multi-label classification task, focusing specifically on image and video data. First, we show that the performance of previous approaches that combine Markov Random Fields with neural networks can be modestly improved by leveraging more powerful methods such as iterative join graph propagation, integer linear programming, and $\ell_1$ regularization-based structure learning. Then we propose a new modeling framework called deep dependency networks, which augments a dependency network, a model that is easy to train and learns more accurate dependencies but is limited to Gibbs sampling for inference, to the output layer of a neural network. We show that despite its simplicity, jointly learning this new architecture yields significant improvements in performance over the baseline neural network. In particular, our experimental evaluation on three video activity classification datasets: Charades, Textually Annotated Cooking Scenes (TACoS), and Wetlab, and three multi-label image classification datasets: MS-COCO, PASCAL VOC, and NUS-WIDE show that deep dependency networks are almost always superior to pure neural architectures that do not use dependency networks.
    Test-Time Amendment with a Coarse Classifier for Fine-Grained Classification. (arXiv:2302.00368v1 [cs.CV])
    We investigate the problem of reducing mistake severity for fine-grained classification. Fine-grained classification can be challenging, mainly due to the requirement of knowledge or domain expertise for accurate annotation. However, humans are particularly adept at performing coarse classification as it requires relatively low levels of expertise. To this end, we present a novel approach for Post-Hoc Correction called Hierarchical Ensembles (HiE) that utilizes label hierarchy to improve the performance of fine-grained classification at test-time using the coarse-grained predictions. By only requiring the parents of leaf nodes, our method significantly reduces avg. mistake severity while improving top-1 accuracy on the iNaturalist-19 and tieredImageNet-H datasets, achieving a new state-of-the-art on both benchmarks. We also investigate the efficacy of our approach in the semi-supervised setting. Our approach brings notable gains in top-1 accuracy while significantly decreasing the severity of mistakes as training data decreases for the fine-grained classes. The simplicity and post-hoc nature of HiE render it practical to be used with any off-the-shelf trained model to improve its predictions further.
    Generative Adversarial Symmetry Discovery. (arXiv:2302.00236v1 [cs.LG])
    Despite the success of equivariant neural networks in scientific applications, they require knowing the symmetry group a priori. However, it may be difficult to know the right symmetry to use as an inductive bias in practice and enforcing the wrong symmetry could hurt the performance. In this paper, we propose a framework, LieGAN, to automatically discover equivariances from a dataset using a paradigm akin to generative adversarial training. Specifically, a generator learns a group of transformations applied to the data, which preserves the original distribution and fools the discriminator. LieGAN represents symmetry as interpretable Lie algebra basis and can discover various symmetries such as rotation group $\mathrm{SO}(n)$ and restricted Lorentz group $\mathrm{SO}(1,3)^+$ in trajectory prediction and top quark tagging tasks. The learned symmetry can also be readily used in several existing equivariant neural networks to improve accuracy and generalization in prediction.
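    To make the phrase "Lie algebra basis" concrete, the sketch below takes the single generator of the rotation group SO(2) and exponentiates it to obtain a rotation matrix. It only illustrates how a Lie algebra element maps to a group transformation; it is not LieGAN's training procedure, and the helper names and the truncated Taylor series are choices made for this example.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=30):
    """Matrix exponential via a truncated Taylor series: sum_k a^k / k!."""
    n = len(a)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


# The one-dimensional Lie algebra so(2) is spanned by this antisymmetric matrix.
GENERATOR = [[0.0, -1.0], [1.0, 0.0]]

# Scaling the generator by an angle and exponentiating yields a rotation.
theta = math.pi / 3
rotation = mat_exp([[theta * x for x in row] for row in GENERATOR])
```

    The resulting matrix agrees with the familiar rotation by theta, so a learned generator of this kind can be read directly as "the data are symmetric under planar rotations".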
    Detecting Harmful Agendas in News Articles. (arXiv:2302.00102v1 [cs.CL])
    Manipulated news online is a growing problem which necessitates the use of automated systems to curtail its spread. We argue that while misinformation and disinformation detection have been studied, there has been a lack of investment in the important open challenge of detecting harmful agendas in news articles; identifying harmful agendas is critical to flag news campaigns with the greatest potential for real world harm. Moreover, due to real concerns around censorship, harmful agenda detectors must be interpretable to be effective. In this work, we propose this new task and release a dataset, NewsAgendas, of annotated news articles for agenda identification. We show how interpretable systems can be effective on this task and demonstrate that they can perform comparably to black-box models.
    Free Lunch for Domain Adversarial Training: Environment Label Smoothing. (arXiv:2302.00194v1 [cs.LG])
    A fundamental challenge for machine learning models is how to generalize learned models to out-of-distribution (OOD) data. Among various approaches, exploiting invariant features by Domain Adversarial Training (DAT) has received widespread attention. Despite its success, we observe training instability from DAT, mostly due to an over-confident domain discriminator and environment label noise. To address this issue, we propose Environment Label Smoothing (ELS), which encourages the discriminator to output soft probabilities, thereby reducing the confidence of the discriminator and alleviating the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS can improve training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS with DAT methods, we are able to yield state-of-the-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy.
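    A minimal sketch of label smoothing applied to environment (domain) labels follows. It assumes the usual smoothing rule, in which the true environment keeps probability gamma and the remainder is spread evenly over the other environments; the abstract does not spell out the exact rule, so the function names, the value of gamma, and the logits are illustrative assumptions.

```python
import math


def smooth_env_label(env_index, num_envs, gamma):
    """Soft target: gamma on the true environment, the rest shared equally."""
    others = (1.0 - gamma) / (num_envs - 1)
    return [gamma if i == env_index else others for i in range(num_envs)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a soft target and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


# Hypothetical over-confident discriminator output for a sample from environment 0.
logits = [4.0, 0.0, 0.0]

hard = smooth_env_label(0, 3, 1.0)   # gamma = 1 recovers the one-hot label
soft = smooth_env_label(0, 3, 0.8)   # smoothed environment label

loss_hard = soft_cross_entropy(logits, hard)
loss_soft = soft_cross_entropy(logits, soft)
```

    With the smoothed target, the confident prediction is penalized more than under the one-hot label, which is the mechanism that discourages an over-confident discriminator.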
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound to show that for a mixture of spherical Gaussians $P$ (an sGMM) in a pre-defined model class, every other sGMM in this model class that is close to $P$ in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results considerably tighten the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.
    Knowledge Distillation on Graphs: A Survey. (arXiv:2302.00219v1 [cs.LG])
    Graph Neural Networks (GNNs) have attracted tremendous attention by demonstrating their capability to handle graph data. However, they are difficult to deploy on resource-limited devices due to model sizes and scalability constraints imposed by the multi-hop data dependency. In addition, real-world graphs usually possess complex structural information and features. Therefore, to improve the applicability of GNNs and fully encode the complicated topological information, knowledge distillation on graphs (KDG) has been introduced to build a smaller yet effective model and exploit more knowledge from data, leading to model compression and performance improvement. Recently, KDG has achieved considerable progress with many studies proposed. In this survey, we systematically review these works. Specifically, we first introduce KDG challenges and bases, then categorize and summarize existing works of KDG by answering the following three questions: 1) what to distill, 2) who to whom, and 3) how to distill. Finally, we share our thoughts on future research directions.
    Towards Label-Efficient Incremental Learning: A Survey. (arXiv:2302.00353v1 [cs.LG])
    The current dominant paradigm when building a machine learning model is to iterate over a dataset over and over until convergence. Such an approach is non-incremental, as it assumes access to all images of all categories at once. However, for many applications, non-incremental learning is unrealistic. To that end, researchers study incremental learning, where a learner is required to adapt to an incoming stream of data with a varying distribution while preventing forgetting of past knowledge. Significant progress has been made; however, the vast majority of works focus on the fully supervised setting, making these algorithms label-hungry and thus limiting their real-life deployment. To that end, in this paper, we make the first attempt to survey recently growing interest in label-efficient incremental learning. We identify three subdivisions, namely semi-, few-shot- and self-supervised learning to reduce labeling efforts. Finally, we identify novel directions that can further enhance label-efficiency and improve incremental learning scalability. Project website: https://github.com/kilickaya/label-efficient-il.
    Density peak clustering using tensor network. (arXiv:2302.00192v1 [cs.LG])
    Tensor networks, which have been traditionally used to simulate many-body physics, have recently gained significant attention in the field of machine learning due to their powerful representation capabilities. In this work, we propose a density-based clustering algorithm inspired by tensor networks. We encode classical data into tensor network states on an extended Hilbert space and train the tensor network states to capture the features of the clusters. Here, we define density and related concepts in terms of fidelity, rather than using a classical distance measure. We evaluate the performance of our algorithm on six synthetic data sets, four real world data sets, and three commonly used computer vision data sets. The results demonstrate that our method provides state-of-the-art performance on several synthetic data sets and real world data sets, even when the number of clusters is unknown. Additionally, our algorithm performs competitively with state-of-the-art algorithms on the MNIST, USPS, and Fashion-MNIST image data sets. These findings reveal the great potential of tensor networks for machine learning applications.
    Bridging Physics-Informed Neural Networks with Reinforcement Learning: Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO). (arXiv:2302.00237v1 [cs.LG])
    This paper introduces the Hamilton-Jacobi-Bellman Proximal Policy Optimization (HJBPPO) algorithm into reinforcement learning. The Hamilton-Jacobi-Bellman (HJB) equation is used in control theory to evaluate the optimality of the value function. Our work combines the HJB equation with reinforcement learning in continuous state and action spaces to improve the training of the value network. We treat the value network as a Physics-Informed Neural Network (PINN) to solve for the HJB equation by computing its derivatives with respect to its inputs exactly. The Proximal Policy Optimization (PPO)-Clipped algorithm is improved with this implementation, as it uses a value network to compute the objective function for its policy network. The HJBPPO algorithm shows improved performance compared to PPO on the MuJoCo environments.
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, the training process often finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that has received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither the minimum $\ell_1$ nor the minimum $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of an implicit regularization effect that goes beyond norm minimization.
    W2SAT: Learning to generate SAT instances from Weighted Literal Incidence Graphs. (arXiv:2302.00272v1 [cs.LG])
    The Boolean Satisfiability (SAT) problem stands out as an attractive NP-complete problem in theoretical computer science and plays a central role in a broad spectrum of computing-related applications. Exploiting and tuning SAT solvers under numerous scenarios requires massive numbers of high-quality industry-level SAT instances, which unfortunately are quite limited in the real world. To address the data insufficiency issue, in this paper, we propose W2SAT, a framework to generate SAT formulas by learning intrinsic structures and properties from given real-world/industrial instances in an implicit fashion. To this end, we introduce a novel SAT representation called Weighted Literal Incidence Graph (WLIG), which exhibits strong representation ability and generalizability against existing counterparts, and can be efficiently generated via a specialized learning-based graph generative model. Decoding from WLIGs into SAT problems is then modeled as finding overlapping cliques with a novel hill-climbing optimization method termed Optimal Weight Coverage (OWC). Experiments demonstrate the superiority of our WLIG-induced approach in terms of graph metrics, efficiency, and scalability in comparison to previous methods. Additionally, we discuss the limitations of graph-based SAT generation for real-world applications, especially when utilizing generated instances for SAT solver parameter-tuning, and pose some potential directions.
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representations in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such properties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.
    Deep Active Learning for Scientific Computing in the Wild. (arXiv:2302.00098v1 [cs.LG])
    Deep learning (DL) is revolutionizing the scientific computing community. To reduce the data gap caused by usually expensive simulations or experimentation, active learning has been identified as a promising solution for the scientific computing community. However, the deep active learning (DAL) literature is currently dominated by image classification problems and pool-based methods, which are not directly transferable to scientific computing problems, dominated by regression problems with no pre-defined 'pool' of unlabeled data. Here for the first time, we investigate the robustness of DAL methods for scientific computing problems using ten state-of-the-art DAL methods and eight benchmark problems. We show that, to our surprise, the majority of the DAL methods are not robust even compared to random sampling when the ideal pool size is unknown. We further analyze the effectiveness and robustness of DAL methods and suggest that diversity is necessary for a robust DAL for scientific computing problems.
    Stroke-based Rendering: From Heuristics to Deep Learning. (arXiv:2302.00595v1 [cs.CV])
    In the last few years, artistic image-making with deep learning models has gained a considerable amount of traction. A large number of these models operate directly in the pixel space and generate raster images. This is however not how most humans would produce artworks, for example, by planning a sequence of shapes and strokes to draw. Recent developments in deep learning methods help to bridge the gap between stroke-based paintings and pixel photo generation. With this survey, we aim to provide a structured introduction and understanding of common challenges and approaches in stroke-based rendering algorithms. These algorithms range from simple rule-based heuristics to stroke optimization and deep reinforcement agents, trained to paint images with differentiable vector graphics and neural rendering.
    Dynamic Flows on Curved Space Generated by Labeled Data. (arXiv:2302.00061v1 [cs.LG])
    The scarcity of labeled data is a long-standing challenge for many machine learning tasks. We propose our gradient flow method to leverage the existing dataset (i.e., source) to generate new samples that are close to the dataset of interest (i.e., target). We lift both datasets to the space of probability distributions on the feature-Gaussian manifold, and then develop a gradient flow method that minimizes the maximum mean discrepancy loss. To perform the gradient flow of distributions on the curved feature-Gaussian space, we unravel the Riemannian structure of the space and compute explicitly the Riemannian gradient of the loss function induced by the optimal transport metric. For practical applications, we also propose a discretized flow, and provide conditional results guaranteeing the global convergence of the flow to the optimum. We illustrate the results of our proposed gradient flow method on several real-world datasets and show our method can improve the accuracy of classification models in transfer learning settings.
    Active Uncertainty Reduction for Safe and Efficient Interaction Planning: A Shielding-Aware Dual Control Approach. (arXiv:2302.00171v1 [cs.RO])
    The ability to accurately predict the opponent's behavior is central to the safety and efficiency of robotic systems in interactive settings, such as human-robot interaction and multi-robot teaming tasks. Unfortunately, robots often lack access to key information on which these predictions may hinge, such as the opponent's goals, attention, and willingness to cooperate. Dual control theory addresses this challenge by treating unknown parameters of a predictive model as hidden states and inferring their values at runtime using information gathered during system operation. While able to optimally and automatically trade off exploration and exploitation, dual control is computationally intractable for general interactive motion planning. In this paper, we present a novel algorithmic approach to enable active uncertainty reduction for interactive motion planning based on the implicit dual control paradigm. Our approach relies on sampling-based approximation of stochastic dynamic programming, leading to a model predictive control problem. The resulting policy is shown to preserve the dual control effect for a broad class of predictive models with both continuous and categorical uncertainty. To ensure the safe operation of the interacting agents, we leverage a supervisory control scheme, oftentimes referred to as "shielding", which overrides the ego agent's dual control policy with a safety fallback strategy when a safety-critical event is imminent. We then augment the dual control framework with an improved variant of the recently proposed shielding-aware robust planning scheme, which proactively balances the nominal planning performance with the risk of high-cost emergency maneuvers triggered by low-probability opponent behaviors. We demonstrate the efficacy of our approach with both simulated driving examples and hardware experiments using 1/10 scale autonomous vehicles.
    Probabilistic Point Cloud Modeling via Self-Organizing Gaussian Mixture Models. (arXiv:2302.00047v1 [cs.LG])
    This letter presents a continuous probabilistic modeling methodology for spatial point cloud data using finite Gaussian Mixture Models (GMMs) where the number of components is adapted based on the scene complexity. Few hierarchical and adaptive methods have been proposed to address the challenge of balancing model fidelity with size. Instead, state-of-the-art mapping approaches require tuning parameters for specific use cases, but do not generalize across diverse environments. To address this gap, we utilize a self-organizing principle from information-theoretic learning to automatically adapt the complexity of the GMM based on the relevant information in the sensor data. The approach is evaluated against existing point cloud modeling techniques on real-world data with varying degrees of scene complexity.
    Debiasing Vision-Language Models via Biased Prompts. (arXiv:2302.00070v1 [cs.LG])
    Machine learning models have been shown to inherit biases from their training datasets, which can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. In particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. The closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
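    The core operation of projecting a biased direction out of an embedding can be sketched as follows. This shows only the plain orthogonal projection onto the complement of a single direction; the calibrated projection matrix described in the abstract is not reproduced, and the vectors and function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(embedding, bias_direction):
    """Remove the component of `embedding` that lies along `bias_direction`.

    Equivalent to applying P = I - v v^T / (v^T v) to the embedding.
    """
    scale = dot(embedding, bias_direction) / dot(bias_direction, bias_direction)
    return [e - scale * b for e, b in zip(embedding, bias_direction)]


# Hypothetical 3-dimensional text embedding and bias direction.
text_embedding = [0.6, 0.2, -0.4]
bias_direction = [1.0, 0.0, 1.0]

debiased = project_out(text_embedding, bias_direction)
```

    After projection the embedding has no component along the chosen direction, while the components orthogonal to it are left unchanged. Because the operation is a fixed linear map, it can be applied to text embeddings without any additional training.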
    Online Learning in Dynamically Changing Environments. (arXiv:2302.00103v1 [cs.LG])
    We study the problem of online learning and online regret minimization when samples are drawn from a general unknown non-stationary process. We introduce the concept of a dynamic changing process with cost $K$, where the conditional marginals of the process can vary arbitrarily, but the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (up to $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\mathsf{VC}(\mathcal{H})\log T})$ for the expected worst case regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected misclassification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\mathsf{VC}(\mathcal{H})\log^3 T)$. We extend these results to general smooth adversary processes with unknown reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the distribution blind (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.
    ezDPS: An Efficient and Zero-Knowledge Machine Learning Inference Pipeline. (arXiv:2212.05428v2 [cs.CR] UPDATED)
    Machine Learning as a service (MLaaS) permits resource-limited clients to access powerful data analytics services ubiquitously. Despite its merits, MLaaS poses significant concerns regarding the integrity of delegated computation and the privacy of the server's model parameters. To address this issue, Zhang et al. (CCS'20) initiated the study of zero-knowledge Machine Learning (zkML). A few zkML schemes have been proposed afterward; however, they focus on single ML classification algorithms that may not offer satisfactory accuracy or require large-scale training data and model parameters, which may not be desirable for some applications. We propose ezDPS, a new efficient and zero-knowledge ML inference scheme. Unlike prior works, ezDPS is a zkML pipeline in which the data is processed in multiple stages for high accuracy. Each stage of ezDPS is harnessed with an established ML algorithm that is shown to be effective in various applications, including Discrete Wavelet Transformation, Principal Components Analysis, and Support Vector Machine. We design new gadgets to prove ML operations effectively. We fully implemented ezDPS and assessed its performance on real datasets. Experimental results showed that ezDPS is one to three orders of magnitude more efficient than the generic circuit-based approach in all metrics while maintaining more desirable accuracy than single ML classification approaches.
    Truthful Incentive Mechanism for Federated Learning with Crowdsourced Data Labeling. (arXiv:2302.00106v1 [cs.LG])
    Federated learning (FL) has emerged as a promising paradigm that trains machine learning (ML) models on clients' devices in a distributed manner without the need of transmitting clients' data to the FL server. In many applications of ML, the labels of training data need to be generated manually by human agents. In this paper, we study FL with crowdsourced data labeling where the local data of each participating client of FL are labeled manually by the client. We consider the strategic behavior of clients who may not make desired effort in their local data labeling and local model computation and may misreport their local models to the FL server. We characterize the performance bounds on the training loss as a function of clients' data labeling effort, local computation effort, and reported local models. We devise truthful incentive mechanisms which incentivize strategic clients to make truthful efforts and report true local models to the server. The truthful design exploits the non-trivial dependence of the training loss on clients' efforts and local models. Under the truthful mechanisms, we characterize the server's optimal local computation effort assignments. We evaluate the proposed FL algorithms with crowdsourced data labeling and the incentive mechanisms using experiments.
    ADAPT : Awesome Domain Adaptation Python Toolbox. (arXiv:2107.03049v2 [cs.LG] UPDATED)
    In this paper, we introduce the ADAPT library, an open source Python API providing implementations of the main transfer learning and domain adaptation methods. The library is designed with a user-friendly approach to make domain adaptation accessible to a wide audience. ADAPT is compatible with scikit-learn and TensorFlow, and full documentation is available online at https://adapt-python.github.io/adapt/ with a substantial gallery of examples.
    Learning Optimal Fair Classification Trees: Trade-offs Between Interpretability, Fairness, and Accuracy. (arXiv:2201.09932v3 [cs.LG] UPDATED)
    The increasing use of machine learning in high-stakes domains -- where people's livelihoods are impacted -- creates an urgent need for interpretable, fair, and highly accurate algorithms. With these needs in mind, we propose a mixed integer optimization (MIO) framework for learning optimal classification trees -- one of the most interpretable models -- that can be augmented with arbitrary fairness constraints. In order to better quantify the "price of interpretability", we also propose a new measure of model interpretability called decision complexity that allows for comparisons across different classes of machine learning models. We benchmark our method against state-of-the-art approaches for fair classification on popular datasets; in doing so, we conduct one of the first comprehensive analyses of the trade-offs between interpretability, fairness, and predictive accuracy. Given a fixed disparity threshold, our method has a price of interpretability of about 4.2 percentage points in terms of out-of-sample accuracy compared to the best performing, complex models. However, our method consistently finds decisions with almost full parity, while other methods rarely do.
    Multi-Grade Deep Learning. (arXiv:2302.00150v1 [cs.LG])
    The current deep learning model is single-grade, that is, it learns a deep neural network by solving a single nonconvex optimization problem. When the number of layers of the neural network is large, it is computationally challenging to carry out such a task efficiently. Inspired by the human education process, which arranges learning in grades, we propose a multi-grade learning model: We successively solve a number of small optimization problems, which are organized in grades, to learn a shallow neural network for each grade. Specifically, the current grade learns the leftover from the previous grade. In each of the grades, we learn a shallow neural network stacked on top of the neural network learned in the previous grades, which remains unchanged in training of the current and future grades. By dividing the task of learning a deep neural network into learning several shallow neural networks, one can alleviate the severity of the nonconvexity of the original large optimization problem. When all grades of the learning are completed, the final neural network learned is a stair-shape neural network, which is the superposition of networks learned from all grades. Such a model enables us to learn a deep neural network much more effectively and efficiently. Moreover, multi-grade learning naturally leads to adaptive learning. We prove that in the context of function approximation, if the neural network generated by a new grade is nontrivial, the optimal error of the grade is strictly reduced from the optimal error of the previous grade. Furthermore, we provide several proof-of-concept numerical examples which demonstrate that the proposed multi-grade model significantly outperforms the traditional single-grade model and is much more robust than the traditional model.
    FLSTRA: Federated Learning in Stratosphere. (arXiv:2302.00163v1 [cs.NI])
    We propose a federated learning (FL) in stratosphere (FLSTRA) system, where a high altitude platform station (HAPS) facilitates a large number of terrestrial clients to collaboratively learn a global model without sharing the training data. FLSTRA overcomes the challenges faced by FL in terrestrial networks, such as slow convergence and high communication delay due to limited client participation and multi-hop communications. HAPS leverages its altitude and size to allow the participation of more clients with line-of-sight (LoS) links and the placement of a powerful server. However, handling many clients at once introduces computing and transmission delays. Thus, we aim to obtain a delay-accuracy trade-off for FLSTRA. Specifically, we first develop a joint client selection and resource allocation algorithm for uplink and downlink to minimize the FL delay subject to the energy and quality-of-service (QoS) constraints. Second, we propose a communication and computation resource-aware (CCRA-FL) algorithm to achieve the target FL accuracy while deriving an upper bound for its convergence rate. The formulated problem is non-convex; thus, we propose an iterative algorithm to solve it. Simulation results demonstrate the effectiveness of the proposed FLSTRA system, compared to terrestrial benchmarks, in terms of FL delay and accuracy.
    Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training. (arXiv:2302.00286v1 [cs.SD])
    In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The joint training of the transcription and source separation modules serves to improve the performance of both tasks. The instrument module is optional and can be directly controlled by human users. This makes Jointist a flexible user-controllable framework. Our challenging problem formulation makes the model highly useful in the real world given that modern popular music typically consists of multiple instruments. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist.
    Epic-Sounds: A Large-scale Dataset of Actions That Sound. (arXiv:2302.00646v1 [cs.SD])
    We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound.
    Transformers Meet Directed Graphs. (arXiv:2302.00049v1 [cs.LG])
    Transformers were originally proposed as a sequence-to-sequence model for text but have become vital for a wide range of modalities, including images, audio, video, and undirected graphs. However, transformers for directed graphs are a surprisingly underexplored topic, despite their applicability to ubiquitous domains including source code and logic circuits. In this work, we propose two direction- and structure-aware positional encodings for directed graphs: (1) the eigenvectors of the Magnetic Laplacian - a direction-aware generalization of the combinatorial Laplacian; (2) directional random walk encodings. Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 relatively by 14.7%.
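    The Magnetic Laplacian mentioned above can be illustrated with a small sketch. The definition used here (symmetrised adjacency, a complex phase set by a hypothetical potential parameter `q`) is one common convention and is an assumption; the paper may normalise or symmetrise differently. The function name `magnetic_laplacian` is illustrative only.

```python
import numpy as np

def magnetic_laplacian(adj, q=0.25):
    """Unnormalised Magnetic Laplacian of a directed graph (one common convention).

    adj: square 0/1 adjacency matrix, adj[i, j] = 1 for an edge i -> j.
    q:   potential parameter (assumed name); q = 0 ignores edge direction.
    """
    adj = np.asarray(adj, dtype=float)
    sym = 0.5 * (adj + adj.T)                 # symmetrised edge weights
    phase = 2.0 * np.pi * q * (adj - adj.T)   # antisymmetric, encodes direction
    degree = np.diag(sym.sum(axis=1))
    return degree - sym * np.exp(1j * phase)  # Hermitian by construction

# Directed 3-cycle: 0 -> 1 -> 2 -> 0
adj = np.array([[0, 1, 0],
                [0, 0, 1],
                [1, 0, 0]])
lap = magnetic_laplacian(adj)
# Hermitian matrices have real eigenvalues; the complex eigenvectors
# are what would serve as direction-aware positional encodings.
eigvals, eigvecs = np.linalg.eigh(lap)
```

    With `q = 0` the phase vanishes and the matrix reduces to the combinatorial Laplacian of the symmetrised graph, which is the sense in which it generalises that operator.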
    Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. (arXiv:2208.05516v4 [cs.LG] UPDATED)
    Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
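    The threshold decision rule described above can be sketched as follows. This is a minimal illustration that assumes the probabilities are perfectly calibrated and the input is non-empty; the function name `smallest_shortlist` and its return format are hypothetical, and the sketch does not implement the paper's within-group monotonicity post-processing.

```python
import numpy as np

def smallest_shortlist(probabilities, target_qualified):
    """Smallest top-ranked set whose expected number of qualified candidates
    reaches target_qualified, assuming calibrated probabilities."""
    probs = np.asarray(probabilities, dtype=float)
    order = np.argsort(-probs, kind="stable")   # highest probability first
    expected = np.cumsum(probs[order])          # expected qualified in each top-k set
    reached = np.nonzero(expected >= target_qualified)[0]
    # If the target is unreachable, fall back to shortlisting everyone.
    size = int(reached[0]) + 1 if reached.size else len(probs)
    threshold = float(probs[order[size - 1]])   # equivalent probability threshold
    return order[:size].tolist(), threshold

shortlist, threshold = smallest_shortlist([0.9, 0.2, 0.8, 0.5], target_qualified=1.5)
```

    Here the two highest-scoring candidates already contribute 1.7 expected qualified candidates, so the shortlist stops there and the implied threshold is the lowest probability admitted.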
    Towards Answering Open-ended Ethical Quandary Questions. (arXiv:2205.05989v3 [cs.CL] UPDATED)
    Considerable advancements have been made in various NLP tasks based on the impressive power of large language models (LLMs) and many NLP applications are deployed in our daily lives. In this work, we challenge the capability of LLMs with the new task of Ethical Quandary Generative Question Answering. Ethical quandary questions are more challenging to address because multiple conflicting answers may exist to a single quandary. We explore the current capability of LLMs in providing an answer with a deliberative exchange of different perspectives to an ethical quandary, in the approach of Socratic philosophy, instead of providing a closed answer like an oracle. We propose a model that searches for different ethical principles applicable to the ethical quandary and generates an answer conditioned on the chosen principles through prompt-based few-shot learning. We also discuss the remaining challenges and ethical issues involved in this task and suggest the direction toward developing responsible NLP systems by incorporating human values explicitly.
    HOAX: A Hyperparameter Optimization Algorithm Explorer for Neural Networks. (arXiv:2302.00374v1 [physics.chem-ph])
    Computational chemistry has become an important tool to predict and understand molecular properties and reactions. Even though recent years have seen a significant growth in new algorithms and computational methods that speed up quantum chemical calculations, the bottleneck for trajectory-based methods to study photoinduced processes is still the huge number of electronic structure calculations. In this work, we present an innovative solution, in which the number of electronic structure calculations is drastically reduced, by employing machine learning algorithms and methods borrowed from the realm of artificial intelligence. However, applying these algorithms effectively requires finding optimal hyperparameters, which remains a challenge in itself. Here we present an automated user-friendly framework, HOAX, to perform the hyperparameter optimization for neural networks, which bypasses the need for a lengthy manual process. Neural-network-generated potential energy surfaces (PESs) reduce the computational costs compared to the ab initio-based PESs. We perform a comparative investigation of the performance of different hyperparameter optimization algorithms, namely grid search, simulated annealing, genetic algorithm, and Bayesian optimizer, in finding the optimal hyperparameters necessary for constructing a well-performing neural network to fit the PESs of small organic molecules. Our results show that this automated toolkit not only facilitates a straightforward way to perform the hyperparameter optimization, but also yields neural-network-generated PESs that are in reasonable agreement with the ab initio-based PESs.
    Generating High Fidelity Synthetic Data via Coreset selection and Entropic Regularization. (arXiv:2302.00138v1 [cs.LG])
    Generative models have the ability to synthesize data points drawn from the data distribution; however, not all generated samples are of high quality. In this paper, we propose using a combination of coreset selection methods and "entropic regularization" to select the highest fidelity samples. We leverage an Energy-Based Model which resembles a variational auto-encoder with an inference and generator model for which the latent prior is complexified by an energy-based model. In a semi-supervised learning scenario, we show that augmenting the labeled dataset with our selected subset of samples leads to a better accuracy improvement than using all the synthetic samples.
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct modes of failure, such as mode dropping and low-quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising $f$-divergences from a family we call the PR-divergences. Conversely, any $f$-divergence can be written as a linear combination of PR-divergences and therefore corresponds to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any $f$-divergence, and in particular, achieve a given precision-recall trade-off.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly and often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    QLAB: Quadratic Loss Approximation-Based Optimal Learning Rate for Deep Learning. (arXiv:2302.00252v1 [cs.LG])
    We propose a learning rate adaptation scheme, called QLAB, for descent optimizers. We derive QLAB by optimizing the quadratic approximation of the loss function, and QLAB can be combined with any optimizer that can provide the descent update direction. The computation of an adaptive learning rate with QLAB requires only computing an extra loss function value. We theoretically prove the convergence of the descent optimizers with QLAB. We demonstrate the effectiveness of QLAB in a range of optimization problems by combining it with stochastic gradient descent, stochastic gradient descent with momentum, and Adam. The performance is validated on multi-layer neural networks, CNN, VGG-Net, ResNet and ShuffleNet with two datasets, MNIST and CIFAR10.
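    To make the idea of choosing a learning rate from a quadratic approximation concrete, here is a generic sketch of quadratic interpolation along the update direction. It is not the QLAB formula from the paper: the function name `quadratic_step_size`, the `probe` step, and the fallback rule are all assumptions made for illustration. In a training loop the current loss is usually already known, so only the probe evaluation is extra.

```python
import numpy as np

def quadratic_step_size(loss_fn, theta, grad, direction, probe=0.1):
    """Step size minimising a 1-D quadratic model of a -> loss(theta - a * direction)."""
    loss0 = loss_fn(theta)                      # usually already available in training
    slope = -float(grad @ direction)            # derivative of the 1-D loss at a = 0
    loss1 = loss_fn(theta - probe * direction)  # the single extra loss evaluation
    curvature = (loss1 - loss0 - slope * probe) / probe**2
    if curvature <= 0.0 or slope >= 0.0:
        return probe                            # model has no usable minimum; keep probe step
    return -slope / (2.0 * curvature)

# Toy quadratic loss 0.5 * theta^T A theta, with plain gradient descent direction.
A = np.diag([1.0, 4.0])
loss_fn = lambda t: 0.5 * float(t @ A @ t)
theta = np.array([1.0, 1.0])
grad = A @ theta
step = quadratic_step_size(loss_fn, theta, grad, direction=grad)
```

    On an exactly quadratic loss the one-dimensional model is exact, so the returned step coincides with the exact line-search minimiser along the gradient; on real networks it is only an approximation.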
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias (SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one-hidden-layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low-dimensional projection of the inputs; (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier; (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low-dimensional projection of the inputs, thereby demonstrating SB on these datasets; (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Electrode Selection for Noninvasive Fetal Electrocardiogram Extraction using Mutual Information Criteria. (arXiv:2302.00206v1 [eess.SP])
    Blind source separation (BSS) techniques have proven to be promising approaches for, among others, biomedical signal processing applications, specifically for the noninvasive extraction of fetal cardiac signals from maternal abdominal recordings, where conventional filtering schemes have failed to extract the complete fetal ECG components. From previous studies, it is now believed that a carefully selected array of electrodes well-placed over the abdomen of a pregnant woman contains the required `information' for BSS to extract the complete fetal components. Based on this idea, array recording systems and sensor selection strategies based on the Mutual Information (MI) criterion have been developed in previous works. In this paper, the previous works are extended by considering the 3-dimensional aspects of the cardiac electrical activity. The proposed method has been tested on simulated and real maternal abdominal recordings. The results show that the new sensor selection strategy together with the MI criterion can be effectively used to select the channels containing the most `information' concerning the fetal ECG components from an array of 72 recordings. The method is hence believed to be useful for the selection of the most informative channels in online applications, considering the different fetal positions and movements.
    Iterative Deepening Hyperband. (arXiv:2302.00511v1 [cs.LG])
    Hyperparameter optimization (HPO) is concerned with the automated search for the most appropriate hyperparameter configuration (HPC) of a parameterized machine learning algorithm. A state-of-the-art HPO method is Hyperband, which, however, has its own parameters that influence its performance. One of these parameters, the maximal budget, is especially problematic: If chosen too small, the budget needs to be increased in hindsight and, as Hyperband is not incremental by design, the entire algorithm must be re-run. This is not only costly but also comes with a loss of valuable knowledge already accumulated. In this paper, we propose incremental variants of Hyperband that eliminate these drawbacks, and show that these variants satisfy theoretical guarantees qualitatively similar to those for the original Hyperband with the "right" budget. Moreover, we demonstrate their practical utility in experiments with benchmark data sets.
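    The role of the maximal budget is easiest to see in successive halving, the subroutine that Hyperband runs repeatedly. The sketch below is a simplified, non-incremental version written for illustration; the function name, the `evaluate(config, budget)` callback returning a loss, and the toy objective are all assumptions and are not taken from the paper.

```python
def successive_halving(configs, evaluate, min_budget, max_budget, eta=3):
    """Keep the best 1/eta of the configurations each round while
    multiplying the per-configuration budget by eta, up to max_budget."""
    survivors = list(configs)
    budget = min_budget
    while True:
        # Lower evaluate(...) is better; ties keep their original order.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        if budget >= max_budget or len(ranked) == 1:
            return ranked[0], budget
        survivors = ranked[:max(1, len(ranked) // eta)]
        budget = min(budget * eta, max_budget)

# Toy objective: the loss shrinks with budget and is smallest for config 4.
toy_loss = lambda config, budget: abs(config - 4) + 1.0 / budget
best, final_budget = successive_halving(range(9), toy_loss, min_budget=1, max_budget=9)
```

    Because `max_budget` fixes the whole elimination schedule up front, raising it afterwards means starting this loop again from scratch, which is the cost the incremental variants described above aim to avoid.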
    Decompositional Generation Process for Instance-Dependent Partial Label Learning. (arXiv:2204.03845v3 [cs.LG] UPDATED)
    Partial label learning (PLL) is a typical weakly supervised learning problem, where each training example is associated with a set of candidate labels among which only one is true. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels and model the generation process of the candidate labels in a simple way. However, these approaches usually do not perform as well as expected because the generation process of the candidate labels is always instance-dependent. Therefore, it deserves to be modeled in a refined way. In this paper, we consider instance-dependent PLL and assume that the generation process of the candidate labels can be decomposed into two sequential parts, where the correct label emerges first in the mind of the annotator but then the incorrect labels related to the feature are also selected with the correct label as candidate labels due to uncertainty of labeling. Motivated by this consideration, we propose a novel PLL method that performs Maximum A Posteriori (MAP) estimation based on an explicitly modeled generation process of candidate labels via decomposed probability distribution models. Extensive experiments on manually corrupted benchmark datasets and real-world datasets validate the effectiveness of the proposed method. Source code is available at https://github.com/palm-ml/idgp.
    SPIDE: A Purely Spike-based Method for Training Feedback Spiking Neural Networks. (arXiv:2302.00232v1 [cs.NE])
    Spiking neural networks (SNNs) with event-based computation are promising brain-inspired models for energy-efficient applications on neuromorphic hardware. However, most supervised SNN training methods, such as conversion from artificial neural networks or direct training with surrogate gradients, require complex computation rather than spike-based operations of spiking neurons during training. In this paper, we study spike-based implicit differentiation on the equilibrium state (SPIDE) that extends the recently proposed training method, implicit differentiation on the equilibrium state (IDE), for supervised learning with purely spike-based computation, which demonstrates the potential for energy-efficient training of SNNs. Specifically, we introduce ternary spiking neuron couples and prove that implicit differentiation can be solved by spikes based on this design, so the whole training procedure, including both forward and backward passes, is made as event-driven spike computation, and weights are updated locally with two-stage average firing rates. Then we propose to modify the reset membrane potential to reduce the approximation error of spikes. With these key components, we can train SNNs with flexible structures in a small number of time steps and with firing sparsity during training, and the theoretical estimation of energy costs demonstrates the potential for high efficiency. Meanwhile, experiments show that even with these constraints, our trained models can still achieve competitive results on MNIST, CIFAR-10, CIFAR-100, and CIFAR10-DVS. Our code is available at https://github.com/pkuxmq/SPIDE-FSNN.
    Dictionary-based Manifold Learning. (arXiv:2302.00263v1 [cs.LG])
    We propose a paradigm for interpretable Manifold Learning for scientific data analysis, whereby we parametrize a manifold with $d$ smooth functions from a scientist-provided dictionary of meaningful, domain-related functions. When such a parametrization exists, we provide an algorithm for finding it based on sparse non-linear regression in the manifold tangent bundle, bypassing more standard manifold learning algorithms. We also discuss conditions for the existence of such parameterizations in function space and for successful recovery from finite samples. We demonstrate our method with experimental results from a real scientific domain.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called sequential predictive conformal inference (SPCI). We specifically account for the fact that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotically valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of SPCI compared to other existing methods under the desired empirical coverage.
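    As a point of reference for the residual-quantile idea, the sketch below builds an interval from empirical quantiles of recent residuals. This is a deliberately simplified stand-in: SPCI as described predicts the quantile of the future residual with quantile regression, whereas this version only takes empirical quantiles over a sliding window. The function name and the `window` parameter are hypothetical.

```python
import numpy as np

def residual_quantile_interval(point_prediction, past_residuals, alpha=0.1, window=100):
    """Prediction interval = point prediction + empirical residual quantiles.

    past_residuals: observed (y - y_hat) values, oldest first.
    alpha:          target miscoverage level, split equally between both tails.
    """
    recent = np.asarray(past_residuals, dtype=float)[-window:]
    lower = np.quantile(recent, alpha / 2.0)
    upper = np.quantile(recent, 1.0 - alpha / 2.0)
    return point_prediction + lower, point_prediction + upper

residuals = np.arange(-5, 6)   # toy residual history: -5, -4, ..., 5
low, high = residual_quantile_interval(10.0, residuals, alpha=0.2)
```

    Replacing the two `np.quantile` calls with a model that predicts those quantiles from lagged residuals is the step that would exploit temporal dependence.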
    End-to-End Full-Atom Antibody Design. (arXiv:2302.00203v1 [q-bio.BM])
    Antibody design is an essential yet challenging task in various domains like therapeutics and biology. There are two major defects in current learning-based methods: 1) tackling only a certain subtask of the whole antibody design pipeline, making them suboptimal or resource-intensive. 2) omitting either the framework regions or side chains, thus incapable of capturing the full-atom geometry. To address these pitfalls, we propose dynamic Multi-channel Equivariant grAph Network (dyMEAN), an end-to-end full-atom model for E(3)-equivariant antibody design given the epitope and the incomplete sequence of the antibody. Specifically, we first explore structural initialization as a knowledgeable guess of the antibody structure and then propose shadow paratope to bridge the epitope-antibody connections. Both 1D sequences and 3D structures are updated via an adaptive multi-channel equivariant encoder that is able to process protein residues of variable sizes when considering full atoms. Finally, the updated antibody is docked to the epitope via the alignment of the shadow paratope. Experiments on epitope-binding CDR-H3 design, complex structure prediction, and affinity optimization demonstrate the superiority of our end-to-end framework and full-atom modeling.
    What's Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper, we consider the general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations is known). We introduce a generative model of the location omissions and propose two learning methods for this model: a (semi-)analytic approach and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), results in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions, which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
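    The effect of the $1/\sqrt{depth}$ branch scaling can be illustrated with a small simulation. The sketch below is not the paper's experiment; it assumes a ReLU residual block of the form x <- x + scale * W relu(x) with i.i.d. Gaussian weights of variance 1/width, and the function name and sizes are chosen only for illustration.

```python
import math
import random


def final_second_moment(width, depth, branch_scale, seed=0):
    """Mean squared activation after `depth` random residual blocks."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = 1.0 / math.sqrt(width)  # weight std, so Var(W_ij) = 1 / width
    for _ in range(depth):
        act = [max(v, 0.0) for v in x]  # ReLU (an assumption of this sketch)
        # Fresh Gaussian weights per layer; one output coordinate per row.
        branch = [sum(rng.gauss(0.0, std) * a for a in act)
                  for _ in range(width)]
        x = [v + branch_scale * b for v, b in zip(x, branch)]
    return sum(v * v for v in x) / width


depth = 32
scaled = final_second_moment(64, depth, 1.0 / math.sqrt(depth))
unscaled = final_second_moment(64, depth, 1.0)
```

    Under these assumptions the scaled network keeps its second moment of order one, while the unscaled one grows roughly geometrically with depth, which is one way to see why the $1/\sqrt{depth}$ choice is singled out.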
    Exploring Semantic Perturbations on Grover. (arXiv:2302.00509v1 [cs.LG])
    With news and information being as easy to access as they currently are, it is more important than ever to ensure that people are not misled by what they read. Recently, the rise of neural fake news (AI-generated fake news) and its demonstrated effectiveness at fooling humans has prompted the development of models to detect it. One such model is the Grover model, which can both detect neural fake news to prevent it, and generate it to demonstrate how a model could be misused to fool human readers. In this work we explore the Grover model's fake news detection capabilities by performing targeted attacks through perturbations on input news articles. Through this we test Grover's resilience to these adversarial attacks and expose some potential vulnerabilities which should be addressed in further iterations to ensure it can detect all types of fake news accurately.
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much-needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present a novel bound on uniform convergence where the iteration count and learning rate play a central role, giving new theoretical insight into neural network training.
    Molecular Graph Generation by Decomposition and Reassembling. (arXiv:2302.00587v1 [q-bio.BM])
    Designing molecular structures with desired chemical properties is an essential task in drug discovery and material design. However, finding molecules with the optimized desired properties is still a challenging task due to the combinatorial explosion of the candidate space of molecules. Here we propose a novel \emph{decomposition-and-reassembling} based approach, which does not include any optimization in hidden space and whose generation process is highly interpretable. Our method is a two-step procedure: In the first decomposition step, we apply frequent subgraph mining to a molecular database to collect smaller subgraphs as building blocks of molecules. In the second reassembling step, we search for desirable building blocks, guided by reinforcement learning, and combine them to generate new molecules. Our experiments show that our method can not only find better molecules in terms of two standard criteria, the penalized $\log P$ and drug-likeness, but also generate drug molecules while showing the valid intermediate molecules.
    Physics-informed Reduced-Order Learning from the First Principles for Simulation of Quantum Nanostructures. (arXiv:2302.00100v1 [cs.CE])
    Multi-dimensional direct numerical simulation (DNS) of the Schr\"odinger equation is needed for design and analysis of quantum nanostructures that offer numerous applications in biology, medicine, materials, electronic/photonic devices, etc. In large-scale nanostructures, extensive computational effort needed in DNS may become prohibitive due to the high degrees of freedom (DoF). This study employs a reduced-order learning algorithm, enabled by the first principles, for simulation of the Schr\"odinger equation to achieve high accuracy and efficiency. The proposed simulation methodology is applied to investigate two quantum-dot structures; one operates under external electric field, and the other is influenced by internal potential variation with periodic boundary conditions. The former is similar to typical operations of nanoelectronic devices, and the latter is of interest to simulation and design of nanostructures and materials, such as applications of density functional theory. Using the proposed methodology, a very accurate prediction can be realized with a reduction in the DoF by more than 3 orders of magnitude and in the computational time by 2 orders, compared to DNS. The proposed physics-informed learning methodology is also able to offer an accurate prediction beyond the training conditions, including higher external field and larger internal potential in untrained quantum states.
    Learning Functional Transduction. (arXiv:2302.00328v1 [cs.LG])
    Research in Machine Learning has polarized into two general regression approaches: Transductive methods derive estimates directly from available data but are usually problem unspecific. Inductive methods can be much more particular, but generally require tuning and compute-intensive searches for solutions. In this work, we adopt a hybrid approach: We leverage the theory of Reproducing Kernel Banach Spaces (RKBS) and show that transductive principles can be induced through gradient descent to form efficient \textit{in-context} neural approximators. We apply this approach to RKBS of function-valued operators and show that once trained, our \textit{Transducer} model can capture on-the-fly relationships between infinite-dimensional input and output functions, given a few example pairs, and return new function estimates. We demonstrate the benefit of our transductive approach to model complex physical systems influenced by varying external factors with little data at a fraction of the usual deep learning training computation cost for partial differential equations and climate modeling applications.
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer that is not limited to specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of the multi-head self-attention layer from a ViT model and use them as the content preservation loss. Then, an image-guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas an additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
    Learning noisy-OR Bayesian Networks with Max-Product Belief Propagation. (arXiv:2302.00099v1 [cs.LG])
    Noisy-OR Bayesian Networks (BNs) are a family of probabilistic graphical models which express rich statistical dependencies in binary data. Variational inference (VI) has been the main method proposed to learn noisy-OR BNs with complex latent structures (Jaakkola & Jordan, 1999; Ji et al., 2020; Buhai et al., 2020). However, the proposed VI approaches either (a) use a recognition network with standard amortized inference that cannot induce ``explaining-away''; or (b) assume a simple mean-field (MF) posterior which is vulnerable to bad local optima. Existing MF VI methods also update the MF parameters sequentially which makes them inherently slow. In this paper, we propose parallel max-product as an alternative algorithm for learning noisy-OR BNs with complex latent structures and we derive a fast stochastic training scheme that scales to large datasets. We evaluate both approaches on several benchmarks where VI is the state-of-the-art and show that our method (a) achieves better test performance than Ji et al. (2020) for learning noisy-OR BNs with hierarchical latent structures on large sparse real datasets; (b) recovers a higher number of ground truth parameters than Buhai et al. (2020) from cluttered synthetic scenes; and (c) solves the 2D blind deconvolution problem from Lazaro-Gredilla et al. (2021) and variant - including binary matrix factorization - while VI catastrophically fails and is up to two orders of magnitude slower.
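    As a reminder of the model class, the noisy-OR conditional and the ``explaining-away'' effect mentioned above can be reproduced by brute-force enumeration on a toy network. This sketch is illustrative only and is unrelated to the paper's max-product algorithm; the function names, the shared prior of 0.1, and the activation weights of 0.9 are assumptions chosen for the example.

```python
from itertools import product


def noisy_or(z, w, leak=0.0):
    """P(x = 1 | z) for a noisy-OR child with binary parents z.

    w[i] is the probability that an active parent i turns the child on;
    `leak` is the probability the child turns on with no active parent.
    """
    p_off = 1.0 - leak
    for z_i, w_i in zip(z, w):
        if z_i:
            p_off *= 1.0 - w_i
    return 1.0 - p_off


def posterior_z1(prior, w, fixed_z2=None):
    """P(z1 = 1 | x = 1) or P(z1 = 1 | x = 1, z2) by enumeration."""
    num = den = 0.0
    for z1, z2 in product([0, 1], repeat=2):
        if fixed_z2 is not None and z2 != fixed_z2:
            continue
        p_z = (prior if z1 else 1 - prior) * (prior if z2 else 1 - prior)
        joint = p_z * noisy_or([z1, z2], w)  # P(z1, z2, x = 1)
        den += joint
        num += joint * z1
    return num / den


p_cause = posterior_z1(0.1, [0.9, 0.9])                   # about 0.53
p_explained = posterior_z1(0.1, [0.9, 0.9], fixed_z2=1)   # about 0.11
```

    Observing that the second cause is active lowers the posterior of the first from roughly 0.53 to roughly 0.11; an inference scheme whose posterior factorizes over the parents regardless of the evidence cannot represent this coupling.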
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors; a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.
    How to select predictive models for causal inference? (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora's box of model selection: which of these models yields the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather, a more elaborate metric, the R-risk, appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Personalized Privacy Auditing and Optimization at Test Time. (arXiv:2302.00077v1 [cs.LG])
    A number of learning models used in consequential domains, such as to assist in legal, banking, hiring, and healthcare decisions, make use of potentially sensitive users' information to carry out inference. Further, the complete set of features is typically required to perform inference. This not only poses severe privacy risks for the individuals using the learning systems, but also requires massive human effort from companies and organizations to verify the correctness of the released information. This paper asks whether it is necessary to require \emph{all} input features for a model to return accurate predictions at test time and shows that, under a personalized setting, each individual may need to release only a small subset of these features without impacting the final decisions. The paper also provides an efficient sequential algorithm that chooses which attributes should be provided by each individual. Evaluation over several learning tasks shows that individuals may be able to report as little as 10\% of their information to ensure the same level of accuracy of a model that uses the complete users' information.
    Mind the (optimality) Gap: A Gap-Aware Learning Rate Scheduler for Adversarial Nets. (arXiv:2302.00089v1 [cs.LG])
    Adversarial nets have proved to be powerful in various domains including generative modeling (GANs), transfer learning, and fairness. However, successfully training adversarial nets using first-order methods remains a major challenge. Typically, careful choices of the learning rates are needed to maintain the delicate balance between the competing networks. In this paper, we design a novel learning rate scheduler that dynamically adapts the learning rate of the adversary to maintain the right balance. The scheduler is driven by the fact that the loss of an ideal adversarial net is a constant known a priori. The scheduler is thus designed to keep the loss of the optimized adversarial net close to that of an ideal network. We run large-scale experiments to study the effectiveness of the scheduler on two popular applications: GANs for image generation and adversarial nets for domain adaptation. Our experiments indicate that adversarial nets trained with the scheduler are less likely to diverge and require significantly less tuning. For example, on CelebA, a GAN with the scheduler requires only one-tenth of the tuning budget needed without a scheduler. Moreover, the scheduler leads to statistically significant improvements in model quality, reaching up to $27\%$ in Frechet Inception Distance for image generation and $3\%$ in test accuracy for domain adaptation.
    Neuromechanical Autoencoders: Learning to Couple Elastic and Neural Network Nonlinearity. (arXiv:2302.00032v1 [cs.LG])
    Intelligent biological systems are characterized by their embodiment in a complex environment and the intimate interplay between their nervous systems and the nonlinear mechanical properties of their bodies. This coordination, in which the dynamics of the motor system co-evolved to reduce the computational burden on the brain, is referred to as ``mechanical intelligence'' or ``morphological computation''. In this work, we seek to develop machine learning analogs of this process, in which we jointly learn the morphology of complex nonlinear elastic solids along with a deep neural network to control it. By using a specialized differentiable simulator of elastic mechanics coupled to conventional deep learning architectures -- which we refer to as neuromechanical autoencoders -- we are able to learn to perform morphological computation via gradient descent. Key to our approach is the use of mechanical metamaterials -- cellular solids, in particular -- as the morphological substrate. Just as deep neural networks provide flexible and massively-parametric function approximators for perceptual and control tasks, cellular solid metamaterials are promising as a rich and learnable space for approximating a variety of actuation tasks. In this work we take advantage of these complementary computational concepts to co-design materials and neural network controls to achieve nonintuitive mechanical behavior. We demonstrate in simulation how it is possible to achieve translation, rotation, and shape matching, as well as a ``digital MNIST'' task. We additionally manufacture and evaluate one of the designs to verify its real-world behavior.
    Hierarchical Classification of Research Fields in the "Web of Science" Using Deep Learning. (arXiv:2302.00390v1 [cs.DL])
    The scholarly publication space is growing steadily not just in numbers but also in complexity due to collaboration between individuals from within and across fields of research. This paper presents a hierarchical classification system that automatically categorizes a scholarly publication using its abstract into a three-tier hierarchical label set of fields (discipline-field-subfield). This system enables a holistic view about the interdependence of research activities in the mentioned hierarchical tiers in terms of knowledge production through articles and impact through citations. The classification system (44 disciplines - 738 fields - 1,501 subfields) utilizes and is able to cope with 160 million abstract snippets in Microsoft Academic Graph (Version 2018-05-17) using batch training in a modularized and distributed fashion to address and assess interdisciplinarity and inter-field classifications. In addition, we have explored multi-class classifications in both the single-label and multi-label settings. In total, we have conducted 3,140 experiments; across all models (Convolutional Neural Networks, Recurrent Neural Networks, Transformers), the classification accuracy is > 90% in 77.84% and 78.83% of the single-label and multi-label classifications, respectively. We examine the advantages of our classification by its ability to better align research texts and output with disciplines, to adequately classify them in an automated way, as well as to capture the degree of interdisciplinarity in a publication which enables downstream analytics such as field interdisciplinarity. This system (a set of pretrained models) can serve as a backbone to an interactive system of indexing scientific publications.
    Student-centric Model of Learning Management System Activity and Academic Performance: from Correlation to Causation. (arXiv:2210.15430v2 [cs.CY] UPDATED)
    In recent years, there has been a lot of interest in modeling students' digital traces in Learning Management Systems (LMS) to understand students' learning behavior patterns, including aspects of meta-cognition and self-regulation, with the ultimate goal of turning those insights into actionable information to support students in improving their learning outcomes. In achieving this goal, however, there are two main issues that need to be addressed given the existing literature. Firstly, most of the current work is course-centered (i.e. models are built from data for a specific course) rather than student-centered; secondly, a vast majority of the models are correlational rather than causal. Those issues make it challenging to identify the most promising actionable factors for intervention at the student level, for which most campus-wide academic support is designed. In this paper, we explored a student-centric analytical framework for LMS activity data that can provide not only correlational but also causal insights mined from observational data. We demonstrated this approach using a dataset of 1651 computing major students at a public university in the US during one semester in the Fall of 2019. This dataset includes students' fine-grained LMS interaction logs and administrative data, e.g. demographics and academic performance. In addition, we expand the repository of LMS behavior indicators to include those that can characterize the time-of-the-day of login (e.g. chronotype). Our analysis showed that student login volume, compared with other login behavior indicators, is both strongly correlated and causally linked to student academic performance, especially among students with low academic performance. We envision that those insights will provide convincing evidence for college student support groups to launch student-centered and targeted interventions that are effective and scalable.
    Effectiveness of Moving Target Defenses for Adversarial Attacks in ML-based Malware Detection. (arXiv:2302.00537v1 [cs.LG])
    Several moving target defenses (MTDs) to counter adversarial ML attacks have been proposed in recent years. MTDs claim to increase the difficulty for the attacker in conducting attacks by regularly changing certain elements of the defense, such as cycling through configurations. To examine these claims, we study for the first time the effectiveness of several recent MTDs for adversarial ML attacks applied to the malware detection domain. Under different threat models, we show that transferability and query attack strategies can achieve high levels of evasion against these defenses through existing and novel attack strategies across Android and Windows. We also show that fingerprinting and reconnaissance are possible and demonstrate how attackers may obtain critical defense hyperparameters as well as information about how predictions are produced. Based on our findings, we present key recommendations for future work on the development of effective MTDs for adversarial attacks in ML-based malware detection.
    Learning Generalized Zero-Shot Learners for Open-Domain Image Geolocalization. (arXiv:2302.00275v1 [cs.CV])
    Image geolocalization is the challenging task of predicting the geographic coordinates of origin for a given photo. It is an unsolved problem relying on the ability to combine visual clues with general knowledge about the world to make accurate predictions across geographies. We present $\href{https://huggingface.co/geolocal/StreetCLIP}{\text{StreetCLIP}}$, a robust, publicly available foundation model not only achieving state-of-the-art performance on multiple open-domain image geolocalization benchmarks but also doing so in a zero-shot setting, outperforming supervised models trained on more than 4 million images. Our method introduces a meta-learning approach for generalized zero-shot learning by pretraining CLIP from synthetic captions, grounding CLIP in a domain of choice. We show that our method effectively transfers CLIP's generalized zero-shot capabilities to the domain of image geolocalization, improving in-domain generalized zero-shot performance without finetuning StreetCLIP on a fixed set of classes.
    Fast Sampling of Diffusion Models via Operator Learning. (arXiv:2211.13449v2 [cs.LG] UPDATED)
    Diffusion models have found widespread adoption in various areas. However, their sampling process is slow because it requires hundreds to thousands of network evaluations to emulate a continuous process defined by differential equations. In this work, we use neural operators, an efficient method to solve the probability flow differential equations, to accelerate the sampling process of diffusion models. Compared to other fast sampling methods that have a sequential nature, we are the first to propose a parallel decoding method that generates images with only one model forward pass. We propose \textit{diffusion model sampling with neural operator} (DSNO) that maps the initial condition, i.e., Gaussian distribution, to the continuous-time solution trajectory of the reverse diffusion process. To model the temporal correlations along the trajectory, we introduce temporal convolution layers that are parameterized in the Fourier space into the given diffusion model backbone. We show our method achieves state-of-the-art FID of 4.12 for CIFAR-10 and 8.35 for ImageNet-64 in the one-model-evaluation setting.
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks (NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both Seq2Seq and TNNs represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Distributed Traffic Synthesis and Classification in Edge Networks: A Federated Self-supervised Learning Approach. (arXiv:2302.00207v1 [cs.LG])
    With the rising demand for wireless services and increased awareness of the need for data protection, existing network traffic analysis and management architectures are facing unprecedented challenges in classifying and synthesizing the increasingly diverse services and applications. This paper proposes FS-GAN, a federated self-supervised learning framework to support automatic traffic analysis and synthesis over a large number of heterogeneous datasets. FS-GAN is composed of multiple distributed Generative Adversarial Networks (GANs), with a set of generators, each being designed to generate synthesized data samples following the distribution of an individual service traffic, and each discriminator being trained to differentiate the synthesized data samples and the real data samples of a local dataset. A federated learning-based framework is adopted to coordinate local model training processes of different GANs across different datasets. FS-GAN can classify data of unknown types of service and create synthetic samples that capture the traffic distribution of the unknown types. We prove that FS-GAN can minimize the Jensen-Shannon Divergence (JSD) between the distribution of real data across all the datasets and that of the synthesized data samples. FS-GAN also maximizes the JSD among the distributions of data samples created by different generators, resulting in each generator producing synthetic data samples that follow the same distribution as one particular service type. Extensive simulation results show that the classification accuracy of FS-GAN achieves over 20% improvement on average compared to the state-of-the-art clustering-based traffic analysis algorithms. FS-GAN also has the capability to synthesize highly complex mixtures of traffic types without requiring any human-labeled data samples.
    Learning Topology-Preserving Data Representations. (arXiv:2302.00136v1 [cs.LG])
    We propose a method for learning topology-preserving data representations (dimensionality reduction). The method aims to provide topological similarity between the data manifold and its latent representation via enforcing the similarity in topological features (clusters, loops, 2D voids, etc.) and their localization. The core of the method is the minimization of the Representation Topology Divergence (RTD) between original high-dimensional data and low-dimensional representation in latent space. RTD minimization provides closeness in topological features with strong theoretical guarantees. We develop a scheme for RTD differentiation and apply it as a loss term for the autoencoder. The proposed method ``RTD-AE'' better preserves the global structure and topology of the data manifold than state-of-the-art competitors as measured by linear correlation, triplet distance ranking accuracy, and Wasserstein distance between persistence barcodes.
    Learning to be Fair: A Consequentialist Approach to Equitable Decision-Making. (arXiv:2109.08792v3 [cs.LG] UPDATED)
    In the dominant paradigm for designing equitable machine learning systems, one works to ensure that model predictions satisfy various fairness criteria, such as parity in error rates across race, gender, and other legally protected traits. That approach, however, typically ignores the downstream decisions and outcomes that predictions affect, and, as a result, can induce unexpected harms. Here we present an alternative framework for fairness that directly anticipates the consequences of decisions. Stakeholders first specify preferences over the possible outcomes of an algorithmically informed decision-making process. For example, lenders may prefer extending credit to those most likely to repay a loan, while also preferring similar lending rates across neighborhoods. One then searches the space of decision policies to maximize the specified utility. We develop and describe a method for efficiently learning these optimal policies from data for a large family of expressive utility functions, facilitating a more holistic approach to equitable decision-making.
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such an $\ell_2$ regression problem is via sketching: pick a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ such that $SA$ can be quickly computed, then solve the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a number of rows super-linear in $d$, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$, is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.
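    The sketch-and-solve recipe described above can be illustrated in a few lines. The example below is a minimal sketch under stated assumptions, not the paper's construction: it uses a dense i.i.d. Gaussian matrix $S$ as a stand-in for the paper's (unspecified here) dense sketching distribution, solves both problems through the normal equations, and picks the sizes `n`, `d`, `m` and helper names purely for illustration.

```python
import math
import random


def solve_least_squares(A, b):
    """Solve min_x ||Ax - b||_2 via the normal equations (small d only)."""
    d = len(A[0])
    # Augmented system [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(d)]
         + [sum(row[i] * bi for row, bi in zip(A, b))] for i in range(d)]
    for c in range(d):
        # Gaussian elimination with partial pivoting.
        p = max(range(c, d), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, d):
            f = M[r][c] / M[c][c]
            for k in range(c, d + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * d
    for i in reversed(range(d)):  # back substitution
        x[i] = (M[i][d] - sum(M[i][k] * x[k] for k in range(i + 1, d))) / M[i][i]
    return x


def sketch_and_solve(A, b, m, rng):
    """Solve argmin_x ||SAx - Sb||_2 for a dense Gaussian S (assumption)."""
    n, d = len(A), len(A[0])
    S = [[rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(n)]
         for _ in range(m)]
    SA = [[sum(s[i] * A[i][j] for i in range(n)) for j in range(d)] for s in S]
    Sb = [sum(s[i] * b[i] for i in range(n)) for s in S]
    return solve_least_squares(SA, Sb)


rng = random.Random(0)
n, d, m = 200, 3, 60
A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
x_true = [1.0, -2.0, 0.5]
b = [sum(a * x for a, x in zip(row, x_true)) + rng.gauss(0.0, 0.1) for row in A]
x_star = solve_least_squares(A, b)        # exact least-squares solution
x_sketch = sketch_and_solve(A, b, m, rng)  # solution of the sketched problem
linf_error = max(abs(u - v) for u, v in zip(x_sketch, x_star))
```

    Here `linf_error` is the quantity $\|x'-x^*\|_\infty$ that the guarantee bounds. A Gaussian sketch takes $O(mnd)$ time to apply, so it does not reflect the $O(nd\log n)$ running time claimed for the paper's sketches.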
    Homotopy-based training of NeuralODEs for accurate dynamics discovery. (arXiv:2210.01407v3 [cs.LG] UPDATED)
    Conceptually, Neural Ordinary Differential Equations (NeuralODEs) pose an attractive way to extract dynamical laws from time series data, as they are natural extensions of the traditional differential equation-based modeling paradigm of the physical sciences. In practice, NeuralODEs display long training times and suboptimal results, especially for longer duration data where they may fail to fit the data altogether. While methods have been proposed to stabilize NeuralODE training, many of these involve placing a strong constraint on the functional form the trained NeuralODE can take, a constraint that the actual underlying governing equation is not guaranteed to satisfy. In this work, we present a novel NeuralODE training algorithm that leverages tools from the chaos and mathematical optimization communities - synchronization and homotopy optimization - for a breakthrough in tackling the NeuralODE training obstacle. We demonstrate that architectural changes are unnecessary for effective NeuralODE training. Compared to the conventional training methods, our algorithm achieves drastically lower loss values without any changes to the model architectures. Experiments on both simulated and real systems with complex temporal behaviors demonstrate that NeuralODEs trained with our algorithm are able to accurately capture true long-term behaviors and correctly extrapolate into the future.
    Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets. (arXiv:2010.12909v3 [cs.LG] UPDATED)
    We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyse both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic data sets. Experimental results on simple datasets support our claim on sparse EWN solutions, even with SGD. This demonstrates its potential applications in learning neural networks amenable to pruning.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    $\rm A^2Q$: Aggregation-Aware Quantization for Graph Neural Networks. (arXiv:2302.00193v1 [cs.LG])
    As graph data size increases, the vast latency and memory consumption during inference pose a significant challenge to the real-world deployment of Graph Neural Networks (GNNs). While quantization is a powerful approach to reducing GNN complexity, most previous works on GNN quantization fail to exploit the unique characteristics of GNNs, suffering from severe accuracy degradation. Through an in-depth analysis of the topology of GNNs, we observe that the topology of the graph leads to significant differences between nodes, and most of the nodes in a graph appear to have a small aggregation value. Motivated by this, in this paper, we propose the Aggregation-Aware mixed-precision Quantization ($\rm A^2Q$) for GNNs, where an appropriate bitwidth is automatically learned and assigned to each node in the graph. To mitigate the vanishing gradient problem caused by sparse connections between nodes, we propose a Local Gradient method that uses the quantization error of the node features as the supervision during training. We also develop a Nearest Neighbor Strategy to deal with the generalization on unseen graphs. Extensive experiments on eight public node-level and graph-level datasets demonstrate the generality and robustness of our proposed method. Compared to the FP32 models, our method can achieve up to an 18.6x (i.e., 1.70bit) compression ratio with negligible accuracy degradation. Moreover, compared to the state-of-the-art quantization method, our method can achieve up to 11.4\% and 9.5\% accuracy improvements on the node-level and graph-level tasks, respectively, and up to 2x speedup on a dedicated hardware accelerator.
    Weight Prediction Boosts the Convergence of AdamW. (arXiv:2302.00195v1 [cs.LG])
    In this paper, we introduce weight prediction into the AdamW optimizer to boost its convergence when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to the update rule of AdamW and then apply the predicted future weights to do both forward pass and backward propagation. In this way, the AdamW optimizer always utilizes the gradients w.r.t. the future weights instead of current weights to update the DNN parameters, making the AdamW optimizer achieve better convergence. Our proposal is simple and straightforward to implement but effective in boosting the convergence of DNN training. We performed extensive experimental evaluations on image classification and language modeling tasks to verify the effectiveness of our proposal. The experimental results validate that our proposal can boost the convergence of AdamW and achieve better accuracy than AdamW when training the DNN models.
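    The predict-then-update loop described in this abstract can be sketched for a single scalar weight. This is a minimal illustration and not the authors' code: the prediction rule (replaying the AdamW step `lookahead` times with the moments cached from the previous step) and the names `adamw_step`, `adamw_with_weight_prediction`, and `lookahead` are assumptions made here for illustration.

```python
import math


def adamw_step(w, m, v, t, lr, beta1, beta2, eps, weight_decay):
    """Size of one AdamW update for weight w, given raw moments m, v at step t."""
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)


def adamw_with_weight_prediction(grad_fn, w0, steps=500, lr=0.1, beta1=0.9,
                                 beta2=0.999, eps=1e-8, weight_decay=0.0,
                                 lookahead=1):
    """Minimise a scalar objective, taking gradients at predicted future weights.

    lookahead=0 recovers plain AdamW. The prediction rule is an assumption.
    """
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Predict future weights by replaying the AdamW rule with the moments
        # cached from step t-1 (no moments exist yet at t == 1).
        w_pred = w
        if t > 1:
            for _ in range(lookahead):
                w_pred -= adamw_step(w_pred, m, v, t - 1, lr, beta1, beta2,
                                     eps, weight_decay)
        # Forward/backward pass at the predicted weights, not the current ones.
        g = grad_fn(w_pred)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # The actual update is still applied to the current weights.
        w -= adamw_step(w, m, v, t, lr, beta1, beta2, eps, weight_decay)
    return w
```

    For example, `adamw_with_weight_prediction(lambda w: 2 * (w - 3), 0.0)` minimises the toy objective $(w-3)^2$; the only change relative to plain AdamW is where the gradient is evaluated.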
    TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation. (arXiv:2302.00247v1 [cs.LG])
    Model parallelism has become necessary to train large neural networks. However, finding a suitable model parallel schedule for an arbitrary neural network is a non-trivial task due to the exploding search space. In this work, we present a model parallelism framework TAP that automatically searches for the best data and tensor parallel schedules. Leveraging the key insight that a neural network can be represented as a directed acyclic graph, within which may only exist a limited set of frequent subgraphs, we design a graph pruning algorithm to fold the search space efficiently. TAP runs at sub-linear complexity concerning the neural network size. Experiments show that TAP is $20\times- 160\times$ faster than the state-of-the-art automatic parallelism framework, and the performance of its discovered schedules is competitive with the expert-engineered ones.
    A Transaction Represented with Weighted Finite-State Transducers. (arXiv:2302.00200v1 [cs.FL])
    Not all contracts are good, but all good contracts can be expressed as a finite-state transition system ("State-Transition Contracts"). Contracts that can be represented as State-Transition Contracts discretize fat-tailed risk to foreseeable, managed risk, define the boundary of relevant events governed by the relationship, and eliminate the potential of inconsistent contractual provisions. Additionally, State-Transition Contracts reap the substantial benefit of being able to be analyzed under the rules governing the science of the theory of computation. Simple State-Transition Contracts can be represented as discrete finite automata; more complicated State-Transition Contracts, such as those that have downstream effects on other agreements or complicated pathways of performance, benefit from representation as weighted finite-state transducers, with weights assigned as costs, penalties, or probabilities of transitions. This research paper (the "Research" or "Paper") presents a complex legal transaction represented as weighted finite-state transducers. Furthermore, we show that the mathematics/algorithms permitted by the algebraic structure of weighted finite-state transducers provides actionable, legal insight into the transaction.
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate integrating the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression can be implemented with adaptive sparseness. Specifically, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression eliminates the troublesome tuning of the hyper-parameter that controls the regularization strength. Further, MCC-ARD achieves better prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.
    A Prescriptive Learning Analytics Framework: Beyond Predictive Modelling and onto Explainable AI with Prescriptive Analytics and ChatGPT. (arXiv:2208.14582v2 [cs.LG] UPDATED)
    A significant body of recent research in the field of Learning Analytics has focused on leveraging machine learning approaches for predicting at-risk students in order to initiate timely interventions and thereby elevate retention and completion rates. The overarching feature of the majority of these research studies has been on the science of prediction only. The component of predictive analytics concerned with interpreting the internals of the models and explaining their predictions for individual cases to stakeholders has largely been neglected. Additionally, works that attempt to employ data-driven prescriptive analytics to automatically generate evidence-based remedial advice for at-risk learners are in their infancy. eXplainable AI is a field that has recently emerged providing cutting-edge tools which support transparent predictive analytics and techniques for generating tailored advice for at-risk students. This study proposes a novel framework that unifies both transparent machine learning as well as techniques for enabling prescriptive analytics, while integrating the latest advances in large language models. This work practically demonstrates the proposed framework using predictive models for identifying at-risk learners of programme non-completion. The study then further demonstrates how predictive modelling can be augmented with prescriptive analytics on two case studies in order to generate human-readable prescriptive feedback for those who are at risk using ChatGPT.
    Selective Uncertainty Propagation in Offline RL. (arXiv:2302.00284v1 [cs.LG])
    We study the finite-horizon offline reinforcement learning (RL) problem. Since actions at any state can affect next-state distributions, the related distributional shift challenges can make this problem far more statistically complex than offline policy learning for a finite sequence of stochastic contextual bandit environments. We formalize this insight by showing that the statistical hardness of offline RL instances can be measured by estimating the size of actions' impact on next-state distributions. Furthermore, this estimated impact allows us to propagate just enough value function uncertainty from future steps to avoid model exploitation, enabling us to develop algorithms that improve upon traditional pessimistic approaches for offline RL on statistically simple instances. Our approach is supported by theory and simulations.
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
    Program Generation from Diverse Video Demonstrations. (arXiv:2302.00178v1 [cs.CV])
    The ability to use inductive reasoning to extract general rules from multiple observations is a vital indicator of intelligence. As humans, we use this ability to not only interpret the world around us, but also to predict the outcomes of the various interactions we experience. Generalising over multiple observations is a task that has historically presented difficulties for machines to grasp, especially when requiring computer vision. In this paper, we propose a model that can extract general rules from video demonstrations by simultaneously performing summarisation and translation. Our approach differs from prior works by framing the problem as a multi-sequence-to-sequence task, wherein summarisation is learnt by the model. This allows our model to utilise edge cases that would otherwise be suppressed or discarded by traditional summarisation techniques. Additionally, we show that our approach can handle noisy specifications without the need for additional filtering methods. We evaluate our model by synthesising programs from video demonstrations in the Vizdoom environment, achieving state-of-the-art results with a relative increase of 11.75% in program accuracy over prior works.
    Distillation Policy Optimization. (arXiv:2302.00533v1 [cs.LG])
    On-policy algorithms are generally stable but sample-intensive. Off-policy algorithms, which reuse past experience, are considered sample-efficient but are unstable in general. Can we design an algorithm that employs off-policy data while retaining the stable learning of the on-policy approach? In this paper, we present an actor-critic learning framework that borrows the distributional perspective of interest to evaluate, and cross-breeds two sources of data for policy improvement, which enables fast learning and can be applied to a wide class of algorithms. At its core, we first put forward variance reduction mechanisms, namely a unified advantage estimator (UAE), which extends the generalized advantage estimator (GAE) to any state-dependent baseline, and a learned baseline that can stabilize the policy gradient; these serve not merely as a bridge to the action-value function but also distill the advantageous learning signal. Lastly, it is empirically shown that our method improves sample efficiency and interpolates different levels well. As an organic whole, its mixture offers further inspiration for algorithm design.
    GANravel: User-Driven Direction Disentanglement in Generative Adversarial Networks. (arXiv:2302.00079v1 [cs.HC])
    Generative adversarial networks (GANs) have many application areas including image editing, domain translation, missing data imputation, and support for creative work. However, GANs are considered 'black boxes'. Specifically, the end-users have little control over how to improve editing directions through disentanglement. Prior work focused on new GAN architectures to disentangle editing directions. Alternatively, we propose GANravel, a user-driven direction disentanglement tool that complements the existing GAN architectures and allows users to improve editing directions iteratively. In two user studies with 16 participants each, GANravel users were able to disentangle directions and outperformed the state-of-the-art direction discovery baselines in disentanglement performance. In the second user study, GANravel was used in a creative task of creating dog memes and was able to create high-quality edited images and GIFs.
    Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor. (arXiv:2302.00036v1 [cs.LG])
    We introduce the Blackwell discount factor for Markov Decision Processes (MDPs). Classical objectives for MDPs include discounted, average, and Blackwell optimality. Many existing approaches to computing average-optimal policies solve for discounted optimal policies with a discount factor close to $1$, but they only work under strong or hard-to-verify assumptions such as ergodicity or weakly communicating MDPs. In this paper, we show that when the discount factor is larger than the Blackwell discount factor $\gamma_{\mathrm{bw}}$, all discounted optimal policies become Blackwell- and average-optimal, and we derive a general upper bound on $\gamma_{\mathrm{bw}}$. The upper bound on $\gamma_{\mathrm{bw}}$ provides the first reduction from average and Blackwell optimality to discounted optimality, without any assumptions, and new polynomial-time algorithms for average- and Blackwell-optimal policies. Our work brings new ideas from the study of polynomials and algebraic numbers to the analysis of MDPs. Our results also apply to robust MDPs, enabling the first algorithms to compute robust Blackwell-optimal policies.
  • Open

    Data fission: splitting a single data point. (arXiv:2112.11079v7 [stat.ME] UPDATED)
    Suppose we observe a random vector $X$ from some distribution $P$ in a known family with unknown parameters. We ask the following question: when is it possible to split $X$ into two parts $f(X)$ and $g(X)$ such that neither part is sufficient to reconstruct $X$ by itself, but both together can recover $X$ fully, and the joint distribution of $(f(X),g(X))$ is tractable? As one example, if $X=(X_1,\dots,X_n)$ and $P$ is a product distribution, then for any $m<n$, we can split the sample to define $f(X)=(X_1,\dots,X_m)$ and $g(X)=(X_{m+1},\dots,X_n)$. Rasines and Young (2022) offer an alternative route of accomplishing this task through randomization of $X$ with additive Gaussian noise, which enables post-selection inference in finite samples for Gaussian distributed data and asymptotically for non-Gaussian additive models. In this paper, we offer a more general methodology for achieving such a split in finite samples by borrowing ideas from Bayesian inference to yield a (frequentist) solution that can be viewed as a continuous analog of data splitting. We call our method data fission, as an alternative to data splitting, data carving and p-value masking. We exemplify the method on a few prototypical applications, such as post-selection inference for trend filtering and other regression problems.
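    As a concrete instance of the additive-Gaussian-noise route mentioned in this abstract, consider a single $X\sim N(\mu,\sigma^2)$ with known $\sigma$. Drawing independent $Z\sim N(0,\sigma^2)$ and setting $f(X)=X+\tau Z$, $g(X)=X-Z/\tau$ gives two parts with covariance $\sigma^2-\sigma^2=0$, hence independent, while $X=(f+\tau^2 g)/(1+\tau^2)$ is recovered exactly from both. The sketch below only illustrates this Gaussian case; the function names and the tuning parameter `tau` are assumptions made here, not an API from the paper.

```python
import random


def fission_gaussian(x, sigma, tau, rng):
    """Split x ~ N(mu, sigma^2) into two parts using fresh noise z ~ N(0, sigma^2)."""
    z = rng.gauss(0.0, sigma)
    return x + tau * z, x - z / tau


def reconstruct(f, g, tau):
    """Recover x from both parts: f + tau^2 * g = (1 + tau^2) * x."""
    return (f + tau ** 2 * g) / (1 + tau ** 2)


def sample_covariance(pairs):
    """Unbiased sample covariance of a list of (f, g) pairs."""
    n = len(pairs)
    mean_f = sum(f for f, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    return sum((f - mean_f) * (g - mean_g) for f, g in pairs) / (n - 1)


def demo(n=20000, mu=2.0, sigma=1.0, tau=1.0, seed=0):
    """Return (sample covariance of the parts, worst reconstruction error)."""
    rng = random.Random(seed)
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    pairs = [fission_gaussian(x, sigma, tau, rng) for x in xs]
    worst = max(abs(reconstruct(f, g, tau) - x) for x, (f, g) in zip(xs, pairs))
    return sample_covariance(pairs), worst
```

    Larger `tau` puts more of the information about $\mu$ into $g(X)$ and less into $f(X)$, which plays the role that the split fraction plays in ordinary data splitting.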
    Quantum machine learning beyond kernel methods. (arXiv:2110.13162v3 [quant-ph] UPDATED)
    Machine learning algorithms based on parametrized quantum circuits are prime candidates for near-term applications on noisy quantum computers. In this direction, various types of quantum machine learning models have been introduced and studied extensively. Yet, our understanding of how these models compare, both mutually and to classical models, remains limited. In this work, we identify a constructive framework that captures all standard models based on parametrized quantum circuits: that of linear quantum models. In particular, we show using tools from quantum information theory how data re-uploading circuits, an apparent outlier of this framework, can be efficiently mapped into the simpler picture of linear models in quantum Hilbert spaces. Furthermore, we analyze the experimentally-relevant resource requirements of these models in terms of qubit number and amount of data needed to learn. Based on recent results from classical machine learning, we prove that linear quantum models must utilize exponentially more qubits than data re-uploading models in order to solve certain learning tasks, while kernel methods additionally require exponentially more data points. Our results provide a more comprehensive view of quantum machine learning models as well as insights on the compatibility of different models with NISQ constraints.
    Sequential Predictive Conformal Inference for Time Series. (arXiv:2212.03463v2 [stat.ML] UPDATED)
    We present a new distribution-free conformal prediction algorithm for sequential data (e.g., time series), called the \textit{sequential predictive conformal inference} (\texttt{SPCI}). We specifically account for the fact that time series data are non-exchangeable, and thus many existing conformal prediction algorithms are not applicable. The main idea is to exploit the temporal dependence of non-conformity scores (e.g., prediction residuals); thus, the past residuals contain information about future ones. Then we cast the problem of constructing a conformal prediction interval as predicting the quantile of a future residual, given a user-specified point prediction algorithm. Theoretically, we establish asymptotically valid conditional coverage upon extending consistency analyses in quantile regression. Using simulation and real-data experiments, we demonstrate a significant reduction in interval width of \texttt{SPCI} compared to other existing methods under the desired empirical coverage.
    Simplicity Bias in 1-Hidden Layer Neural Networks. (arXiv:2302.00457v1 [cs.LG])
    Recent works have demonstrated that neural networks exhibit extreme simplicity bias (SB). That is, they learn only the simplest features to solve a task at hand, even in the presence of other, more robust but more complex features. Due to the lack of a general and rigorous definition of features, these works showcase SB on semi-synthetic datasets such as Color-MNIST and MNIST-CIFAR, where defining features is relatively easier. In this work, we rigorously define as well as thoroughly establish SB for one hidden layer neural networks. More concretely, (i) we define SB as the network essentially being a function of a low dimensional projection of the inputs, (ii) theoretically, we show that when the data is linearly separable, the network primarily depends on only the linearly separable ($1$-dimensional) subspace even in the presence of an arbitrarily large number of other, more complex features which could have led to a significantly more robust classifier, (iii) empirically, we show that models trained on real datasets such as Imagenette and Waterbirds-Landbirds indeed depend on a low dimensional projection of the inputs, thereby demonstrating SB on these datasets, (iv) finally, we present a natural ensemble approach that encourages diversity in models by training successive models on features not used by earlier models, and demonstrate that it yields models that are significantly more robust to Gaussian noise.
    Optimal Learning of Deep Random Networks of Extensive-width. (arXiv:2302.00375v1 [stat.ML])
    We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We derive a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We contrast these Bayes-optimal errors with the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG] CROSS LISTED)
    In many applications in engineering and the sciences, analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides high predictive power while quantifying various sources of uncertainty.
    Robust Fitted-Q-Evaluation and Iteration under Sequentially Exogenous Unobserved Confounders. (arXiv:2302.00662v1 [stat.ML])
    Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce where online experimentation is costly, dangerous or unethical, and where the true model is unknown. However, most methods assume all covariates used in the behavior policy's action decisions are observed. This untestable assumption may be incorrect. We study robust policy evaluation and policy optimization in the presence of unobserved confounders. We assume the extent of possible unobserved confounding can be bounded by a sensitivity model, and that the unobserved confounders are sequentially exogenous. We propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses closed-form solutions of the robust Bellman operator to derive a loss minimization problem for the robust Q function. Our algorithm enjoys the computational ease of fitted-Q-iteration and statistical improvements (reduced dependence on quantile estimation error) from orthogonalization. We provide sample complexity bounds, insights, and show effectiveness in simulations.
    Generative methods for sampling transition paths in molecular dynamics. (arXiv:2205.02818v2 [stat.ML] UPDATED)
    Molecular systems often remain trapped for long times around some local minimum of the potential energy function, before switching to another one -- a behavior known as metastability. Simulating transition paths linking one metastable state to another one is difficult by direct numerical methods. In view of the promises of machine learning techniques, we explore in this work two approaches to more efficiently generate transition paths: sampling methods based on generative models such as variational autoencoders, and importance sampling methods based on reinforcement learning.
    Stream-based active learning with linear models. (arXiv:2207.09874v3 [stat.ML] UPDATED)
    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.
    Additive Higher-Order Factorization Machines. (arXiv:2205.14515v2 [stat.CO] UPDATED)
    In the age of big data and interpretable machine learning, approaches need to work at scale and at the same time allow for a clear mathematical understanding of the method's inner workings. While there exist inherently interpretable semi-parametric regression techniques for large-scale applications to account for non-linearity in the data, their model complexity is still often restricted. One of the main limitations is the absence of interactions in these models, which are left out for the sake of better interpretability, but also due to untenable computational costs. To address this shortcoming, we derive a scalable high-order tensor product spline model using a factorization approach. Our method allows the inclusion of all (higher-order) interactions of non-linear feature effects while having computational costs proportional to a model without interactions. We prove both theoretically and empirically that our method scales notably better than existing approaches, derive meaningful penalization schemes and also discuss further theoretical aspects. We finally investigate predictive and estimation performance both with synthetic and real data.
    Posterior Sampling for Continuing Environments. (arXiv:2211.15931v2 [cs.LG] UPDATED)
    We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
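    The resampling rule in this abstract (replace the model with a fresh posterior sample with probability $1-\gamma$ at each time step) can be sketched as a small loop. This is only an illustration of the schedule, not the authors' agent: `sample_model` is a hypothetical stand-in for drawing an environment from the posterior, and planning and acting under the current model are omitted.

```python
import random


def model_schedule(horizon, gamma, sample_model, rng):
    """Return the model in force at each of `horizon` time steps.

    sample_model: hypothetical zero-argument callable that draws one
    environment model from the current posterior.
    """
    model = sample_model()  # initial posterior sample
    schedule = []
    for _ in range(horizon):
        # With probability 1 - gamma, discard the model and resample.
        if rng.random() < 1.0 - gamma:
            model = sample_model()
        # An agent would act here with a policy that is optimal for `model`.
        schedule.append(model)
    return schedule
```

    Because each step resamples independently with probability $1-\gamma$, the time between resamples is geometric with mean $1/(1-\gamma)$, so a discount factor closer to $1$ commits the agent to each sampled model for longer.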
    Sliced Optimal Partial Transport. (arXiv:2212.08049v4 [cs.LG] UPDATED)
    Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration.
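    The slicing idea mentioned above can be sketched for the classical balanced case, where both measures are uniform over the same number of points. This is not the paper's partial-transport algorithm, which handles unequal mass; it is a minimal sliced 2-Wasserstein sketch, and all function names are hypothetical.

```python
import math
import random


def wasserstein2_1d(a, b):
    """2-Wasserstein distance between two equal-size 1D samples (uniform weights)."""
    if len(a) != len(b):
        raise ValueError("balanced sketch: samples must have the same size")
    # In one dimension the optimal coupling matches sorted points in order.
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(sorted(a), sorted(b))) / len(a))


def sliced_wasserstein2(xs, ys, n_projections=64, seed=0):
    """Average the squared 1D distances over random unit directions."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction and normalize it to unit length.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both point clouds onto the direction.
        px = [sum(t * c for t, c in zip(theta, p)) for p in xs]
        py = [sum(t * c for t, c in zip(theta, p)) for p in ys]
        total += wasserstein2_1d(px, py) ** 2
    return math.sqrt(total / n_projections)
```

    Each projection costs only a sort, which is where sliced distances get their computational advantage over solving a high-dimensional linear program.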
    Predicting CSI Sequences With Attention-Based Neural Networks. (arXiv:2302.00341v1 [stat.ML])
    In this work, we consider the problem of multi-step channel prediction in wireless communication systems. In existing works, autoregressive (AR) models are either replaced or combined with feed-forward neural networks (NNs) or, alternatively, with recurrent neural networks (RNNs). This paper explores the possibility of using sequence-to-sequence (Seq2Seq) and transformer neural network (TNN) models for channel state information (CSI) prediction. Simulation results show that both Seq2Seq and TNNs represent an appealing alternative to RNNs and feed-forward NNs in the context of CSI prediction. Additionally, the TNN with a few adaptations can extrapolate better than other models to CSI sequences that are either shorter or longer than the ones the model saw during training.
    Deterministic equivalent and error universality of deep random features learning. (arXiv:2302.00401v1 [stat.ML])
    This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
    Variational Causal Inference. (arXiv:2209.05935v2 [stat.ML] UPDATED)
    Estimating an individual's potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, impulse responses, human faces) and covariates are relatively limited. In this case, to construct one's outcome under a counterfactual treatment, it is crucial to leverage individual information contained in its observed factual outcome on top of the covariates. We propose a deep variational Bayesian framework that rigorously integrates two main sources of information for outcome construction under a counterfactual treatment: one source is the individual features embedded in the high-dimensional factual outcome; the other source is the response distribution of similar subjects (subjects with the same covariates) that factually received this treatment of interest.
    Distribution free optimality intervals for clustering. (arXiv:2107.14442v2 [stat.ML] UPDATED)
    We address the problem of validating the output of clustering algorithms. Given data $\mathcal{D}$ and a partition $\mathcal{C}$ of these data into $K$ clusters, when can we say that the clusters obtained are correct or meaningful for the data? This paper introduces a paradigm in which a clustering $\mathcal{C}$ is considered meaningful if it is good with respect to a loss function such as the K-means distortion, and stable, i.e. the only good clustering up to small perturbations. Furthermore, we present a generic method to obtain post-inference guarantees of near-optimality and stability for a clustering $\mathcal{C}$. The method can be instantiated for a variety of clustering criteria (also called loss functions) for which convex relaxations exist. Obtaining the guarantees amounts to solving a convex optimization problem. We demonstrate the practical relevance of this method by obtaining guarantees for the K-means and the Normalized Cut clustering criteria on realistic data sets. We also prove that asymptotic instability implies finite sample instability w.h.p., allowing inferences about the population clusterability from a sample. The guarantees do not depend on any distributional assumptions, but they depend on the data set $\mathcal{D}$ admitting a stable clustering.
    Approximate Bayesian Computation with Path Signatures. (arXiv:2106.12555v2 [stat.ME] UPDATED)
    Simulation models often lack tractable likelihood functions, making likelihood-free inference methods indispensable. Approximate Bayesian computation generates likelihood-free posterior samples by comparing simulated and observed data through some distance measure, but existing approaches are often poorly suited to time series simulators, for example due to an independent and identically distributed data assumption. In this paper, we propose to use path signatures in approximate Bayesian computation to handle the sequential nature of time series. We provide theoretical guarantees on the resultant posteriors and demonstrate competitive Bayesian parameter inference for simulators generating univariate, multivariate, irregularly spaced, and even non-Euclidean sequences.
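    The basic rejection form of approximate Bayesian computation described above can be sketched as follows. This is a generic illustration under stated assumptions, not the paper's method: the `distance` argument is a placeholder where a path-signature-based distance would go, and all names and parameters are hypothetical.

```python
import random


def abc_rejection(observed, simulate, prior_sample, distance, epsilon, n_draws, seed=0):
    """Keep prior draws whose simulated data land within epsilon of the observations."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)          # candidate parameter from the prior
        simulated = simulate(theta, rng)   # run the simulator, no likelihood needed
        if distance(simulated, observed) <= epsilon:
            accepted.append(theta)         # approximate posterior sample
    return accepted


def mean_distance(a, b):
    """Toy distance: absolute difference of sample means (assumes i.i.d. data)."""
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

    The toy `mean_distance` ignores the ordering of the data entirely, which is the kind of i.i.d. assumption the abstract identifies as a poor fit for time series simulators.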
    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v2 [cs.LG] UPDATED)
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.
    How to select predictive models for causal inference?. (arXiv:2302.00370v1 [stat.ML])
    Predictive models -- as with machine learning -- can underpin causal inference, to estimate the effects of an intervention at the population or individual level. This opens the door to a plethora of models, useful to match the increasing complexity of health data, but also the Pandora's box of model selection: which of these models yield the most valid causal estimates? Classic machine-learning cross-validation procedures are not directly applicable. Indeed, an appropriate selection procedure for causal inference should equally weight both outcome errors for each individual, treated or not treated, whereas one outcome may be seldom observed for a sub-population. We study how more elaborate risks benefit causal model selection. We show theoretically that simple risks are brittle to weak overlap between treated and non-treated individuals as well as to heterogeneous errors between populations. Rather, a more elaborate metric, the R-risk, appears as a proxy of the oracle error on causal estimates, observable at the cost of an overlap re-weighting. As the R-risk is defined not only from model predictions but also by using the conditional mean outcome and the treatment probability, using it for model selection requires adapting cross validation. Extensive experiments show that the resulting procedure gives the best causal model selection.
    Learning Equilibria in Matching Markets from Bandit Feedback. (arXiv:2108.08843v2 [cs.LG] UPDATED)
    Large-scale, two-sided matching platforms must find market outcomes that align with user preferences while simultaneously learning these preferences from data. Classical notions of stability (Gale and Shapley, 1962; Shapley and Shubik, 1971) are unfortunately of limited value in the learning setting, given that preferences are inherently uncertain and destabilizing while they are being learned. To bridge this gap, we develop a framework and algorithms for learning stable market outcomes under uncertainty. Our primary setting is matching with transferable utilities, where the platform both matches agents and sets monetary transfers between them. We design an incentive-aware learning objective that captures the distance of a market outcome from equilibrium. Using this objective, we analyze the complexity of learning as a function of preference structure, casting learning as a stochastic multi-armed bandit problem. Algorithmically, we show that "optimism in the face of uncertainty," the principle underlying many bandit algorithms, applies to a primal-dual formulation of matching with transfers and leads to near-optimal regret bounds. Our work takes a first step toward elucidating when and how stable matchings arise in large, data-driven marketplaces.
    Incorporating Sum Constraints into Multitask Gaussian Processes. (arXiv:2202.01793v3 [stat.ML] UPDATED)
    Machine learning models can be improved by adapting them to respect existing background knowledge. In this paper we consider multitask Gaussian processes, with background knowledge in the form of constraints that require a specific sum of the outputs to be constant. This is achieved by conditioning the prior distribution on the constraint fulfillment. The approach allows for both linear and nonlinear constraints. We demonstrate that the constraints are fulfilled with high precision and that the construction can improve the overall prediction accuracy as compared to the standard Gaussian process.
    Automatically Marginalized MCMC in Probabilistic Programming. (arXiv:2302.00564v1 [cs.LG])
    Hamiltonian Monte Carlo (HMC) is a powerful algorithm to sample latent variables from Bayesian models. The advent of probabilistic programming languages (PPLs) frees users from writing inference algorithms and lets users focus on modeling. However, many models are difficult for HMC to solve directly and often require tricks like model reparameterization. We are motivated by the fact that many of those models could be simplified by marginalization. We propose to use automatic marginalization as part of the sampling process using HMC in a graphical model extracted from a PPL, which substantially improves sampling from real-world hierarchical models.
    Offline Estimation of Controlled Markov Chains: Minimaxity and Sample Complexity. (arXiv:2211.07092v3 [stat.ML] UPDATED)
    In this work, we study a natural nonparametric estimator of the transition probability matrices of a finite controlled Markov chain. We consider an offline setting with a fixed dataset, collected using a so-called logging policy. We develop sample complexity bounds for the estimator and establish conditions for minimaxity. Our statistical bounds depend on the logging policy through its mixing properties. We show that achieving a particular statistical risk bound involves a subtle and interesting trade-off between the strength of the mixing properties and the number of samples. We demonstrate the validity of our results under various examples, such as ergodic Markov chains, weakly ergodic inhomogeneous Markov chains, and controlled Markov chains with non-stationary Markov, episodic, and greedy controls. Lastly, we use these sample complexity bounds to establish concomitant ones for offline evaluation of stationary Markov control policies.
    A Group-Equivariant Autoencoder for Identifying Spontaneously Broken Symmetries. (arXiv:2202.06319v2 [cond-mat.stat-mech] UPDATED)
    We introduce the group-equivariant autoencoder (GE-autoencoder) -- a deep neural network (DNN) method that locates phase boundaries by determining which symmetries of the Hamiltonian have spontaneously broken at each temperature. We use group theory to deduce which symmetries of the system remain intact in all phases, and then use this information to constrain the parameters of the GE-autoencoder such that the encoder learns an order parameter invariant to these ``never-broken'' symmetries. This procedure produces a dramatic reduction in the number of free parameters such that the GE-autoencoder size is independent of the system size. We include symmetry regularization terms in the loss function of the GE-autoencoder so that the learned order parameter is also equivariant to the remaining symmetries of the system. By examining the group representation by which the learned order parameter transforms, we are then able to extract information about the associated spontaneous symmetry breaking. We test the GE-autoencoder on the 2D classical ferromagnetic and antiferromagnetic Ising models, finding that the GE-autoencoder (1) accurately determines which symmetries have spontaneously broken at each temperature; (2) estimates the critical temperature in the thermodynamic limit with greater accuracy, robustness, and time-efficiency than a symmetry-agnostic baseline autoencoder; and (3) detects the presence of an external symmetry-breaking magnetic field with greater sensitivity than the baseline method. Finally, we describe various key implementation details, including a new method for extracting the critical temperature estimate from trained autoencoders and calculations of the DNN initialization and learning rate settings required for fair model comparisons.
    On the Within-Group Discrimination of Screening Classifiers. (arXiv:2302.00025v1 [cs.LG])
    Screening classifiers are increasingly used to identify qualified candidates in a variety of selection processes. In this context, it has been recently shown that, if a classifier is calibrated, one can identify the smallest set of candidates which contains, in expectation, a desired number of qualified candidates using a threshold decision rule. This lends support to focusing on calibration as the only requirement for screening classifiers. In this paper, we argue that screening policies that use calibrated classifiers may suffer from an understudied type of within-group discrimination -- they may discriminate against qualified members within demographic groups of interest. Further, we argue that this type of discrimination can be avoided if classifiers satisfy within-group monotonicity, a natural monotonicity property within each of the groups. Then, we introduce an efficient post-processing algorithm based on dynamic programming to minimally modify a given calibrated classifier so that its probability estimates satisfy within-group monotonicity. We validate our algorithm using US Census survey data and show that within-group monotonicity can be often achieved at a small cost in terms of prediction granularity and shortlist size.
    A Fast, Well-Founded Approximation to the Empirical Neural Tangent Kernel. (arXiv:2206.12543v2 [stat.ML] UPDATED)
    Empirical neural tangent kernels (eNTKs) can provide a good understanding of a given network's representation: they are often far less expensive to compute and applicable more broadly than infinite width NTKs. For networks with O output units (e.g. an O-class classifier), however, the eNTK on N inputs is of size $NO \times NO$, taking $O((NO)^2)$ memory and up to $O((NO)^3)$ computation. Most existing applications have therefore used one of a handful of approximations yielding $N \times N$ kernel matrices, saving orders of magnitude of computation, but with limited to no justification. We prove that one such approximation, which we call "sum of logits", converges to the true eNTK at initialization for any network with a wide final "readout" layer. Our experiments demonstrate the quality of this approximation for various uses across a range of settings.
    $\texttt{DoCoFL}$: Downlink Compression for Cross-Device Federated Learning. (arXiv:2302.00543v1 [cs.LG])
    Many compression techniques have been proposed to reduce the communication overhead of Federated Learning training procedures. However, these are typically designed for compressing model updates, which are expected to decay throughout training. As a result, such methods are inapplicable to downlink (i.e., from the parameter server to clients) compression in the cross-device setting, where heterogeneous clients $\textit{may appear only once}$ during training and thus must download the model parameters. In this paper, we propose a new framework ($\texttt{DoCoFL}$) for downlink compression in the cross-device federated learning setting. Importantly, $\texttt{DoCoFL}$ can be seamlessly combined with many uplink compression schemes, rendering it suitable for bi-directional compression. Through extensive evaluation, we demonstrate that $\texttt{DoCoFL}$ offers significant bi-directional bandwidth reduction while achieving competitive accuracy to that of $\texttt{FedAvg}$ without compression.
    Width and Depth Limits Commute in Residual Networks. (arXiv:2302.00453v1 [stat.ML])
    We show that taking the width and depth to infinity in a deep neural network with skip connections, when branches are scaled by $1/\sqrt{depth}$ (the only nontrivial scaling), results in the same covariance structure no matter how that limit is taken. This explains why the standard infinite-width-then-depth approach provides practical insights even for networks with depth of the same order as width. We also demonstrate that the pre-activations, in this case, have Gaussian distributions, which has direct applications in Bayesian deep learning. We conduct extensive simulations that show an excellent match with our theoretical findings.
    Delayed Feedback in Kernel Bandits. (arXiv:2302.00392v1 [stat.ML])
    Black box optimisation of an unknown function from expensive and noisy evaluations is a ubiquitous problem in machine learning, academic research and industrial production. An abstraction of the problem can be formulated as a kernel based bandit problem (also known as Bayesian optimisation), where a learner aims at optimising a kernelized function through sequential noisy observations. The existing work predominantly assumes feedback is immediately available; an assumption which fails in many real world situations, including recommendation systems, clinical trials and hyperparameter tuning. We consider a kernel bandit problem under stochastically delayed feedback, and propose an algorithm with $\tilde{\mathcal{O}}(\sqrt{\Gamma_k(T)T}+\mathbb{E}[\tau])$ regret, where $T$ is the number of time steps, $\Gamma_k(T)$ is the maximum information gain of the kernel with $T$ observations, and $\tau$ is the delay random variable. This represents a significant improvement over the state of the art regret bound of $\tilde{\mathcal{O}}(\Gamma_k(T)\sqrt{T}+\mathbb{E}[\tau]\Gamma_k(T))$ reported in Verma et al. (2022). In particular, for very non-smooth kernels, the information gain grows almost linearly in time, trivializing the existing results. We also validate our theoretical results with simulations.
    Distributed sequential federated learning. (arXiv:2302.00107v1 [stat.ML])
    The analysis of data stored in multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to preventing heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data in separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.
    What's Missing? Learning Hidden Markov Models When the Locations of Missing Observations are Unknown. (arXiv:2203.06527v2 [stat.ML] UPDATED)
    The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis, and it has been successfully applied in a large variety of domains. One of the key reasons for this versatility is the ability of HMMs to deal with missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. In some situations where such assumptions are not feasible, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions of the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology where the missing observation locations are unknown and acyclicity assumptions do not hold, thus presenting a barrier for the application of HMMs in those fields. In this paper we consider a general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations is known). We introduce a generative model of the location omissions and propose two learning methods for this model, a (semi) analytic approach, and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and robustness under model misspecification.
    Diffusion-based Image Translation using Disentangled Style and Content Representation. (arXiv:2209.15264v2 [cs.CV] UPDATED)
    Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer that is not limited to specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during the reverse diffusion. To address this, here we present a novel diffusion-based unsupervised image translation method using disentangled style and content representation. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys of the multihead self-attention layer from a ViT model and use them as the content preservation loss. Then, an image-guided style transfer is performed by matching the [CLS] classification token from the denoised samples and target image, whereas an additional CLIP loss is used for the text-driven style transfer. To further accelerate the semantic change during the reverse diffusion, we also propose a novel semantic divergence loss and resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
    Filtering Context Mitigates Scarcity and Selection Bias in Political Ideology Prediction. (arXiv:2302.00239v1 [cs.LG])
    We propose a novel supervised learning approach for political ideology prediction (PIP) that is capable of predicting out-of-distribution inputs. This problem is motivated by the fact that manual data-labeling is expensive, while self-reported labels are often scarce and exhibit significant selection bias. We propose a novel statistical model that decomposes the document embeddings into a linear superposition of two vectors: a latent neutral \emph{context} vector independent of ideology, and a latent \emph{position} vector aligned with ideology. We train an end-to-end model that has intermediate contextual and positional vectors as outputs. At deployment time, our model predicts labels for input documents by exclusively leveraging the predicted positional vectors. On two benchmark datasets we show that our model is capable of outputting predictions even when trained with as little as 5\% biased data, and is significantly more accurate than the state-of-the-art. Through crowd-sourcing we validate the neutrality of contextual vectors, and show that context filtering results in ideological concentration, allowing for prediction on out-of-distribution examples.
    Adaptive sparseness for correntropy-based robust regression via automatic relevance determination. (arXiv:2302.00082v1 [cs.LG])
    Sparseness and robustness are two important properties for many machine learning scenarios. In the present study, regarding the maximum correntropy criterion (MCC) based robust regression algorithm, we investigate integrating the MCC method with the automatic relevance determination (ARD) technique in a Bayesian framework, so that MCC-based robust regression can be implemented with adaptive sparseness. To be specific, we use an inherent noise assumption from the MCC to derive an explicit likelihood function, and realize the maximum a posteriori (MAP) estimation with the ARD prior by variational Bayesian inference. Compared to the existing robust and sparse L1-regularized MCC regression, the proposed MCC-ARD regression removes the troublesome tuning of the hyper-parameter that controls the regularization strength. Further, MCC-ARD achieves better prediction performance and feature selection capability than L1-regularized MCC, as demonstrated by a noisy and high-dimensional simulation study.
    The Parametric Stability of Well-separated Spherical Gaussian Mixtures. (arXiv:2302.00242v1 [stat.ML])
    We quantify the parameter stability of a spherical Gaussian Mixture Model (sGMM) under small perturbations in distribution space. Namely, we derive the first explicit bound showing that, for a mixture of spherical Gaussians $P$ (sGMM) in a pre-defined model class, every other sGMM in this model class that is close to $P$ in total variation distance has a small parameter distance to $P$. Further, this upper bound only depends on $P$. The motivation for this work lies in providing guarantees for fitting Gaussian mixtures; with this aim in mind, all the constants involved are well defined and distribution free conditions for fitting mixtures of spherical Gaussians. Our results tighten considerably the existing computable bounds, and asymptotically match the known sharp thresholds for this problem.
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v2 [cond-mat.stat-mech] UPDATED)
    Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, there has been growing interest in developing learning models based on tensor networks, which offer a principled understanding of expressive power via entanglement properties and serve as a bridge connecting classical computation and quantum computation. Despite the great potential, however, existing tensor network models for unsupervised machine learning only work as a proof of principle, as their performance is much worse than the standard models such as restricted Boltzmann machines and neural networks. In this Letter, we present autoregressive matrix product states (AMPS), a tensor network model combining matrix product states from quantum many-body physics and autoregressive modeling from machine learning. Our model enjoys the exact calculation of normalized probability and unbiased sampling. We demonstrate the performance of our model using two applications, generative modeling on synthetic and real-world data, and reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor network models and the restricted Boltzmann machines, and is competitive with state-of-the-art neural network models.
    Revisiting Bellman Errors for Offline Model Selection. (arXiv:2302.00141v1 [cs.LG])
    Offline model selection (OMS), that is, choosing the best policy from a set of many policies given only logged data, is crucial for applying offline RL in real-world settings. One idea that has been extensively explored is to select policies based on the mean squared Bellman error (MSBE) of the associated Q-functions. However, previous work has struggled to obtain adequate OMS performance with Bellman errors, leading many researchers to abandon the idea. Through theoretical and empirical analyses, we elucidate why previous work has seen pessimistic results with Bellman errors and identify conditions under which OMS algorithms based on Bellman errors will perform well. Moreover, we develop a new estimator of the MSBE that is more accurate than prior methods and obtains impressive OMS performance on diverse discrete control tasks, including Atari games. We open-source our data and code to enable researchers to conduct OMS experiments more easily.
    Local transfer learning from one data space to another. (arXiv:2302.00160v1 [cs.LG])
    A fundamental problem in manifold learning is to approximate a functional relationship in data chosen randomly from a probability distribution supported on a low dimensional sub-manifold of a high dimensional ambient Euclidean space. The manifold is essentially defined by the data set itself and, typically, designed so that the data is dense on the manifold in some sense. The notion of a data space is an abstraction of a manifold encapsulating the essential properties that allow for function approximation. The problem of transfer learning (meta-learning) is to use the learning of a function on one data set to learn a similar function on a new data set. In terms of function approximation, this means lifting a function on one data space (the base data space) to another (the target data space). This viewpoint enables us to connect some inverse problems in applied mathematics (such as inverse Radon transform) with transfer learning. In this paper we examine the question of such lifting when the data is assumed to be known only on a part of the base data space. We are interested in determining subsets of the target data space on which the lifting can be defined, and how the local smoothness of the function and its lifting are related.
    A Nearly-Optimal Bound for Fast Regression with $\ell_\infty$ Guarantee. (arXiv:2302.00248v1 [cs.DS])
    Given a matrix $A\in \mathbb{R}^{n\times d}$ and a vector $b\in \mathbb{R}^n$, we consider the regression problem with $\ell_\infty$ guarantees: finding a vector $x'\in \mathbb{R}^d$ such that $ \|x'-x^*\|_\infty \leq \frac{\epsilon}{\sqrt{d}}\cdot \|Ax^*-b\|_2\cdot \|A^\dagger\|$ where $x^*=\arg\min_{x\in \mathbb{R}^d}\|Ax-b\|_2$. One popular approach for solving such an $\ell_2$ regression problem is via sketching: picking a structured random matrix $S\in \mathbb{R}^{m\times n}$ with $m\ll n$ such that $SA$ can be quickly computed, and solving the ``sketched'' regression problem $\arg\min_{x\in \mathbb{R}^d} \|SAx-Sb\|_2$. In this paper, we show that in order to obtain such an $\ell_\infty$ guarantee for $\ell_2$ regression, one has to use sketching matrices that are dense. To the best of our knowledge, this is the first use case in which dense sketching matrices are necessary. On the algorithmic side, we prove that there exists a distribution of dense sketching matrices with $m=\epsilon^{-2}d\log^3(n/\delta)$ such that solving the sketched regression problem gives the $\ell_\infty$ guarantee, with probability at least $1-\delta$. Moreover, the matrix $SA$ can be computed in time $O(nd\log n)$. Our row count is nearly-optimal up to logarithmic factors, and significantly improves the result in [Price, Song and Woodruff, ICALP'17], in which a number of rows super-linear in $d$, $m=\Omega(\epsilon^{-2}d^{1+\gamma})$ for $\gamma=\Theta(\sqrt{\frac{\log\log n}{\log d}})$, is required. We also develop a novel analytical framework for $\ell_\infty$ guarantee regression that utilizes the Oblivious Coordinate-wise Embedding (OCE) property introduced in [Song and Yu, ICML'21]. Our analysis is arguably much simpler and more general than [Price, Song and Woodruff, ICALP'17], and it extends to dense sketches for tensor product of vectors.  ( 2 min )
    Implicit Regularization Leads to Benign Overfitting for Sparse Linear Regression. (arXiv:2302.00257v1 [cs.LG])
    In deep learning, often the training process finds an interpolator (a solution with 0 training loss), but the test loss is still low. This phenomenon, known as benign overfitting, is a major mystery that has received a lot of recent attention. One common mechanism for benign overfitting is implicit regularization, where the training process leads to additional properties for the interpolator, often characterized by minimizing certain norms. However, even for a simple sparse linear regression problem $y = \beta^{*\top} x +\xi$ with sparse $\beta^*$, neither the minimum $\ell_1$ nor the minimum $\ell_2$ norm interpolator gives the optimal test loss. In this work, we give a different parametrization of the model which leads to a new implicit regularization effect that combines the benefit of $\ell_1$ and $\ell_2$ interpolators. We show that training our new model via gradient descent leads to an interpolator with near-optimal test loss. Our result is based on careful analysis of the training dynamics and provides another example of implicit regularization effect that goes beyond norm minimization.  ( 2 min )
    Quickest Change Detection for Unnormalized Statistical Models. (arXiv:2302.00250v1 [stat.ML])
    Classical quickest change detection algorithms require modeling pre-change and post-change distributions. Such an approach may not be feasible for various machine learning models because of the complexity of computing the explicit distributions. Additionally, these methods may suffer from a lack of robustness to model mismatch and noise. This paper develops a new variant of the classical Cumulative Sum (CUSUM) algorithm for quickest change detection. This variant is based on Fisher divergence and the Hyv\"arinen score and is called the Score-based CUSUM (SCUSUM) algorithm. The SCUSUM algorithm enables change detection for unnormalized statistical models, i.e., models for which the probability density function contains an unknown normalization constant. The asymptotic optimality of the proposed algorithm is investigated by deriving expressions for average detection delay and the mean running time to a false alarm. Numerical results are provided to demonstrate the performance of the proposed algorithm.  ( 2 min )
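    The idea of a score-based CUSUM recursion can be sketched as follows. This is an illustration of the general construction under stated assumptions, not the paper's implementation: the pre- and post-change models are taken to be one-dimensional Gaussians (chosen only because their Hyvärinen score has a closed form), and the multiplier `lam`, the threshold, and all function names are hypothetical. The increment is the difference of Hyvärinen scores, which depends only on derivatives of the log-density and therefore not on the normalization constant.

```python
def hyvarinen_score_gaussian(x, mu, sigma):
    # 0.5 * (d/dx log p)^2 + d^2/dx^2 log p for the density N(mu, sigma^2).
    grad = -(x - mu) / sigma ** 2
    laplacian = -1.0 / sigma ** 2
    return 0.5 * grad ** 2 + laplacian


def scusum_alarm(samples, pre, post, threshold, lam=1.0):
    """Return the index of the first alarm, or None if no alarm is raised."""
    stat = 0.0
    for n, x in enumerate(samples):
        # Positive drift when the post-change model fits better than the pre-change one.
        increment = lam * (
            hyvarinen_score_gaussian(x, *pre) - hyvarinen_score_gaussian(x, *post)
        )
        stat = max(stat + increment, 0.0)  # CUSUM-style reset at zero
        if stat >= threshold:
            return n
    return None


# Deterministic toy stream: five pre-change samples, then five post-change samples.
stream = [0.0] * 5 + [2.0] * 5
alarm_index = scusum_alarm(stream, pre=(0.0, 1.0), post=(2.0, 1.0), threshold=5.0)
```

    On this toy stream each pre-change sample contributes -2 (so the statistic stays at zero) and each post-change sample contributes +2, so the alarm fires on the third post-change sample.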
    Accelerated First-Order Optimization under Nonlinear Constraints. (arXiv:2302.00316v1 [math.OC])
    We exploit analogies between first-order algorithms for constrained optimization and non-smooth dynamical systems to design a new class of accelerated first-order algorithms for constrained optimization. Unlike Frank-Wolfe or projected gradients, these algorithms avoid optimization over the entire feasible set at each iteration. We prove convergence to stationary points even in a nonconvex setting and we derive rates for the convex setting. An important property of these algorithms is that constraints are expressed in terms of velocities instead of positions, which naturally leads to sparse, local and convex approximations of the feasible set (even if the feasible set is nonconvex). Thus, the complexity tends to grow mildly in the number of decision variables and in the number of constraints, which makes the algorithms suitable for machine learning applications. We apply our algorithms to a compressed sensing and a sparse regression problem, showing that we can treat nonconvex $\ell^p$ constraints ($p<1$) efficiently, while recovering state-of-the-art performance for $p=1$.  ( 2 min )
    Gradient Descent in Neural Networks as Sequential Learning in RKBS. (arXiv:2302.00205v1 [stat.ML])
    The study of Neural Tangent Kernels (NTKs) has provided much needed insight into convergence and generalization properties of neural networks in the over-parametrized (wide) limit by approximating the network using a first-order Taylor expansion with respect to its weights in the neighborhood of their initialization values. This allows neural network training to be analyzed from the perspective of reproducing kernel Hilbert spaces (RKHS), which is informative in the over-parametrized regime, but a poor approximation for narrower networks as the weights change more during training. Our goal is to extend beyond the limits of NTK toward a more general theory. We construct an exact power-series representation of the neural network in a finite neighborhood of the initial weights as an inner product of two feature maps, respectively from data and weight-step space, to feature space, allowing neural network training to be analyzed from the perspective of reproducing kernel {\em Banach} space (RKBS). We prove that, regardless of width, the training sequence produced by gradient descent can be exactly replicated by regularized sequential learning in RKBS. Using this, we present a novel bound on uniform convergence where the iteration count and learning rate play a central role, giving new theoretical insight into neural network training.  ( 2 min )
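    For reference, the first-order expansion that the abstract above describes as the starting point of NTK analysis can be written out as follows. The notation ($f$ for the network output, $w_0$ for the initial weights, $K$ for the kernel) is chosen here for illustration and is not taken from the paper.

```latex
% First-order Taylor expansion of the network around its initial weights w_0
f(x; w) \approx f(x; w_0) + \nabla_w f(x; w_0)^{\top} (w - w_0)

% The linearized model induces the neural tangent kernel
K(x, x') = \nabla_w f(x; w_0)^{\top} \, \nabla_w f(x'; w_0)
```

    The approximation is accurate only while $w$ stays close to $w_0$, which is the limitation for narrow networks that the abstract points to.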
    Training Normalizing Flows with the Precision-Recall Divergence. (arXiv:2302.00628v1 [cs.LG])
    Generative models can have distinct modes of failure, like mode dropping and low quality samples, which cannot be captured by a single scalar metric. To address this, recent works propose evaluating generative models using precision and recall, where precision measures quality of samples and recall measures the coverage of the target distribution. Although a variety of discrepancy measures between the target and estimated distribution are used to train generative models, it is unclear what precision-recall trade-offs are achieved by various choices of the discrepancy measures. In this paper, we show that achieving a specified precision-recall trade-off corresponds to minimising $f$-divergences from a family we call the {\em PR-divergences}. Conversely, any $f$-divergence can be written as a linear combination of PR-divergences and therefore corresponds to minimising a weighted precision-recall trade-off. Further, we propose a novel generative model that is able to train a normalizing flow to minimise any $f$-divergence, and in particular, achieve a given precision-recall trade-off.  ( 2 min )
    Deep learning for $\psi$-weakly dependent processes. (arXiv:2302.00333v1 [stat.ML])
    In this paper, we use deep neural networks to learn $\psi$-weakly dependent processes. The weak-dependence property covers a class of weak dependence conditions such as mixing, association, $\cdots$, and the setting considered here covers many commonly used situations such as regression estimation, time series prediction, time series classification, $\cdots$ The consistency of the empirical risk minimization algorithm in the class of deep neural network predictors is established. We derive a generalization bound and obtain a learning rate, which is less than $\mathcal{O}(n^{-1/\alpha})$, for all $\alpha > 2 $. Applications to binary time series classification and prediction in affine causal models with exogenous covariates are carried out. Some simulation results are provided, as well as an application to the US recession data.  ( 2 min )
    Robust online active learning. (arXiv:2302.00422v1 [stat.ML])
    In many industrial applications, obtaining labeled observations is not straightforward as it often requires the intervention of human experts or the use of expensive testing equipment. In these circumstances, active learning can be highly beneficial in suggesting the most informative data points to be used when fitting a model. Reducing the number of observations needed for model development alleviates both the computational burden required for training and the operational expenses related to labeling. Online active learning, in particular, is useful in high-volume production processes where the decision about the acquisition of the label for a data point needs to be taken within an extremely short time frame. However, despite the recent efforts to develop online active learning strategies, the behavior of these methods in the presence of outliers has not been thoroughly examined. In this work, we investigate the performance of online active linear regression in contaminated data streams. Our study shows that the currently available query strategies are prone to sample outliers, whose inclusion in the training set eventually degrades the predictive performance of the models. To address this issue, we propose a solution that bounds the search area of a conditional D-optimal algorithm and uses a robust estimator. Our approach strikes a balance between exploring unseen regions of the input space and protecting against outliers. Through numerical simulations, we show that the proposed method is effective in improving the performance of online active learning in the presence of outliers, thus expanding the potential applications of this powerful tool.  ( 2 min )
    The geometry of hidden representations of large transformer models. (arXiv:2302.00294v1 [cs.LG])
    Large transformers are powerful architectures for self-supervised analysis of data of various nature, ranging from protein sequences to text to images. In these models, the data representations in the hidden layers live in the same space, and the semantic structure of the dataset emerges by a sequence of functionally identical transformations between one representation and the next. We here characterize the geometric and statistical properties of these representations, focusing on the evolution of such properties across the layers. By analyzing geometric properties such as the intrinsic dimension (ID) and the neighbor composition we find that the representations evolve in a strikingly similar manner in transformers trained on protein language tasks and image reconstruction tasks. In the first layers, the data manifold expands, becoming high-dimensional, and then it contracts significantly in the intermediate layers. In the last part of the model, the ID remains approximately constant or forms a second shallow peak. We show that the semantic complexity of the dataset emerges at the end of the first peak. This phenomenon can be observed across many models trained on diverse datasets. Based on these observations, we suggest using the ID profile as an unsupervised proxy to identify the layers which are more suitable for downstream learning tasks.  ( 2 min )

  • Open

    Creating "Her" using GPT-3 & TTS trained on voice from movie
    submitted by /u/justLV [link] [comments]  ( 40 min )
    A look at Rewind.AI - the search engine for life?
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    How to make an AI learn osu!? (for school project year)
    Hello, I am currently in Highschool and we have a Project Year in computer science. I recently got really interested in AI and especially AI's that are able to learn how to play a specific game. You may or may not know Vedal987 but I am taking huge inspiration from him on how his AI "Neuro-sama" is able to play osu! I want to do something similar with the same game but, I really am a beginner with AI's of that kind so I wanted to ask if you have any learning materials for me or guides on how to start and all the stuff that I could need to actually execute my plan (and if it even is possible within a year) I am an expert in Python (also good in Java and Javascript), I can code and stuff but have no clue how to start out since I've never done this. Thanks in advance :) (feel free to ask me if you dont understand something, i am bad in expressing myself) submitted by /u/iLeg1999 [link] [comments]  ( 41 min )
    We built a mobile app powered by GPT, specialising in daily task assistance 🚀
    Hey folks! 👋 I am Ed, one of the creators of Toucan. Super excited to share what we've been working on 👇 Toucan is a mobile app that helps you get the most out of GPT-3 “on the go” – with a slick mobile UX, powerful wrap-around features and fine-tuned models. If you're keen to learn a bit more about our journey, you can read on below, or if you just want to throw your best queries at GPT, you can download Toucan on the App Store, here: https://apps.apple.com/us/app/toucan-ai-chatbot-assistant/id1665298806 🧩 Our story We started Toucan to make engaging with cutting-edge AI on mobile devices a vastly better experience. No more browser hopping, endless auth challenges, conversation loss or clunky UI interactions. We’ve also built handy supporting functionalities around the AI layer (s…  ( 42 min )
    1-click deploy for your GPT-3 App
    Link - https://github.com/ClerkieAI/berri_ai We made a package that makes it easy for you to quickly deploy your LLM Agent from Google Colab to production (Web App and API Endpoint). How it works? Just install the package, import the function, and run deploy. At the end of the deploy (~10-15mins), you will get: A web app to interact with your agent 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... An endpoint you can query 👉 https://agent-repo-35aa2cf3-a0a1-4cf8-834f-302e5b7fe07e-4524... is obama?" Want a more detailed walkthrough? Check out our loom - https://www.loom.com/share/fd4375b4a77f4ea7802369cb06a16d43 We’re still early so would love your feedback and opinions. Feel free to try us out for free – and if you need help building an agent / want a specific integration, just let us know! https://i.redd.it/xu6a92464ufa1.gif submitted by /u/VideoTo [link] [comments]  ( 41 min )
    AI Dream 124 - Great Relaxing Psychedelic AI Video
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI tool for typography?
    I'm wondering if such a thing exists. Say I have the word "Eyeball" or anything at all. I want to find a font that makes the word "Eyeball" look good, while adding parameters such as - professional, business like, etc - possibly aiding in its suggestions. I already have been using Chat GPT to create concepts with font suggestions, but it would be neat if there was a dedicated tool for this. Is there currently any AI out there that can do that? submitted by /u/Pinkisacoloryes [link] [comments]  ( 41 min )
    is there an ai for helping YOU learn?
    I've heard of all types of AI, but has one ever been conceived for helping you learn faster? submitted by /u/mrmaskfawkes [link] [comments]  ( 41 min )
    Best ChatGPT Alternatives for 2023
    submitted by /u/visimens-technology [link] [comments]  ( 40 min )
    Join the AI Art Revolution! Help Create a Masterpiece and Explore the Boundaries of Technology and Creativity.
    Hey reddit, I'm a student that is taking part in a contest focusing on banned works of art, and the relationship between art and society. After trying my hand at Chat GPT and DALLE2 I've spurred an idea for the topic of this contest. A couple details, the contest is in the form of a comic book page, it's surrounding the themes of obscenity, libel, codes, copyrights, etc. I was considering using AI as the subject matter considering the recent controversies surrounding this new technology. Questions such as whose art is it? The person using the technology, or the person owning it? Is it copyrightable considering it's fed data based on previous artists work? I'm relatively new to this space other than witnessing peoples uses on various social media. So given that, my idea was to have Chat GPT generate a fictional story surrounding the relationship of society and art, tweaking until I have something that can be applied into a comic book form, then feeding parameters into DALLE2 to create comic book panels. Given the controversial nature of utilizing AI for school related essays, homework, etc I was curious what this community thinks of this idea? I felt like it was one two punch, utilizing new technology to create artwork that's relevant to the topic at hand and also has the potential to be "banned" so to speak from participating in the contest. submitted by /u/KylesButler [link] [comments]  ( 42 min )
    AI is terrifying if you actually think about what we can do with it ... it made me question what it means to even be human
    submitted by /u/andytk33 [link] [comments]  ( 43 min )
    I Made a List of The 5 Best AI Detection Tools
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    📌[Searchcolab] League of Legends characters at 80's Dark Fantasy Movie. SafeTensor Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    📌[Searchcolab] What is the reason behind the noticeable similarity in quality between the voice cloning results generated by Elevenlabs and Microsoft VALL-E? Unofficial implementation of Microsoft VALL-E present on Searchcolab.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    A.I Eminem - 'Raw Sh*t'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    The Moravec Paradox is Not Paradoxical: Intelligence from the Primordial Soup of Signs
    submitted by /u/SonntagMorgen [link] [comments]  ( 40 min )
    🌌 Why A.I. is Increasingly a Game Changer in Astronomy and Cosmology
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    Chinese Company Gearing Up to Release Powerful ChatGPT Competitor
    submitted by /u/Mental_Character7367 [link] [comments]  ( 44 min )
    Join the experiment: exploring a new approach to emotional support with the AI Friend/Counselor Bot
    I've been working on a project that uses the OpenAI API to create a friend/counselor bot. The goal of this bot is to provide emotional support and have meaningful conversations about a variety of topics including work, relationships, emotions, health, and more. If you're interested, you can check out the demo of the bot at https://friendbot.ai. I'm looking for alpha testers to help me improve features such as handling long interactions, semantics, etc. Participation in the experiment is anonymous and no data about interactions is stored on the servers. submitted by /u/M-K111 [link] [comments]  ( 41 min )
    AI writes video about itself. The Rise of AI Art: A Creative Revolution.
    submitted by /u/anekii [link] [comments]  ( 40 min )
    Write Faster SQL With AI
    submitted by /u/EloquentPickle [link] [comments]  ( 40 min )
    Pokémon as an 80’s Dark Fantasy Film (AI Generated Art)
    submitted by /u/HooverHooverHoober [link] [comments]  ( 40 min )
    Web 3.0: Era of condensed knowledge
    Yandex (Russian search engine), Google, Bing and Baidu race to release chatbots as a replacement for traditional search. Here is what I think will happen: Most websites will die out. Why? You don't need 10000 recipe websites. You need a culinary Wikipedia, and a chatbot to query it. 99 percent of recipes are not unique. Culinary theory is pretty established. Recipes can be condensed into a branching structure, e.g. pasta with tomatoes, pasta with cheese. Plumbing: pretty much the same. Programming: all you need is documentation, pretty much. You can even analyse open source projects and generate documentation on the fly. Tutorials and best practices can be condensed. News: I was thinking about extracting news from data collected from ambient computing. E.g. Google develops a new feature, and a news article is generated from todo lists, meeting notes, and code. But it's too far out for now. And the list goes on and on. You need 100 Wikipedias to contain all human knowledge. And even if websites refuse to cooperate, it doesn't matter. There are plenty of copies of the web. submitted by /u/nikitastaf1996 [link] [comments]  ( 44 min )
    AI's Effects on/Associations with cognitive processes, memory, and other importance in psychology
    Hello! Does anyone know of a decent scholarly article on something relevant to this that they could share with me? Thanks! :) submitted by /u/EvilPeppermintHelix [link] [comments]  ( 40 min )
    How do you find new ai-based software/programs?
    There are numerous amazing AI-powered software/programs available, such as Wave2Lip on GitHub (which enables you to lip sync a video to any voice audio file), CascadeOur (which generates AI animations), or Google Film (which creates a seamless slow-motion effect from two distant frames). These programs possess tremendous potential, but how can I discover new ones? I'm sure there are many more out there that I am not aware of. Can you suggest any tips on how I can find them? submitted by /u/PM_ME_LIFE_MEANING [link] [comments]  ( 41 min )
  • Open

    Lagrange multiplier setup: Now what?
    Suppose you need to optimize, i.e. maximize or minimize, a function f(x). If this is a practical problem and not a textbook exercise, you probably need to optimize f(x) subject to some constraint on x, say g(x) = 0. Hmm. Optimize one function subject to a constraint given by another function. Oh yeah, Lagrange multipliers! […] Lagrange multiplier setup: Now what? first appeared on John D. Cook.  ( 6 min )
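    A small worked example of the setup described in the teaser above, with a function and constraint chosen here purely for illustration (they are not taken from the post): optimize $f(x, y) = xy$ subject to $g(x, y) = x + y - 1 = 0$.

```latex
% Lagrangian for f(x, y) = xy with constraint g(x, y) = x + y - 1 = 0
\mathcal{L}(x, y, \lambda) = xy - \lambda (x + y - 1)

% Stationarity conditions
\frac{\partial \mathcal{L}}{\partial x} = y - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial y} = x - \lambda = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0

% Solution: x = y = \lambda, and the constraint gives 2\lambda = 1
x = y = \lambda = \tfrac{1}{2}, \qquad f\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{4}
```

    Here the system is linear and can be solved by hand; in general the stationarity conditions form a nonlinear system that has to be solved numerically.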
  • Open

    Predict football punt and kickoff return yards with fat-tailed distribution using GluonTS
    Today, the NFL is continuing their journey to increase the number of statistics provided by the Next Gen Stats Platform to all 32 teams and fans alike. With advanced analytics derived from machine learning (ML), the NFL is creating new ways to quantify football, and to provide fans with the tools needed to increase their […]  ( 10 min )
    Analyze and visualize multi-camera events using Amazon SageMaker Studio Lab
    The National Football League (NFL) is one of the most popular sports leagues in the United States and is the most valuable sports league in the world. The NFL, BioCore, and AWS are committed to advancing human understanding around the diagnosis, prevention, and treatment of sports-related injuries to make the game of football safer. More […]  ( 10 min )
  • Open

    Google Research, 2022 & beyond: ML & computer systems
    Posted by Phitchaya Mangpo Phothilimthana, Staff Research Scientist, and Adam Paszke, Staff Research Scientist, Google Research (This is Part 3 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.) Great machine learning (ML) research requires great systems. With the increasing sophistication of the algorithms and hardware in use today and with the scale at which they run, the complexity of the software necessary to carry out day-to-day tasks only increases. In this post, we provide an overview of the numerous advances made across Google this past year in systems for ML that enable us to support the serving and training of complex models while easing the complexity of implementation for end users. This b…  ( 97 min )
    Open Source Vizier: Towards reliable and flexible hyperparameter and blackbox optimization
    Posted by Xingyou (Richard) Song, Research Scientist, and Chansoo Lee, Software Engineer, Google Research, Brain Team Google Vizier is the de-facto system for blackbox optimization over objective functions and hyperparameters across Google, having serviced some of Google’s largest research efforts and optimized a wide range of products (e.g., Search, Ads, YouTube). For research, it has not only reduced language model latency for users, designed computer architectures, accelerated hardware, assisted protein discovery, and enhanced robotics, but also provided a reliable backend interface for users to search for neural architectures and evolve reinforcement learning algorithms. To operate at the scale of optimizing thousands of users’ critical systems and tuning millions of machine learning…  ( 91 min )
  • Open

    [D] I'm at a crossroads: Bayesian methods VS Reinforcement Learning, which to choose?
    I know it looks like a very absurd comparison. The reason why I ask this is because I'm taking a masters of ML(that wasn't exactly cheap) and while the basic concepts are covered in the mandatory path, now I have to choose an elective, in this case between "Bayesian Methods" and "Reinforcement Learning". So what I'd like to ask you is which of both concepts/techniques are more used or relevant to work in the Data Science industry, meaning which one is going to help me more in that job or make a more positive impression in my resume. Thx in advance! submitted by /u/fuscarili [link] [comments]  ( 46 min )
    [P] Domestic Violence Dataset
    Hi, I am working on a project and for that I need a Twitter Domestic Violence Dataset. Basically I need a dataset with domestic violence tweets against women. I have searched Kaggle and other websites but had no luck. Plus, I tried using Snscrape, but I need some ideas for phrases related to domestic violence so I can get some tweets using them. I tried "Domestic Violence" , "My husband tried to kill me" and am looking for more. Help is appreciated. submitted by /u/Naive-Aioli4849 [link] [comments]  ( 43 min )
    [D] Workflow chair for AI conference
    Hi! Does anyone here have experience working as a workflow chair for major conferences? What are the duties and how much does it pay? (I heard that it's a paid role) submitted by /u/Expensive-Track [link] [comments]  ( 42 min )
    [p] I built an open source platform to deploy computationally intensive Python functions as serverless jobs, with no timeouts
    Hi friends! I ran into this problem enough times at my last few jobs that I built a tool to solve it. I spent many hours building Docker containers for my Python functions, as many of the data science modules required building C libraries (since they significantly speed up compute-intensive routines, such as math calculations). Deploying the containers to AWS Lambda or Fargate (if the processes required more CPU or memory or were >15 minutes) and wiring functions to talk to each other using queues, databases, and blob storage made iterating on the actual code, which wasn't even that complex most of the time, slow. I made cakework https://github.com/usecakework/cakework, a platform that lets you spin up your Python functions as serverless, production-scale backends with a single command. Using the client SDK, you submit requests, check status, and get results. You can also specify the amount of CPU (up to 16 cores) and memory (up to 128GB) for each individual request, which is helpful when your data size and complexity varies across different requests. A common pattern that I built cakework for is doing file processing for ML: - ingest data from some source daily, or in response to an external event (data written to blob storage) - run my function (often using pandas/numpy/scipy) - write results to storage, update database - track failures and re-run/fix It's open source <3. Here are some fun examples to get you started: https://docs.cakework.com/examples Would love to hear your thoughts! submitted by /u/seattleite849 [link] [comments]  ( 47 min )
    [P] Time series outlier / anomaly detection
    I have traffic speed time series data for each day of the week over several months, with data samples about every 30 seconds. I'd like to find periods of time (subsequences) where the speed is much slower than usual. Any recommendations for algorithms that would be well suited to this problem? Thanks submitted by /u/dudester_el [link] [comments]  ( 44 min )
    [D] Querying with multiple vectors during embedding nearest neighbor search?
    Are there tools or techniques that permit you to joint query using more than one query vector? Use case: iterative ANN search refinement, where I start with a seed vector, select matches, and re-query with more examples to improve the search results. I tried doing this with FAISS, but it performs a "batch query" that returns a separate set of results for each query vector (not a joint query). submitted by /u/mostlyhydrogen [link] [comments]  ( 42 min )
    [D] Do high leverage points affect Neural Net and Tree-based model?
    I know they can affect linear regression badly but given the fact that neural net and tree-based models can approximate non-linear complex functions, I don't think the high leverage points would be a problem. Just curious about your opinion whether my thinking makes sense submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] ImageNet normalization vs [-1, 1] normalization
    For ImageNet classification, there are two common ways of normalizing the input images: - Normalize to [-1, 1] using an affine transformation (2*(x/255) - 1). - Normalize using ImageNet mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225). I observe that the first one is more common in TensorFlow codebases (including Jax models with TensorFlow data processing, e.g. the official Vision Transformers code), whereas the second is ubiquitous in PyTorch codebases. I tried to find empirical comparisons of the two, but there doesn't seem to be any. Which one is better in your opinion? I guess the performance shouldn't be too different, but still it's interesting to hear your experience. submitted by /u/netw0rkf10w [link] [comments]  ( 44 min )
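    The two conventions described above can be written side by side. This is a minimal pure-Python sketch on a single RGB pixel with 8-bit values; the function names are made up for illustration, while the mean and std constants are the ones quoted in the post.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pm1(pixel):
    # Affine map from [0, 255] to [-1, 1], identical for every channel.
    return tuple(2.0 * (c / 255.0) - 1.0 for c in pixel)


def normalize_imagenet(pixel):
    # Scale to [0, 1], then standardize each channel with the ImageNet statistics.
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )


black, white = (0, 0, 0), (255, 255, 255)
pm1_range = (normalize_pm1(black), normalize_pm1(white))
imagenet_range = (normalize_imagenet(black), normalize_imagenet(white))
```

    Both are per-channel affine maps of the raw pixel values, so they differ only in offset and scale: the first gives exactly [-1, 1], the second roughly [-2.12, 2.64] across channels. In principle a network's first layer can absorb that difference, which would be consistent with the poster's guess that performance should be similar.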
    [D] PC takes a long time to execute code, possibility to use a cloud/external device?
    Hello people, I am currently attending a Data Science course and to finish I have to write a paper about a project that I am currently working on. I write the code in VSCode and I use .ipynb notebooks. So I am basically training a few ML models after a long data preprocessing which worked out fine. But as soon as I run my hyperparameter tuning code, my PC takes a lot of time. Right now I am running hyperparameter tuning for RandomForest and it has already been running for 21 hours. Is there any possibility for me to run my code somewhere else? I read about Heroku, but that seems to be more than what I am looking for. I am getting a bit nervous, because I want to get this paper done. The worst case is that I have to buy a new PC. Thank you so much! submitted by /u/Emergency-Dig-5262 [link] [comments]  ( 45 min )
    [N] Microsoft integrates GPT 3.5 into Teams
    Official blog post: https://www.microsoft.com/en-us/microsoft-365/blog/2023/02/01/microsoft-teams-premium-cut-costs-and-add-ai-powered-productivity/ Given the amount of money they pumped into OpenAI, it's not surprising that you'd see it integrated into their products. I do wonder how this will work in highly regulated fields (finance, law, medicine, education). submitted by /u/bikeskata [link] [comments]  ( 47 min )
    [D] Why do LLMs like InstructGPT and ChatGPT use RL instead of supervised learning to learn from the user-ranked examples?
    Aligned LLMs such as InstructGPT and ChatGPT are trained via supervised fine-tuning after the initial self-supervised pretraining. Then, the researchers train a reward model on responses ranked by humans. If I understand correctly, they let the LLM generate responses that humans have to rank on a scale from 1-5. Then, they train a reward model (I suppose in a supervised fashion?) on these ranked outputs. Once that's done, they use reinforcement learning (RL) with proximal policy optimization (PPO) to update the LLM. My question is: why do they use RL with PPO for this last step? Why don't they fine-tune the LLM using regular supervised learning, where the human-ranked outputs represent the labels? Since these are labels in the range 1-5, this could be a ranking or ordinal regression loss for supervised learning. submitted by /u/alpha-meta [link] [comments]  ( 47 min )
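For reference, the reward-model step mentioned in the question is commonly described as a pairwise comparison loss rather than a regression onto a 1-5 score. A minimal sketch, assuming scalar rewards have already been produced by some model for a preferred and a rejected response (the function name is hypothetical, and this covers only the reward model, not the PPO step):

```python
import math


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Return -log(sigmoid(r_preferred - r_rejected)) in a numerically stable form."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the preferred response scores higher, large otherwise.
print(pairwise_ranking_loss(2.0, 0.0))
print(pairwise_ranking_loss(0.0, 2.0))
```

The loss depends only on the difference between the two rewards, so the reward model learns a relative ordering and the absolute scale of the scores is left free.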
    [D] Inconsistent Featurespace in Data
    Hi colleagues! I am working on a model for which I have a dataset consisting of 2 data sources. Problem is that one datastream starts in 2017 and the other only in 2022. Feature spaces from those 2 data streams are different. I am wondering if there is a methodology to follow which allows me to use both data streams for training even though one starts way later than the other. Or am I forced to drop the newer one? (just 2022 data from two sources is too small for me to train on) Thank you! submitted by /u/pahalie [link] [comments]  ( 43 min )
    [D] Commercial Use of a Model that has been trained using Human3.6M
    I wanted to use the Learnable Triangulation model in a commercial project. The source code itself is under MIT licensing. However, the dataset they have used is Human3.6M, which states that the license is "FREE OF CHARGE FOR ACADEMIC USE ONLY". Yet, recent court rulings (in the US) state that models can use copyrighted data during training, and the results are no longer bound by that copyright (e.g. Google Books). Does the same apply here? submitted by /u/mfarahmand98 [link] [comments]  ( 42 min )
    [D] Global Optimum of K-Means Cost Function
    I've recently started reading up on classical ML and I have a question about K-Means. More concretely, I am confused about the uniqueness of the globally optimal solution of the K-Means cost function. Let's state the problem formally below, extracted from Bishop's Pattern Recognition and Machine Learning book, exercise 9.1. Consider the $K$-means algorithm discussed in Section 9.1. Show that as a consequence of there being a finite number of possible assignments for the set of discrete indicator variables $r_{nk}$, and that for each such assignment there is a unique optimum for the $\boldsymbol{\mu}_k$, the $K$-means algorithm must converge after a finite number of iterations. I wrote an answer [here](https://stats.stackexchange.com/questions/603327/question-on-the-proof-of-convergence-of-k-means) detailing the proof of why Lloyd's algorithm converges, but I think I still do not understand why Lloyd's algorithm does not converge to a global minimum. Which mathematical theorem or insight am I missing here? I think that optimizing both the assignments and the centroids of K-Means at the same time is non-convex and hence there are many local minima. We can use brute force to search for the global minimum, but of course that is exponential in the number of data points. On the other hand, Lloyd's algorithm optimizes the two (greedily) in alternation, and hence is guaranteed to find a local minimum of the cost function? submitted by /u/healthymonkey100 [link] [comments]  ( 45 min )
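The gap between "converges" and "converges to the global minimum" shows up on a tiny example. The sketch below is a plain-Python version of Lloyd's algorithm (function names and the four-point data set are chosen for illustration only). It is run on the corners of a 4 x 1 rectangle from two different initializations.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(points, initial_centroids, max_iters=100):
    """Alternate assignment and centroid updates until assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return centroids, assignments, cost


points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, _, cost_good = lloyd(points, [(0, 0.5), (4, 0.5)])  # left/right split
_, _, cost_bad = lloyd(points, [(2, 0), (2, 1)])       # bottom/top split
print(cost_good, cost_bad)
```

Both runs stop after the first update because neither step can lower the cost any further, yet the second fixed point has a much higher cost than the first. Each step never increases the cost and there are finitely many assignments, which gives termination. Nothing in that argument says the fixed point reached is the best one, so the result depends on the initialization.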
    [P] [R] A simplistic UI to edit images with Stable Diffusion and InstructPix2Pix
    https://preview.redd.it/ut4us5251rfa1.png?width=2000&format=png&auto=webp&s=dbf79c3832b20287203faa97e5c1303472bdbc22 Currently, the UI supports a picture upload and uses InstructPix2Pix to edit it. Also, it uses upscaling models for quality enhancements. More models are coming soon. The goal is to provide a way for non-ML people to use diffusion-based image editing through simplistic app design. Web demo: https://diffground.com/ submitted by /u/radi-cho [link] [comments]  ( 43 min )
    [D] How Will Open Source Alternatives Compete With GPT3?
    To clarify, I'm not talking about ChatGPT here. I've been testing outputs from GPT-3 davinci003 against alternatives in terms of output quality, relevance, and ability to understand "instruct" (versus vanilla autocompletion). I tried these: AI21 Jurassic 178B, NeoX 20B, GPT J 6B, and FairSeq 13B, as well as GPT-3 davinci002 and GPT-3 davinci001. Of course, I didn't expect the smaller models to be on par with GPT-3, but I was surprised at how much better GPT3 davinci 003 performed compared to AI21's 178B model. AI21's Jurassic 178B seems to be comparable to GPT3 davinci 001. Does this mean that only well-funded corporations will be able to train general-purpose LLMs? It seems to me that just having a large model doesn't do much, it's also about several iterations of training and feedback. How are open source alternatives going to be able to compete? (I'm not in the ML or CS field, just an amateur who enjoys using these models) submitted by /u/noellarkin [link] [comments]  ( 45 min )
    [R] Sentence autoencoder
    Any suggestions for a sentence auto-encoder? I want to learn the vector representation of a sentence and reconstruct the sentence itself. I used a plain LSTM with self-attention in the encoder and decoder architecture (no cross-attention), but the results are not good enough. I cannot use cross-attention, i.e. the decoder will not have access to all the outputs of the encoder, only to the bottleneck latent vector. BART is pre-trained in a similar manner, but I don't know if we can pre-train that model to fit my use case. This is just a module of other work: after pre-training the sentence-to-sentence auto-encoder, I need to add some more modules in between, so I need the encoder and decoder to be separable, which I think cannot be done in BART either. Any direction would be much appreciated. Thank you submitted by /u/Bishwa12 [link] [comments]  ( 43 min )
    [D] Apple's ane-transformers - experiences?
    I'm using Huggingface's transformers regularly for experiments, but I plan to deploy some of the models to iOS. I have found the ml-ane-transformers repo from Apple, which shows how transformers can be rewritten to have much better performance on Apple's devices. There's an example of DistilBERT implemented in that optimized way. As I plan to deploy transformers to iOS, I started thinking about this. I'm hoping some of you already have experience with this, so we can discuss: Has anyone tried this themselves? Do they actually see the improvements in performance on iOS? I'm using Huggingface's transformer models in my experiments. How much work do you think there is to rewrite a model in this optimized way? It's very difficult to train transformers from scratch (especially if they're big :) ), so I'm fine-tuning on top of pre-trained models on Huggingface. Is it possible to use weights from pretrained Huggingface models with Apple's reference code? How difficult is it? submitted by /u/alkibijad [link] [comments]  ( 43 min )
  • Open

    Best Reinforcement Learning Papers from the past 1-2 years
    I am searching for the best Reinforcement Learning Papers from the past 1-2 years. My special focus is on Communications, but as we all know a lot of approaches can be applied in different fields. I would appreciate any recommendations :) submitted by /u/GolemX14 [link] [comments]  ( 41 min )
    Multi-Agent Stable Baselines
    I want to extend an implementation that currently uses stable baselines 3 from a single-agent into a multi-agent system. As far as I can tell, stable baselines isn't really suited for this. Does anyone have experience with multi-agent systems in stable baselines or with switching from stable baselines to RLlib? submitted by /u/tessherelurkingnow [link] [comments]  ( 41 min )
    "Distillation Policy Optimization", Ma et al 2023
    https://arxiv.org/abs/2302.00533 submitted by /u/OutOfCharm [link] [comments]  ( 41 min )
    Where do the new weights in PPO come from?
    I am looking at the pseudocode for PPO given in SpinningUp over here - https://preview.redd.it/sx3kx08k2pfa1.png?width=1013&format=png&auto=webp&s=85511fe4cccda77975bd4731558f3936c9f5f522 I understand that $\theta_k$ are the old weights. I am a little lost regarding the weight $\theta$. Is this the new weight? Here is what I guess is happening (I could be completely wrong): (1) PPO is based on a generic actor-critic algorithm. (2) Therefore the new weight $\theta$ is learned as per conventional actor-critic algorithms. (3) The old weights $\theta_k$ and new weights $\theta$ are stored and then used in the clip objective. (4) The clip objective uses these 2 weights to come up with $\theta_{k+1}$. submitted by /u/Academic-Rent7800 [link] [comments]  ( 43 min )
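For orientation, in the Spinning Up formulation $\theta_{k+1}$ is defined as the maximizer over $\theta$ of the clipped objective. So $\theta$ is the free variable the optimizer moves (starting from $\theta_k$), while $\theta_k$ stays fixed as the policy that collected the data. A minimal per-sample sketch of that objective, assuming log-probabilities and an advantage estimate are already available (the function name and default `clip_eps` are illustrative):

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO-clip objective.

    logp_new: log pi_theta(a|s), changes as theta is optimized
    logp_old: log pi_theta_k(a|s), fixed during the update
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# At theta == theta_k the ratio is 1 and the objective equals the advantage.
print(clipped_surrogate(-1.0, -1.0, advantage=0.5))
# A large ratio with a positive advantage is capped at (1 + clip_eps) * advantage.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

Several gradient ascent steps on the average of this quantity over the collected batch move $\theta$ away from $\theta_k$. Whatever $\theta$ ends up as after those steps is taken as $\theta_{k+1}$.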
    Noob question: why is this trivial problem not accordingly trivial to train? (PPO)
    I'm new to ML/RL and trying to build some intuition on how the fancy algorithms are made useful in practice. So far I was surprised how pairing SB3 + a gym environment frequently does nothing unless the right parameters are hit. So I've written a trivial mock environment: it "thinks" of a random target floating point number between -1 and 1 and a random state number in the same range. The observation (-2 to 2) is the difference between both. The (continuous, -1 to 1) action is added to the state. The reward is the absolute value of the distance between the current state and the target times -1, with a little negative bit added to "make the life painful" and enforce finishing. Done is signaled once converged within 0.001, which is achievable within 1 to max. 2 steps depending on the reset state. So my thinking is: this is as easy as it gets. It does not matter how the algorithm explores or what it sees, the input-to-output relation is always exactly the same. The actor simply needs to learn to forward the input to the output, which is learnable from every possible action. So it should train instantly with almost random choices of most hyperparameters: the reward discount would not really matter, neither would most of the others, right? (I use PPO from SB3.) Yet I almost could not get it to work: it would behave wildly differently with different n_env*n_steps sizes, learning and randomly unlearning again even with small learning_rates set. The cases I could find manually that work would take total_timesteps in the range of millions to finish consistently, and even then they would finish mostly in 2-4 steps rather than 1-2 in evaluation. Is such a problem really that sensitive or am I doing it wrong? Is there a way to actually train it within tens of thousands rather than millions of steps with PPO and if so, how do I find it from the training metrics? How could it be improved to converge more precisely? For anyone wanting to play around, code is attached below. 
submitted by /u/EvilButFluffy [link] [comments]  ( 45 min )
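The poster's attached code is not part of this feed entry. The following is only a sketch of the environment as described in the post, written without any gym dependency. The class name, the `step_penalty` value, and the sign convention of the observation (target minus state) are assumptions made for illustration.

```python
import random


class TargetEnv:
    """Toy environment: move a scalar state onto a hidden scalar target."""

    def __init__(self, seed=None, tolerance=1e-3, step_penalty=0.01):
        self.rng = random.Random(seed)
        self.tolerance = tolerance
        self.step_penalty = step_penalty  # assumed value for the "little negative bit"
        self.target = 0.0
        self.state = 0.0

    def reset(self):
        self.target = self.rng.uniform(-1.0, 1.0)
        self.state = self.rng.uniform(-1.0, 1.0)
        return self.target - self.state  # observation in [-2, 2]

    def step(self, action):
        action = max(-1.0, min(1.0, action))  # continuous action in [-1, 1]
        self.state += action
        error = abs(self.target - self.state)
        reward = -error - self.step_penalty
        done = error < self.tolerance
        return self.target - self.state, reward, done


def run_oracle(env, max_steps=10):
    """Play the ideal policy (action = observation) and return the episode length."""
    obs = env.reset()
    for step in range(1, max_steps + 1):
        obs, _, done = env.step(obs)
        if done:
            return step
    return max_steps


print([run_oracle(TargetEnv(seed=s)) for s in range(10)])
```

The oracle policy simply forwards the observation as the action, which is the mapping the post expects PPO to learn. It gives a baseline episode length of one or two steps to compare a trained agent against.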
  • Open

    Three Cheers: GFN Thursday Celebrates Third Anniversary With 25 New Games
    Cheers to another year of cloud gaming! GeForce NOW celebrates its third anniversary with a look at how far cloud gaming has come, a community celebration and 25 new games supported in February. Members can celebrate all month long, starting with a sweet Dying Light 2 reward and support for nine more games this week, Read article >  ( 7 min )
    NVIDIA A100 Aces Throughput, Latency Results in Key Inference Benchmark for Financial Services Industry
    NVIDIA A100 Tensor Core GPUs running on Supermicro servers have captured leading results for inference in the latest STAC-ML Markets benchmark, a key technology performance gauge for the financial services industry. The results show NVIDIA demonstrating unrivaled throughput — serving up thousands of inferences per second on the most demanding models — and top latency Read article >  ( 6 min )
    Survey Reveals Financial Industry’s Top 4 AI Priorities for 2023
    For several years, NVIDIA has been working with some of the world’s leading financial institutions to develop and execute a wide range of rapidly evolving AI strategies. For the past three years, we’ve asked them to tell us collectively what’s on the top of their minds. Sometimes the results are just what we thought they’d Read article >  ( 6 min )
  • Open

    Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization. (arXiv:2210.12057v2 [cs.LG] UPDATED)
    We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines in runtime.
    Improving Score-based Diffusion Models by Enforcing the Underlying Score Fokker-Planck Equation. (arXiv:2210.04296v3 [cs.LG] UPDATED)
    Score-based generative models learn a family of noise-conditional score functions corresponding to the data density perturbed with increasingly large amounts of noise. These perturbed data densities are tied together by the Fokker-Planck equation (FPE), a partial differential equation (PDE) governing the spatial-temporal evolution of a density undergoing a diffusion process. In this work, we derive a corresponding equation, called the score FPE that characterizes the noise-conditional scores of the perturbed data densities (i.e., their gradients). Surprisingly, despite impressive empirical performance, we observe that scores learned via denoising score matching (DSM) do not satisfy the underlying score FPE. We prove that satisfying the FPE is desirable as it improves the likelihood and the degree of conservativity. Hence, we propose to regularize the DSM objective to enforce satisfaction of the score FPE, and we show the effectiveness of this approach across various datasets.
    Prioritizing Samples in Reinforcement Learning with Reducible Loss. (arXiv:2208.10483v2 [cs.LG] UPDATED)
    Most reinforcement learning algorithms take advantage of an experience replay buffer to repeatedly train on samples the agent has observed in the past. Not all samples carry the same amount of significance and simply assigning equal importance to each of the samples is a naïve strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from a sample. We define the learn-ability of a sample as the steady decrease of the training loss associated with this sample over time. We develop an algorithm to prioritize samples with high learn-ability, while assigning lower priority to those that are hard-to-learn, typically caused by noise or stochasticity. We empirically show that our method is more robust than random sampling and also better than just prioritizing with respect to the training loss, i.e. the temporal difference loss, which is used in prioritized experience replay.
    ERA-Solver: Error-Robust Adams Solver for Fast Sampling of Diffusion Probabilistic Models. (arXiv:2301.12935v2 [cs.LG] UPDATED)
    Though denoising diffusion probabilistic models (DDPMs) have achieved remarkable generation results, the low sampling efficiency of DDPMs still limits further applications. Since DDPMs can be formulated as diffusion ordinary differential equations (ODEs), various fast sampling methods can be derived from solving diffusion ODEs. However, we notice that previous sampling methods with fixed analytical form are not robust to the error in the noise estimated from pretrained diffusion models. In this work, we construct an error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams numerical method that consists of a predictor and a corrector. Different from the traditional predictor based on explicit Adams methods, we leverage a Lagrange interpolation function as the predictor, which is further enhanced with an error-robust strategy to adaptively select the Lagrange bases with lower error in the estimated noise. Experiments on CIFAR-10, LSUN-Church, and LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14, 9.42, and 9.69 Fréchet Inception Distance (FID) for image generation, with only 10 network evaluations.
    Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers. (arXiv:2212.04325v2 [eess.AS] UPDATED)
    Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid modes, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypotheses generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain 40% - 70% relative training time speedup with a small degradation in performance.
    Accelerating Material Design with the Generative Toolkit for Scientific Discovery. (arXiv:2207.03928v4 [cs.LG] UPDATED)
    With the growing availability of data within various scientific domains, generative models hold enormous potential to accelerate scientific discovery. They harness powerful representations learned from datasets to speed up the formulation of novel hypotheses with the potential to impact material discovery broadly. We present the Generative Toolkit for Scientific Discovery (GT4SD). This extensible open-source library enables scientists, developers, and researchers to train and use state-of-the-art generative models to accelerate scientific discovery focused on material design.
    Learning to reject meets OOD detection: Are all abstentions created equal?. (arXiv:2301.12386v2 [cs.LG] UPDATED)
    Learning to reject (L2R) and out-of-distribution (OOD) detection are two classical problems, each of which involve detecting certain abnormal samples: in L2R, the goal is to detect "hard" samples on which to abstain, while in OOD detection, the goal is to detect "outlier" samples not drawn from the training distribution. Intriguingly, despite being developed in parallel literatures, both problems share a simple baseline: the maximum softmax probability (MSP) score. However, there is limited understanding of precisely how these problems relate. In this paper, we formally relate these problems, and show how they may be jointly solved. We first show that while MSP is theoretically optimal for L2R, it can be theoretically sub-optimal for OOD detection in some important practical settings. We then characterize the Bayes-optimal classifier for a unified formulation that generalizes both L2R and OOD detection. Based on this, we design a plug-in approach for learning to abstain on both inlier and OOD samples, while constraining the total abstention budget. Experiments on benchmark OOD datasets demonstrate that our approach yields competitive classification and OOD detection performance compared to baselines from both literatures.
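The shared baseline named in the abstract, the maximum softmax probability (MSP) score, can be sketched in a few lines. This is only the baseline, not the paper's plug-in approach. The function names and the threshold value are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp_decision(logits, threshold=0.5):
    """Return (predicted_class or None, msp); None means abstain."""
    probs = softmax(logits)
    msp = max(probs)
    if msp < threshold:
        return None, msp  # low confidence: reject / flag as possible outlier
    return probs.index(msp), msp


print(msp_decision([5.0, 0.0, 0.0]))  # confident prediction of class 0
print(msp_decision([0.0, 0.0, 0.0]))  # uniform scores, so the model abstains
```

The same scalar score serves both uses: thresholding it for "hard" in-distribution samples is the L2R reading, and thresholding it for samples unlike the training data is the OOD reading.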
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels) should improve PL performance and convergence. Surprisingly and unexpectedly, we find that soft-labels targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.
    Spectral Maps for Learning on Subgraphs. (arXiv:2205.14938v4 [cs.LG] UPDATED)
    In graph learning, maps between graphs and their subgraphs frequently arise. For instance, when coarsening or rewiring operations are present along the pipeline, one needs to keep track of the corresponding nodes between the original and modified graphs. Classically, these maps are represented as binary node-to-node correspondence matrices and used as-is to transfer node-wise features between the graphs. In this paper, we argue that simply changing this map representation can bring notable benefits to graph learning tasks. Drawing inspiration from recent progress in geometry processing, we introduce a spectral representation for maps that is easy to integrate into existing graph learning models. This spectral representation is a compact and straightforward plug-in replacement and is robust to topological changes of the graphs. Remarkably, the representation exhibits structural properties that make it interpretable, drawing an analogy with recent results on smooth manifolds. We demonstrate the benefits of incorporating spectral maps in graph learning pipelines, addressing scenarios where a node-to-node map is not well defined, or in the absence of exact isomorphism. Our approach bears practical benefits in knowledge distillation and hierarchical learning, where we show comparable or improved performance at a fraction of the computational cost.
    STEEL: Singularity-aware Reinforcement Learning. (arXiv:2301.13152v2 [stat.ML] UPDATED)
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolutely continuous assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires some minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.
    Active Sequential Two-Sample Testing. (arXiv:2301.12616v2 [cs.LG] UPDATED)
    Two-sample testing tests whether the distributions generating two samples are identical. We pose the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. We devise the first active sequential two-sample testing framework that not only sequentially but also actively queries sample labels to address the problem. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is given by a classification model. The classification model is adaptively updated and then used to guide an active query scheme called bimodal query to label sample features in the regions with high dependency between the feature variables and the label variables. The theoretical contributions in the paper include proof that our framework produces an anytime-valid $p$-value; and, under reachable conditions and a mild assumption, the framework asymptotically generates a minimum normalized log-likelihood ratio statistic that a passive query scheme can only achieve when the feature variable and the label variable have the highest dependence. Lastly, we provide a query-switching (QS) algorithm to decide when to switch from passive query to active query and adapt bimodal query to increase the testing power of our test. Extensive experiments justify our theoretical contributions and the effectiveness of QS.
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.
    Fairness-aware Vision Transformer via Debiased Self-Attention. (arXiv:2301.13803v1 [cs.CV])
    Vision Transformer (ViT) has recently gained significant interest in solving computer vision (CV) problems due to its capability of extracting informative features and modeling long-range dependencies through the self-attention mechanism. To fully realize the advantages of ViT in real-world applications, recent works have explored the trustworthiness of ViT, including its robustness and explainability. However, another desideratum, fairness, has not yet been adequately addressed in the literature. We establish that the existing fairness-aware algorithms (primarily designed for CNNs) do not perform well on ViT. This necessitates developing our novel framework via Debiased Self-Attention (DSA). DSA is a fairness-through-blindness approach that enforces ViT to eliminate spurious features correlated with the sensitive attributes for bias mitigation. Notably, adversarial examples are leveraged to locate and mask the spurious features in the input image patches. In addition, DSA utilizes an attention weights alignment regularizer in the training objective to encourage learning informative features for target prediction. Importantly, our DSA framework leads to improved fairness guarantees over prior works on multiple prediction tasks without compromising target prediction performance.
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d. samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.
    FedPass: Privacy-Preserving Vertical Federated Deep Learning with Adaptive Obfuscation. (arXiv:2301.12623v2 [cs.DC] UPDATED)
    Vertical federated learning (VFL) allows an active party with labeled features to leverage auxiliary features from the passive parties to improve model performance. Concerns about private feature and label leakage in both the training and inference phases of VFL have drawn wide research attention. In this paper, we propose a general privacy-preserving vertical federated deep learning framework called FedPass, which leverages adaptive obfuscation to protect the feature and label simultaneously. Strong privacy-preserving capabilities for private features and labels are theoretically proved (in Theorems 1 and 2). Extensive experimental results with different datasets and network architectures also justify the superiority of FedPass against existing methods in light of its near-optimal trade-off between privacy and model performance.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    On the Global Convergence of Fitted Q-Iteration with Two-layer Neural Network Parametrization. (arXiv:2211.07675v2 [cs.LG] UPDATED)
    Deep Q-learning based algorithms have been applied successfully in many decision making problems, while their theoretical foundations are not as well understood. In this paper, we study a Fitted Q-Iteration with two-layer ReLU neural network parameterization, and find the sample complexity guarantees for the algorithm. Our approach estimates the Q-function in each iteration using a convex optimization problem. We show that this approach achieves a sample complexity of $\tilde{\mathcal{O}}(1/\epsilon^{2})$, which is order-optimal. This result holds for countable state spaces and does not require any assumptions such as a linear or low rank structure on the MDP.
    ChatGPT or Human? Detect and Explain. Explaining Decisions of Machine Learning Model for Detecting Short ChatGPT-generated Text. (arXiv:2301.13852v1 [cs.CL])
    ChatGPT has the ability to generate grammatically flawless and seemingly-human replies to different types of questions from various domains. The number of its users and of its applications is growing at an unprecedented rate. Unfortunately, use and abuse come hand in hand. In this paper, we study whether a machine learning model can be effectively trained to accurately distinguish between original human and seemingly human (that is, ChatGPT-generated) text, especially when this text is short. Furthermore, we employ an explainable artificial intelligence framework to gain insight into the reasoning behind the model trained to differentiate between ChatGPT-generated and human-generated text. The goal is to analyze the model's decisions and determine if any specific patterns or characteristics can be identified. Our study focuses on short online reviews, conducting two experiments comparing human-generated and ChatGPT-generated text. The first experiment involves ChatGPT text generated from custom queries, while the second experiment involves text generated by rephrasing original human-generated reviews. We fine-tune a Transformer-based model and use it to make predictions, which are then explained using SHAP. We compare our model with a perplexity score-based approach and find that disambiguation between human and ChatGPT-generated reviews is more challenging for the ML model when using rephrased text. However, our proposed approach still achieves an accuracy of 79%. Using explainability, we observe that ChatGPT's writing is polite, without specific details, using fancy and atypical vocabulary, impersonal, and typically it does not express feelings.
    Learning useful representations for shifting tasks and distributions. (arXiv:2212.07346v2 [cs.LG] UPDATED)
    Does the dominant approach to learn representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are richer than those obtained with a single optimization episode. We support this thesis with simple theoretical arguments and with experiments utilizing an apparently naïve ensembling technique: concatenating the representations obtained from multiple training episodes using the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained with a single training run. This proves that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
    Adaptively Weighted Data Augmentation Consistency Regularization for Robust Optimization under Concept Shift. (arXiv:2210.01891v2 [cs.CV] UPDATED)
    Concept shift is a prevailing problem in natural tasks like medical image segmentation where samples usually come from different subpopulations with variant correlations between features and labels. One common type of concept shift in medical image segmentation is the "information imbalance" between label-sparse samples with few (if any) segmentation labels and label-dense samples with plentiful labeled pixels. Existing distributionally robust algorithms have focused on adaptively truncating/down-weighting the "less informative" (i.e., label-sparse in our context) samples. To exploit data features of label-sparse samples more efficiently, we propose an adaptively weighted online optimization algorithm -- AdaWAC -- to incorporate data augmentation consistency regularization in sample reweighting. Our method introduces a set of trainable weights to balance the supervised loss and unsupervised consistency regularization of each sample separately. At the saddle point of the underlying objective, the weights assign label-dense samples to the supervised loss and label-sparse samples to the unsupervised consistency regularization. We provide a convergence guarantee by recasting the optimization as online mirror descent on a saddle point problem. Our empirical results demonstrate that AdaWAC not only enhances the segmentation performance and sample efficiency but also improves the robustness to concept shift on various medical image segmentation tasks with different UNet-style backbones.
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.
    Weak Proxies are Sufficient and Preferable for Fairness with Missing Sensitive Attributes. (arXiv:2210.03175v2 [cs.LG] UPDATED)
    Evaluating fairness can be challenging in practice because the sensitive attributes of data are often inaccessible due to privacy constraints. The go-to approach that the industry frequently adopts is using off-the-shelf proxy models to predict the missing sensitive attributes, e.g. Meta [Alao et al., 2021] and Twitter [Belli et al., 2022]. Despite its popularity, three important questions remain unanswered: (1) Is directly using proxies efficacious in measuring fairness? (2) If not, is it possible to accurately evaluate fairness using proxies only? (3) Given the ethical controversy over inferring user private information, is it possible to only use weak (i.e. inaccurate) proxies in order to protect privacy? First, our theoretical analyses show that directly using proxy models can give a false sense of (un)fairness. Second, we develop an algorithm that is able to measure fairness (provably) accurately with only three properly identified proxies. Third, we show that our algorithm allows the use of only weak proxies (e.g. with only 68.85% accuracy on COMPAS), adding an extra layer of protection on user privacy. Experiments validate our theoretical analyses and show our algorithm can effectively measure and mitigate bias. Our results imply a set of practical guidelines for practitioners on how to use proxies properly. Code is available at github.com/UCSC-REAL/fair-eval.
    Deep Reinforcement Learning for Cryptocurrency Trading: Practical Approach to Address Backtest Overfitting. (arXiv:2209.05559v6 [q-fin.ST] UPDATED)
    Designing profitable and reliable trading strategies is challenging in the highly volatile cryptocurrency market. Existing works applied deep reinforcement learning methods and optimistically reported increased profits in backtesting, which may suffer from the false positive issue due to overfitting. In this paper, we propose a practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning. First, we formulate the detection of backtest overfitting as a hypothesis test. Then, we train the DRL agents, estimate the probability of overfitting, and reject the overfitted agents, increasing the chance of good trading performance. Finally, on 10 cryptocurrencies over a testing period from 05/01/2022 to 06/27/2022 (during which the crypto market crashed two times), we show that the less overfitted deep reinforcement learning agents have a higher return than that of more overfitted agents, an equal weight strategy, and the S&P DBM Index (market benchmark), offering confidence in possible deployment to a real market.
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.
    Can Persistent Homology provide an efficient alternative for Evaluation of Knowledge Graph Completion Methods?. (arXiv:2301.12929v2 [cs.LG] UPDATED)
    In this paper we present a novel method, $\textit{Knowledge Persistence}$ ($\mathcal{KP}$), for faster evaluation of Knowledge Graph (KG) completion approaches. Current ranking-based evaluation is quadratic in the size of the KG, leading to long evaluation times and consequently a high carbon footprint. $\mathcal{KP}$ addresses this by representing the topology of the KG completion methods through the lens of topological data analysis, concretely using persistent homology. The characteristics of persistent homology allow $\mathcal{KP}$ to evaluate the quality of the KG completion looking only at a fraction of the data. Experimental results on standard datasets show that the proposed metric is highly correlated with ranking metrics (Hits@N, MR, MRR). Performance evaluation shows that $\mathcal{KP}$ is computationally efficient: In some cases, the evaluation time (validation+test) of a KG completion method has been reduced from 18 hours (using Hits@10) to 27 seconds (using $\mathcal{KP}$), and on average (across methods & data) reduces the evaluation time (validation+test) by $\approx$ $\textbf{99.96}\%$.
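    The best-case figures quoted in this abstract can be checked with a few lines of arithmetic. The variable names are illustrative; only the 18 hours and 27 seconds come from the abstract.

```python
# Figures quoted in the abstract: 18 hours with Hits@10 versus 27 seconds with KP.
baseline_seconds = 18 * 60 * 60   # 64,800 seconds
kp_seconds = 27

speedup = baseline_seconds / kp_seconds           # how many times faster
reduction = 1 - kp_seconds / baseline_seconds     # fraction of evaluation time saved
reduction_percent = round(100 * reduction, 2)     # as a percentage, two decimals
```

    This single case works out to a 2400-fold speedup and a reduction of about 99.96%, which agrees with the average reduction the abstract reports across methods and datasets.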
    Bias in Machine Learning Models Can Be Significantly Mitigated by Careful Training: Evidence from Neuroimaging Studies. (arXiv:2205.13421v2 [cs.LG] UPDATED)
    Despite the great promise that machine learning has offered in many fields of medicine, it has also raised concerns about potential biases and poor generalization across genders, age distributions, races and ethnicities, hospitals, and data acquisition equipment and protocols. In the current study, and in the context of three brain diseases, we provide evidence which suggests that when properly trained, machine learning models can generalize well across diverse conditions and do not necessarily suffer from bias. Specifically, by using multi-study magnetic resonance imaging consortia for diagnosing Alzheimer's disease, schizophrenia, and autism spectrum disorder, we find that well-trained models have a high area-under-the-curve (AUC) on subjects across different subgroups pertaining to attributes such as gender, age, racial groups, and different clinical studies and are unbiased under multiple fairness metrics such as demographic parity difference, equalized odds difference, equal opportunity difference etc. We find that models that incorporate multi-source data from demographic, clinical, genetic factors and cognitive scores are also unbiased. These models have better predictive AUC across subgroups than those trained only with imaging features but there are also situations when these additional features do not help.
    Revisiting Hyperparameter Tuning with Differential Privacy. (arXiv:2211.01852v2 [cs.LG] UPDATED)
    Hyperparameter tuning is a common practice in the application of machine learning but is a typically ignored aspect in the literature on privacy-preserving machine learning due to its negative effect on the overall privacy parameter. In this paper, we aim to tackle this fundamental yet challenging problem by providing an effective hyperparameter tuning framework with differential privacy. The proposed method allows us to adopt a broader hyperparameter search space and even to perform a grid search over the whole space, since its privacy loss parameter is independent of the number of hyperparameter candidates. Interestingly, it instead correlates with the utility gained from hyperparameter searching, revealing an explicit and mandatory trade-off between privacy and utility. Theoretically, we show that the additional privacy loss incurred by hyperparameter tuning is upper-bounded by the square root of the gained utility. However, we note that the additional privacy loss bound would empirically scale like the square root of the logarithm of the utility term, benefiting from the design of the doubling step.
    Optimal Solutions for Joint Beamforming and Antenna Selection: From Branch and Bound to Graph Neural Imitation Learning. (arXiv:2206.05576v2 [eess.SP] UPDATED)
    This work revisits the joint beamforming (BF) and antenna selection (AS) problem, as well as its robust beamforming (RBF) version under imperfect channel state information (CSI). Such problems arise due to various reasons, e.g., the costly nature of the radio frequency (RF) chains and energy/resource-saving considerations. The joint (R)BF&AS problem is a mixed integer and nonlinear program, and thus finding optimal solutions is often costly, if not outright impossible. The vast majority of the prior works tackled these problems using techniques such as continuous approximations, greedy methods, and supervised machine learning -- yet these approaches do not ensure optimality or even feasibility of the solutions. The main contribution of this work is threefold. First, an effective branch and bound (B&B) framework for solving the problems of interest is proposed. Leveraging existing BF and RBF solvers, it is shown that the B&B framework guarantees global optimality of the considered problems. Second, to expedite the potentially costly B&B algorithm, a machine learning (ML)-based scheme is proposed to help skip intermediate states of the B&B search tree. The learning model features a graph neural network (GNN)-based design that is resilient to a commonly encountered challenge in wireless communications, namely, the change of problem size (e.g., the number of users) across the training and test stages. Third, comprehensive performance characterizations are presented, showing that the GNN-based method retains the global optimality of B&B with provably reduced complexity, under reasonable conditions. Numerical simulations also show that the ML-based acceleration can often achieve an order-of-magnitude speedup relative to B&B.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
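    The two regimes in this abstract can be compared numerically. The sketch below evaluates the stated orders $n/(mT)$ and $n^2/(m^2 T)$ with all constants dropped; the function names and the sample values of $n$, $m$, and $T$ are illustrative assumptions, not taken from the paper.

```python
def rate_many_trajectories(n, m, T):
    """Order of the worst-case error rate n / (m T), stated for m >~ n."""
    return n / (m * T)


def lower_bound_few_trajectories(n, m, T):
    """Order of the lower bound n^2 / (m^2 T), stated for m <~ n."""
    return n ** 2 / (m ** 2 * T)


n, T = 50, 100

# At the boundary m = n the two expressions coincide at 1 / T.
at_boundary = (rate_many_trajectories(n, n, T), lower_bound_few_trajectories(n, n, T))

# With few trajectories (m = 5 < n) the lower bound exceeds n / (m T),
# the rate one would expect from m * T independent examples, by a factor n / m.
ratio = lower_bound_few_trajectories(n, 5, T) / rate_many_trajectories(n, 5, T)
```

    The factor $n/m$ is the price of having too few trajectories: it shrinks to 1 as $m$ approaches $n$, which matches the abstract's statement that the error rate eventually behaves as if all examples were independent.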
    Strategyproof Decision-Making in Panel Data Settings and Beyond. (arXiv:2211.14236v2 [econ.EM] UPDATED)
    We propose a framework for decision-making in the presence of strategic agents with panel data, a standard setting in econometrics and statistics where one gets noisy, repeated measurements of multiple units. We consider a setup where there is a pre-intervention period, when the principal observes the outcomes of each unit, after which the principal uses these observations to assign a treatment to each unit. Our model can be thought of as a generalization of the synthetic controls and synthetic interventions frameworks, where units (or agents) may strategically manipulate pre-intervention outcomes to receive a more desirable intervention. We identify necessary and sufficient conditions under which a strategyproof mechanism that assigns interventions in the post-intervention period exists. Under a latent factor model assumption, we show that whenever a strategyproof mechanism exists, there is one with a simple closed form. In the setting where there is a single treatment and control (i.e., no other interventions), we establish that there is always a strategyproof mechanism, and provide an algorithm for learning such a mechanism. For the setting of multiple interventions, we provide an algorithm for learning a strategyproof mechanism, if there exists a sufficiently large gap in rewards between the different interventions. Finally, we empirically evaluate our model using real-world panel data collected from product sales over 18 months. We find that our methods compare favorably to baselines which do not take strategic interactions into consideration -- even in the presence of model misspecification. Along the way, we prove impossibility results for multi-class strategic classification, which may be of independent interest.
    Robust Reinforcement Learning in Continuous Control Tasks with Uncertainty Set Regularization. (arXiv:2207.02016v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is recognized as lacking generalization and robustness under environmental perturbations, which excessively restricts its application for real-world robotics. Prior work claimed that adding regularization to the value function is equivalent to learning a robust policy with uncertain transitions. Although the regularization-robustness transformation is appealing for its simplicity and efficiency, it is still lacking in continuous control tasks. In this paper, we propose a new regularizer named Uncertainty Set Regularizer (USR), by formulating the uncertainty set on the parameter space of the transition function. In particular, USR is flexible enough to be plugged into any existing RL framework. To deal with unknown uncertainty sets, we further propose a novel adversarial approach to generate them based on the value function. We evaluate USR on the Real-world Reinforcement Learning (RWRL) benchmark, demonstrating improvements in the robust performance for perturbed testing environments.
    Can we achieve robustness from data alone?. (arXiv:2207.11727v2 [cs.LG] UPDATED)
    We introduce a meta-learning algorithm for adversarially robust classification. The proposed method tries to be as model agnostic as possible and optimizes a dataset prior to its deployment in a machine learning system, aiming to effectively erase its non-robust features. Once the dataset has been created, in principle no specialized algorithm (besides standard gradient descent) is needed to train a robust model. We formulate the data optimization procedure as a bi-level optimization problem on kernel regression, with a class of kernels that describe infinitely wide neural nets (Neural Tangent Kernels). We present extensive experiments on standard computer vision benchmarks using a variety of different models, demonstrating the effectiveness of our method, while also pointing out its current shortcomings. In parallel, we revisit prior work that also focused on the problem of data optimization for robust classification \citep{Ily+19}, and show that being robust to adversarial attacks after standard (gradient descent) training on a suitable dataset is more challenging than previously thought.
    Meta-Learning via Classifier(-free) Diffusion Guidance. (arXiv:2210.08942v2 [cs.LG] UPDATED)
    We introduce meta-learning algorithms that perform zero-shot weight-space adaptation of neural network models to unseen tasks. Our methods repurpose the popular generative image synthesis techniques of natural language guidance and diffusion models to generate neural network weights adapted for tasks. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing multi-task and meta-learning methods in a series of zero-shot learning experiments on our Meta-VQA dataset.
    Fine-tuning or top-tuning? Transfer learning with pretrained features and fast kernel methods. (arXiv:2209.07932v2 [cs.LG] UPDATED)
    The impressive performance of deep learning architectures is associated with a massive increase in model complexity. Millions of parameters need to be tuned, with training and inference time scaling accordingly. But is massive fine-tuning necessary? In this paper, focusing on image classification, we consider a simple transfer learning approach exploiting pretrained convolutional features as input for a fast kernel method. We refer to this approach as top-tuning, since only the kernel classifier is trained. By performing more than 2500 training processes we show that this top-tuning approach provides comparable accuracy w.r.t. fine-tuning, with a training time that is between one and two orders of magnitude smaller. These results suggest that top-tuning provides a useful alternative to fine-tuning in small/medium datasets, especially when training efficiency is crucial.
    Py-Feat: Python Facial Expression Analysis Toolbox. (arXiv:2104.03509v3 [cs.CV] UPDATED)
    Studying facial expressions is a notoriously difficult endeavor. Recent advances in the field of affective computing have yielded impressive progress in automatically detecting facial expressions from pictures and videos. However, much of this work has yet to be widely disseminated in social science domains such as psychology. Current state of the art models require considerable domain expertise that is not traditionally incorporated into social science training programs. Furthermore, there is a notable absence of user-friendly and open-source software that provides a comprehensive set of tools and functions that support facial expression research. In this paper, we introduce Py-Feat, an open-source Python toolbox that provides support for detecting, preprocessing, analyzing, and visualizing facial expression data. Py-Feat makes it easy for domain experts to disseminate and benchmark computer vision models and also for end users to quickly process, analyze, and visualize face expression data. We hope this platform will facilitate increased use of facial expression data in human behavior research.
    A Dynamic Programming Algorithm for Finding an Optimal Sequence of Informative Measurements. (arXiv:2109.11808v4 [cs.LG] UPDATED)
    An informative measurement is the most efficient way to gain information about an unknown state. We present a first-principles derivation of a general-purpose dynamic programming algorithm that returns an optimal sequence of informative measurements by sequentially maximizing the entropy of possible measurement outcomes. This algorithm can be used by an autonomous agent or robot to decide where best to measure next, planning a path corresponding to an optimal sequence of informative measurements. The algorithm is applicable to states and controls that are either continuous or discrete, and agent dynamics that is either stochastic or deterministic; including Markov decision processes and Gaussian processes. Recent results from the fields of approximate dynamic programming and reinforcement learning, including on-line approximations such as rollout and Monte Carlo tree search, allow the measurement task to be solved in real time. The resulting solutions include non-myopic paths and measurement sequences that can generally outperform, sometimes substantially, commonly used greedy approaches. This is demonstrated for a global search task, where on-line planning for a sequence of local searches is found to reduce the number of measurements in the search by approximately half. A variant of the algorithm is derived for Gaussian processes for active sensing.
    PINCH: An Adversarial Extraction Attack Framework for Deep Learning Models. (arXiv:2209.06300v2 [cs.CR] UPDATED)
    Adversarial extraction attacks constitute an insidious threat against Deep Learning (DL) models in which an adversary aims to steal the architecture, parameters, and hyper-parameters of a targeted DL model. Existing extraction attack literature has observed varying levels of attack success for different DL models and datasets, yet the underlying cause(s) behind their susceptibility often remain unclear; understanding them would help facilitate creating secure DL systems. In this paper we present PINCH: an efficient and automated extraction attack framework capable of designing, deploying, and analyzing extraction attack scenarios across heterogeneous hardware platforms. Using PINCH, we perform extensive experimental evaluation of extraction attacks against 21 model architectures to explore new extraction attack scenarios and further attack staging. Our findings show (1) key extraction characteristics whereby particular model configurations exhibit strong resilience against specific attacks, (2) even partial extraction success enables further staging for other adversarial attacks, and (3) equivalent stolen models uncover differences in expressive power, yet exhibit similar captured knowledge.
    Benchmarking Large Language Models for News Summarization. (arXiv:2301.13848v1 [cs.CL])
    Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
    Reinforcement learning and decision making via single-photon quantum walks. (arXiv:2301.13669v1 [quant-ph])
    Variational quantum algorithms represent a promising approach to quantum machine learning where classical neural networks are replaced by parametrized quantum circuits. Here, we present a variational approach to quantize projective simulation (PS), a reinforcement learning model aimed at interpretable artificial intelligence. Decision making in PS is modeled as a random walk on a graph describing the agent's memory. To implement the quantized model, we consider quantum walks of single photons in a lattice of tunable Mach-Zehnder interferometers. We propose variational algorithms tailored to reinforcement learning tasks, and we show, using an example from transfer learning, that the quantized PS learning model can outperform its classical counterpart. Finally, we discuss the role of quantum interference for training and decision making, paving the way for realizations of interpretable quantum learning agents.
    What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness?. (arXiv:2210.05577v2 [cs.LG] UPDATED)
    The adversarial vulnerability of neural nets, and subsequent techniques to create robust models have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow us to generate adversarial examples in a "training-free" fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the "lazy" regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics fall much earlier into the "lazy" regime and manifest a much stronger form of the well-known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Cutting Plane Selection with Analytic Centers and Multiregression. (arXiv:2212.07231v3 [math.OC] UPDATED)
    Cutting planes are a crucial component of state-of-the-art mixed-integer programming solvers, with the choice of which subset of cuts to add being vital for solver performance. We propose new distance-based measures to qualify the value of a cut by quantifying the extent to which it separates relevant parts of the relaxed feasible set. For this purpose, we use the analytic centers of the relaxation polytope or of its optimal face, as well as alternative optimal solutions of the linear programming relaxation. We assess the impact of the choice of distance measure on root node performance and throughout the whole branch-and-bound tree, comparing our measures against those prevalent in the literature. Finally, by a multi-output regression, we predict the relative performance of each measure, using static features readily available before the separation process. Our results indicate that analytic center-based methods help to significantly reduce the number of branch-and-bound nodes needed to explore the search space and that our multiregression approach can further improve on any individual method.
    Direct-Effect Risk Minimization for Domain Generalization. (arXiv:2211.14594v2 [cs.LG] UPDATED)
    We study the problem of out-of-distribution (o.o.d.) generalization where spurious correlations of attributes vary across training and test domains. This is known as the problem of correlation shift and has raised concerns about the reliability of machine learning. In this work, we introduce the concepts of direct and indirect effects from causal inference to the domain generalization problem. We argue that models that learn direct effects minimize the worst-case risk across correlation-shifted domains. To eliminate the indirect effects, our algorithm consists of two stages: in the first stage, we learn an indirect-effect representation by minimizing the prediction error of domain labels using the representation and the class label; in the second stage, we remove the indirect effects learned in the first stage by matching each data point with another data point that has a similar indirect-effect representation but a different class label. We also propose a new model selection method by matching the validation set in the same way, which is shown to improve the generalization performance of existing models on correlation-shifted datasets. Experiments on 5 correlation-shifted datasets and the DomainBed benchmark verify the effectiveness of our approach.
    Difformer: Empowering Diffusion Models on the Embedding Space for Text Generation. (arXiv:2212.09412v2 [cs.CL] UPDATED)
    Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies and analyze the challenges between the continuous data space and the embedding space which have not been carefully explored. Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the loss function. Secondly, as the norm of embeddings varies between popular and rare words, adding the same noise scale will lead to sub-optimal results. In addition, we find the normal level of noise causes insufficient training of the model. To address the above challenges, we propose Difformer, an embedding diffusion model based on Transformer, which consists of three essential modules including an additional anchor loss function, a layer normalization module for embeddings, and a noise factor to the Gaussian noise. Experiments on two seminal text generation tasks including machine translation and text summarization show the superiority of Difformer over compared embedding diffusion baselines.
    Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences. (arXiv:2210.11794v2 [cs.LG] UPDATED)
    Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by the predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full-attention, when important token correlations are multiple hops away. To combine advantages of both the efficiency of sparse transformer and the expressiveness of full-attention Transformer, we propose Diffuser, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, besides attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA, with 1.67× memory savings compared to state-of-the-art benchmarks, which demonstrates superior performance of Diffuser in both expressiveness and efficiency aspects.
    Forecasting COVID-19 cases using Statistical Models and Ontology-based Semantic Modelling: A real time data analytics approach. (arXiv:2206.02795v2 [q-bio.PE] UPDATED)
    COVID-19 is the most prominent issue that many countries face today. The frequent changes in infections, recoveries, and deaths reflect the dynamic nature of this pandemic. It is crucial to predict the spreading rate of the virus in order to support accurate decision making on preventing infection and on tracking and controlling virus transmission in the community. We develop a prediction model using statistical time series models such as SARIMA and FBProphet to monitor the daily active, recovered, and death cases of COVID-19 accurately. Then, using various details of each individual patient (such as height, weight, and gender), we design a set of rules using the Semantic Web Rule Language and some mathematical models for dealing with COVID-19 infected cases on an individual basis. After combining all the models, we develop a COVID-19 ontology and run various SPARQL queries on it, which accumulate the risk factors and provide appropriate diagnoses, precautions, and preventive suggestions for COVID-19 patients. Comparing the performance of SARIMA and FBProphet, we observe that the SARIMA model performs better in forecasting COVID-19 cases. For individual COVID-19 case prediction, approximately 497 individual samples have been tested and classified into five different COVID classes: Having COVID, No COVID, High Risk COVID case, Medium to High Risk case, and Control needed case.
    Recipro-CAM: Gradient-free reciprocal class activation map. (arXiv:2209.14074v2 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) have become one of the most popular and prominent deep learning architectures for computer vision, but their black-box nature hides the internal prediction process. For this reason, AI practitioners have turned to explainable AI to make model behavior interpretable. In particular, class activation map (CAM) and Grad-CAM based methods have shown promising results, but they have architectural limitations or a gradient computation burden. To resolve these issues, Score-CAM has been suggested as a gradient-free method; however, it requires more execution time than CAM or Grad-CAM based methods. Therefore, we propose a lightweight architecture and gradient-free Reciprocal CAM (Recipro-CAM) by spatially masking the extracted feature maps to exploit the correlation between activation maps and network outputs. With the proposed method, we achieved gains of 1.78 - 3.72% in the ResNet family compared to Score-CAM in the Average Drop-Coherence-Complexity (ADCC) metric, excluding VGG-16 (1.39% drop). In addition, Recipro-CAM exhibits a saliency map generation rate similar to Grad-CAM and is approximately 148 times faster than Score-CAM. The source code of Recipro-CAM is available at our data analysis framework.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a hindsight-observable setting as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.
    Policy Gradient for s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13589v1 [cs.LG])
    We present a novel robust policy gradient method (RPG) for s-rectangular robust Markov Decision Processes (MDPs). We are the first to derive the adversarial kernel in a closed form and demonstrate that it is a rank-one perturbation of the nominal kernel. This allows us to derive an RPG that is similar to the one used in non-robust MDPs, except with a robust Q-value function and an additional correction term. Both robust Q-values and correction terms are efficiently computable, so the time complexity of our method matches that of non-robust MDPs, which is significantly faster than existing black-box methods.
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2106.02600v3 [cs.LG] UPDATED)
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful "black-box" models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.
    Partitioning Distributed Compute Jobs with Reinforcement Learning and Graph Neural Networks. (arXiv:2301.13799v1 [cs.LG])
    From natural language processing to genome sequencing, large-scale machine learning models are bringing advances to a broad range of fields. Many of these models are too large to be trained on a single machine, and instead must be distributed across multiple devices. This has motivated the research of new compute and network systems capable of handling such tasks. In particular, recent work has focused on developing management schemes which decide how to allocate distributed resources such that some overall objective, such as minimising the job completion time (JCT), is optimised. However, such studies omit explicit consideration of how much a job should be distributed, usually assuming that maximum distribution is desirable. In this work, we show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate. To address this, we propose PAC-ML (partitioning for asynchronous computing with machine learning). PAC-ML leverages a graph neural network and reinforcement learning to learn how much to partition computation graphs such that the number of jobs which meet arbitrary user-defined JCT requirements is maximised. In experiments with five real deep learning computation graphs on a recently proposed optical architecture across four user-defined JCT requirement distributions, we demonstrate PAC-ML achieving up to 56.2% lower blocking rates in dynamic job arrival settings than the canonical maximum parallelisation strategy used by most prior works.
    An Empirical Study of Quantum Dynamics as a Ground State Problem with Neural Quantum States. (arXiv:2206.09241v2 [quant-ph] UPDATED)
    We consider the Feynman-Kitaev formalism applied to a spin chain described by the transverse field Ising model. This formalism consists of building a Hamiltonian whose ground state encodes the time evolution of the spin chain at discrete time steps. To find this ground state, variational wave functions parameterised by artificial neural networks -- also known as neural quantum states (NQSs) -- are used. Our work focuses on assessing, in the context of the Feynman-Kitaev formalism, two properties of NQSs: expressivity (the possibility that variational parameters can be set to values such that the NQS is faithful to the true ground state of the system) and trainability (the process of reaching said values). We find that the considered NQSs are capable of accurately approximating the true ground state of the system, i.e., they are expressive enough ansätze. However, extensive hyperparameter tuning experiments show that, empirically, reaching the set of values for the variational parameters that correctly describe the ground state becomes ever more difficult as the number of time steps increases, because the true ground state becomes more entangled and the probability distribution starts to spread across the Hilbert space canonical basis.
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.
    Don't Explain Noise: Robust Counterfactuals for Randomized Ensembles. (arXiv:2205.14116v2 [cs.LG] UPDATED)
    Counterfactual explanations describe how to modify a feature vector in order to flip the outcome of a trained classifier. Obtaining robust counterfactual explanations is essential to provide valid algorithmic recourse and meaningful explanations. We study the robustness of explanations of randomized ensembles, which are always subject to algorithmic uncertainty even when the training data is fixed. We formalize the generation of robust counterfactual explanations as a probabilistic problem and show the link between the robustness of ensemble models and the robustness of base learners. We develop a practical method with good empirical performance and support it with theoretical guarantees for ensembles of convex base learners. Our results show that existing methods give surprisingly low robustness: the validity of naive counterfactuals is below 50% on most data sets and can fall to 20% on problems with many features. In contrast, our method achieves high robustness with only a small increase in the distance from counterfactual explanations to their initial observations.
    PADL: Language-Directed Physics-Based Character Control. (arXiv:2301.13868v1 [cs.LG])
    Developing systems that can synthesize natural and life-like motions for simulated characters has long been a focus for computer animation. But in order for these systems to be useful for downstream applications, they must not only produce high-quality motions, but also provide an accessible and versatile interface through which users can direct a character's behaviors. Natural language provides a simple-to-use and expressive medium for specifying a user's intent. Recent breakthroughs in natural language processing (NLP) have demonstrated effective use of language-based interfaces for applications such as image generation and program synthesis. In this work, we present PADL, which leverages recent innovations in NLP in order to take steps towards developing language-directed controllers for physics-based character animation. PADL allows users to issue natural language commands for specifying both high-level tasks and low-level skills that a character should perform. We present an adversarial imitation learning approach for training policies to map high-level language commands to low-level controls that enable a character to perform the desired task and skill specified by a user's commands. Furthermore, we propose a multi-task aggregation method that leverages a language-based multiple-choice question-answering approach to determine high-level task objectives from language commands. We show that our framework can be applied to effectively direct a simulated humanoid character to perform a diverse array of complex motor skills.
    Generating Synthetic Mixed-type Longitudinal Electronic Health Records for Artificial Intelligent Applications. (arXiv:2112.12047v2 [cs.LG] UPDATED)
    The recent availability of electronic health records (EHRs) has provided enormous opportunities to develop artificial intelligence (AI) algorithms. However, patient privacy has become a major concern that limits data sharing across hospital settings and subsequently hinders the advances in AI. Synthetic data, which benefits from the development and proliferation of generative models, has served as a promising substitute for real patient EHR data. However, the current generative models are limited as they only generate a single type of clinical data for a synthetic patient, i.e., either continuous-valued or discrete-valued. To mimic the nature of clinical decision-making, which encompasses various data types/sources, in this study we propose a generative adversarial network (GAN) entitled EHR-M-GAN which simultaneously synthesizes mixed-type timeseries EHR data. EHR-M-GAN is capable of capturing the multidimensional, heterogeneous, and correlated temporal dynamics in patient trajectories. We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients, and performed privacy risk evaluation of the proposed model. EHR-M-GAN has demonstrated its superiority over state-of-the-art benchmarks for synthesizing clinical timeseries with high fidelity, while addressing the limitations regarding data types and dimensionality in the current generative models. Notably, prediction models for outcomes of intensive care performed significantly better when training data was augmented with the addition of EHR-M-GAN-generated timeseries. EHR-M-GAN may have use in developing AI algorithms in resource-limited settings, lowering the barrier for data acquisition while preserving patient privacy.
    Disentangling Model Multiplicity in Deep Learning. (arXiv:2206.08890v2 [cs.LG] UPDATED)
    Model multiplicity is a well-known but poorly understood phenomenon that undermines the generalisation guarantees of machine learning models. It appears when two models with similar training-time performance differ in their predictions and real-world performance characteristics. This observed 'predictive' multiplicity (PM) also implies elusive differences in the internals of the models, their 'representational' multiplicity (RM). We introduce a conceptual and experimental setup for analysing RM by measuring activation similarity via singular vector canonical correlation analysis (SVCCA). We show that certain differences in training methods systematically result in larger RM than others and evaluate RM and PM over a finite sample as predictors for generalizability. We further correlate RM with PM measured by the variance in i.i.d. and out-of-distribution test predictions in four standard image data sets. Finally, instead of attempting to eliminate RM, we call for its systematic measurement and maximal exposure.
    FedBC: Calibrating Global and Local Models via Federated Learning Beyond Consensus. (arXiv:2206.10815v3 [cs.LG] UPDATED)
    In this work, we quantitatively calibrate the performance of global and local models in federated learning through a multi-criterion optimization-based framework, which we cast as a constrained program. The objective of a device is its local objective, which it seeks to minimize while satisfying nonlinear constraints that quantify the proximity between the local and the global model. By considering the Lagrangian relaxation of this problem, we develop a novel primal-dual method called Federated Learning Beyond Consensus (FedBC). Theoretically, we establish that FedBC converges to a first-order stationary point at rates that match the state of the art, up to an additional error term that depends on a tolerance parameter introduced to scalarize the multi-criterion formulation. Finally, we demonstrate that FedBC balances the global and local model test accuracy metrics across a suite of datasets (Synthetic, MNIST, CIFAR-10, Shakespeare), achieving competitive performance with the state of the art.
    Toward Efficient Gradient-Based Value Estimation. (arXiv:2301.13757v1 [cs.LG])
    Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has a large condition number. To resolve the adverse effect of poor conditioning of MSBE on gradient-based methods, we propose a low-complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.
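The ill-conditioning claim can be illustrated on a toy problem. The sketch below assumes a tabular value function, a hypothetical two-state chain that deterministically alternates between its states, and a uniform state weighting; none of these choices come from the paper. Under those assumptions MSBE(v) = sum_s d(s) (r(s) + γ(Pv)(s) - v(s))^2 is quadratic in v with Hessian 2 (I - γP)^T D (I - γP), and its condition number grows quickly as γ approaches 1.

```python
import math


def msbe_hessian(P, d, gamma):
    """Hessian of the tabular MSBE, H = 2 * (I - gamma*P)^T * diag(d) * (I - gamma*P).

    P is a row-stochastic transition matrix (list of lists) and d is the
    state weighting. Both are illustrative inputs, not taken from the paper.
    """
    n = len(P)
    # A = I - gamma * P maps a value vector to its (reward-free) Bellman residual.
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return [[2.0 * sum(d[k] * A[k][i] * A[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]


def condition_number_2x2(H):
    """Condition number of a symmetric positive-definite 2x2 matrix."""
    trace = H[0][0] + H[1][1]
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    largest = 0.5 * (trace + math.sqrt(trace * trace - 4.0 * det))
    smallest = det / largest  # avoids cancellation when the matrix is nearly singular
    return largest / smallest


if __name__ == "__main__":
    P = [[0.0, 1.0], [1.0, 0.0]]   # hypothetical alternating two-state chain
    d = [0.5, 0.5]                 # uniform state weighting (assumption)
    for gamma in (0.5, 0.9, 0.99):
        H = msbe_hessian(P, d, gamma)
        print(gamma, condition_number_2x2(H))
```

For this particular chain the eigenvalues of I - γP are 1 - γ and 1 + γ, so the condition number equals ((1 + γ)/(1 - γ))^2. Plain gradient descent on such a loss needs a number of iterations that scales with this quantity, which is one way to see why a Gauss-Newton-like direction can help.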
    Salient Conditional Diffusion for Defending Against Backdoor Attacks. (arXiv:2301.13862v1 [cs.LG])
    We propose a novel algorithm, Salient Conditional Diffusion (Sancdifi), a state-of-the-art defense against backdoor attacks. Sancdifi uses a denoising diffusion probabilistic model (DDPM) to degrade an image with noise and then recover said image using the learned reverse diffusion. Critically, we compute saliency map-based masks to condition our diffusion, allowing for stronger diffusion on the most salient pixels by the DDPM. As a result, Sancdifi is highly effective at diffusing out triggers in data poisoned by backdoor attacks. At the same time, it reliably recovers salient features when applied to clean data. This performance is achieved without requiring access to the model parameters of the Trojan network, meaning Sancdifi operates as a black-box defense.
    FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning. (arXiv:2205.07246v3 [cs.LG] UPDATED)
    Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data effectively since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and the model's learning status. Based on the analysis, we propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model to make diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The code can be found at https://github.com/microsoft/Semi-supervised-learning.
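One way to picture a confidence threshold that follows the model's learning status is sketched below. It is an illustrative assumption rather than the rule stated in the abstract: a global threshold tracks an exponential moving average (EMA) of the model's mean top-class confidence on unlabeled batches, and per-class thresholds scale it by the relative EMA of each class's predicted probability. The names `AdaptiveThreshold`, `momentum`, `update`, `thresholds`, and `mask` are hypothetical.

```python
class AdaptiveThreshold:
    """EMA-based confidence thresholds for pseudo-labeling (illustrative sketch)."""

    def __init__(self, num_classes, momentum=0.999):
        self.momentum = momentum
        # Start from the confidence of a uniform predictor, so early in
        # training the threshold is low and more unlabeled samples are used.
        self.global_conf = 1.0 / num_classes
        self.class_prob = [1.0 / num_classes] * num_classes

    def update(self, probs):
        """Update the EMAs from one unlabeled batch of probability vectors."""
        m = self.momentum
        n = len(probs)
        mean_max = sum(max(p) for p in probs) / n
        self.global_conf = m * self.global_conf + (1.0 - m) * mean_max
        for c in range(len(self.class_prob)):
            mean_c = sum(p[c] for p in probs) / n
            self.class_prob[c] = m * self.class_prob[c] + (1.0 - m) * mean_c

    def thresholds(self):
        """Per-class thresholds: the global level scaled by relative class EMA."""
        top = max(self.class_prob)
        return [self.global_conf * p / top for p in self.class_prob]

    def mask(self, probs):
        """True where a sample's top-class confidence clears that class's threshold."""
        th = self.thresholds()
        keep = []
        for p in probs:
            c = max(range(len(p)), key=p.__getitem__)
            keep.append(p[c] >= th[c])
        return keep
```

In a training loop one would call `update` on each unlabeled batch and use `mask` to decide which pseudo-labels contribute to the consistency loss. Classes the model rarely predicts get a lower threshold, so they are not starved of pseudo-labels.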
    Normalizing Flows for Interventional Density Estimation. (arXiv:2209.06203v3 [cs.LG] UPDATED)
    Existing machine learning methods for causal inference usually estimate quantities expressed via the mean of potential outcomes (e.g., average treatment effect). However, such quantities do not capture the full information about the distribution of potential outcomes. In this work, we estimate the density of potential outcomes after interventions from observational data. For this, we propose a novel, fully-parametric deep learning method called Interventional Normalizing Flows. Specifically, we combine two normalizing flows, namely (i) a teacher flow for estimating nuisance parameters and (ii) a student flow for a parametric estimation of the density of potential outcomes. We further develop a tractable optimization objective based on a one-step bias correction for an efficient and doubly robust estimation of the student flow parameters. As a result, our Interventional Normalizing Flows offer a properly normalized density estimator. Across various experiments, we demonstrate that our Interventional Normalizing Flows are expressive and highly effective, and scale well with both sample size and high-dimensional confounding. To the best of our knowledge, our Interventional Normalizing Flows are the first fully-parametric, deep learning method for density estimation of potential outcomes.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
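    The abstract above does not spell out the PnP-$\alpha$PGD update, so the following is only a minimal sketch of a generic averaged (relaxed) proximal gradient iteration, not the authors' algorithm. The function names, the relaxation form, and the toy objective are all assumptions made for illustration. In a PnP setting the proximal step would be replaced by a pre-trained denoiser.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |x| for a scalar v (soft-thresholding)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def relaxed_pgd(grad_f, prox_g, x0, tau, alpha, n_iter):
    """Averaged proximal gradient iteration (assumed, generic form):

        x_{k+1} = (1 - alpha) * x_k + alpha * prox_g(x_k - tau * grad_f(x_k), tau)

    Setting alpha = 1 recovers plain proximal gradient descent.
    """
    x = x0
    for _ in range(n_iter):
        step = prox_g(x - tau * grad_f(x), tau)  # forward-backward step
        x = (1.0 - alpha) * x + alpha * step     # relaxation (averaging)
    return x


# Toy problem (hypothetical): minimize 0.5 * (x - 3)^2 + |x|, minimizer x = 2.
x_star = relaxed_pgd(
    grad_f=lambda x: x - 3.0,
    prox_g=soft_threshold,
    x0=0.0,
    tau=1.0,
    alpha=0.5,
    n_iter=60,
)
print(x_star)
```

    With alpha below 1 each iterate moves only part of the way toward the proximal step, which is the usual motivation for relaxation: it trades speed for stability.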
    OPT-GAN: A Broad-Spectrum Global Optimizer for Black-box Problems by Learning Distribution. (arXiv:2102.03888v6 [cs.LG] UPDATED)
    Black-box optimization (BBO) algorithms are concerned with finding the best solutions for problems with missing analytical details. Most classical methods for such problems are based on strong and fixed a priori assumptions, such as Gaussianity. However, complex real-world problems, especially when the global optimum is desired, could be very far from the a priori assumptions because of their diversity, causing unexpected obstacles. In this study, we propose a generative adversarial net-based broad-spectrum global optimizer (OPT-GAN) which gradually estimates the distribution of the optimum, with strategies to balance the exploration-exploitation trade-off. It has the potential to adapt better to the regularity and structure of diversified landscapes than other methods with a fixed prior, e.g., a Gaussian assumption or separability. Experiments on diverse BBO benchmarks and high-dimensional real-world applications show that OPT-GAN outperforms other traditional and neural net-based BBO algorithms.
    Subgroup Fairness in Two-Sided Markets. (arXiv:2106.02702v2 [cs.AI] UPDATED)
    It is well known that two-sided markets are unfair in a number of ways. For instance, female workers at Uber earn less than their male colleagues per mile driven. Similar observations have been made for other minority subgroups in other two-sided markets. Here, we suggest a novel market-clearing mechanism for two-sided markets, which promotes equalisation of the pay per hour worked across multiple subgroups, as well as within each subgroup. In the process, we introduce a novel notion of subgroup fairness (which we call Inter-fairness), which can be combined with other notions of fairness within each subgroup (called Intra-fairness), and the utility for the customers (Customer-Care) in the objective of the market-clearing problem. While the novel non-linear terms in the objective complicate market clearing by making the problem non-convex, we show that a certain non-convex augmented Lagrangian relaxation can be approximated to any precision in time polynomial in the number of market participants using semi-definite programming. This makes it possible to implement the market-clearing mechanism efficiently. On the example of driver-ride assignment in an Uber-like system, we demonstrate the efficacy and scalability of the approach, and trade-offs between Inter- and Intra-fairness.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2),$ where $m \ll N.$ Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.
    SecGNN: Privacy-Preserving Graph Neural Network Training and Inference as a Cloud Service. (arXiv:2202.07835v2 [cs.CR] UPDATED)
    Graphs are widely used to model the complex relationships among entities. As a powerful tool for graph analytics, graph neural networks (GNNs) have recently gained wide attention due to their end-to-end processing capabilities. With the proliferation of cloud computing, it is increasingly popular to deploy the services of complex and resource-intensive model training and inference in the cloud due to its prominent benefits. However, GNN training and inference services, if deployed in the cloud, will raise critical privacy concerns about the information-rich and proprietary graph data (and the resulting model). While there has been some work on secure neural network training and inference, these efforts all focus on convolutional neural networks handling images and text rather than complex graph data with rich structural information. In this paper, we design, implement, and evaluate SecGNN, the first system supporting privacy-preserving GNN training and inference services in the cloud. SecGNN is built from a synergy of insights on lightweight cryptography and machine learning techniques. We deeply examine the procedure of GNN training and inference, and devise a series of corresponding secure customized protocols to support the holistic computation. Extensive experiments demonstrate that SecGNN achieves comparable plaintext training and inference accuracy, with promising performance.
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Transport with Support: Data-Conditional Diffusion Bridges. (arXiv:2301.13636v1 [cs.LG])
    The dynamic Schrödinger bridge problem provides an appealing setting for solving optimal transport problems by learning non-linear diffusion processes using efficient iterative solvers. Recent works have demonstrated state-of-the-art results (e.g., in modelling single-cell embryo RNA sequences or sampling from complex posteriors) but are limited to learning bridges with only initial and terminal constraints. Our work extends this paradigm by proposing the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and optimal control into learning the diffusion process, enabling constrained stochastic processes governed by sparse observations at intermediate stages and terminal constraints. We assess the effectiveness of our method on synthetic and real-world data and show that the ISB generalises well to high-dimensional data, is computationally efficient, and provides accurate estimates of the marginals at intermediate and terminal times.
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine the probability theory which deals with uncertainty and logical structure represented by a graph which allows one to cope with the computational complexity and also interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs) which are static, and continuous time Bayesian networks which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    BRAIxDet: Learning to Detect Malignant Breast Lesion with Incomplete Annotations. (arXiv:2301.13418v1 [cs.CV])
    Methods to detect malignant lesions from screening mammograms are usually trained with fully annotated datasets, where images are labelled with the localisation and classification of cancerous lesions. However, real-world screening mammogram datasets commonly have a subset that is fully annotated and another subset that is weakly annotated with just the global classification (i.e., without lesion localisation). Given the large size of such datasets, researchers usually face a dilemma with the weakly annotated subset: to not use it or to fully annotate it. The first option will reduce detection accuracy because it does not use the whole dataset, and the second option is too expensive given that the annotation needs to be done by expert radiologists. In this paper, we propose a middle-ground solution for the dilemma, which is to formulate the training as a weakly- and semi-supervised learning problem that we refer to as malignant breast lesion detection with incomplete annotations. To address this problem, our new method comprises two stages, namely: 1) pre-training a multi-view mammogram classifier with weak supervision from the whole dataset, and 2) extending the trained classifier to become a multi-view detector that is trained with semi-supervised student-teacher learning, where the training set contains fully and weakly-annotated mammograms. We provide extensive detection results on two real-world screening mammogram datasets containing incomplete annotations, and show that our proposed approach achieves state-of-the-art results in the detection of malignant breast lesions with incomplete annotations.
    Towards Learned Emulation of Interannual Water Isotopologue Variations in General Circulation Models. (arXiv:2301.13462v1 [physics.ao-ph])
    Simulating abundances of stable water isotopologues, i.e. molecules differing in their isotopic composition, within climate models allows for comparisons with proxy data and, thus, for testing hypotheses about past climate and validating climate models under varying climatic conditions. However, many models are run without explicitly simulating water isotopologues. We investigate the possibility of replacing the explicit physics-based simulation of oxygen isotopic composition in precipitation with machine learning methods. These methods estimate isotopic composition at each time step for given fields of surface temperature and precipitation amount. We implement convolutional neural networks (CNNs) based on the successful UNet architecture and test whether a spherical network architecture outperforms the naive approach of treating Earth's latitude-longitude grid as a flat image. Conducting a case study on a last millennium run with the iHadCM3 climate model, we find that roughly 40% of the temporal variance in the isotopic composition is explained by the emulations on interannual and monthly timescales, with spatially varying emulation quality. A modified version of the standard UNet architecture for flat images yields results that are as good as the predictions by the spherical CNN. We test generalization to last millennium runs of other climate models and find that while the tested deep learning methods yield the best results on iHadCM3 data, the performance drops when predicting on other models and is comparable to simple pixel-wise linear regression. An extended choice of predictor variables and improving the robustness of learned climate-oxygen isotope relationships should be explored in future work.
    STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining. (arXiv:2207.05022v3 [cs.LG] UPDATED)
    Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable for crucially preserving user data privacy and avoiding network roundtrips. Yet, the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases one app's memory footprint by several times, limiting its benefits to only a few inferences before being recycled by mobile memory management. On the other hand, loading the model from storage on demand incurs IO as long as a few seconds, far exceeding the delay range acceptable to a user; pipelining layerwise model loading and execution does not hide IO either, due to the high skewness between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding. STI manages model parameters as independently tunable shards, and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer. STI instantiates an IO/compute pipeline and uses a small buffer for preload shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards per their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it against a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracies with 1-2 orders of magnitude lower memory, outperforming competitive baselines.
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of the problem at hand instead of setting the sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g., the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
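    The testing-by-betting principle mentioned above can be illustrated with a generic wealth process. This is a sketch under stated assumptions, not the SKIT procedure itself: the payoffs are assumed to lie in [-1, 1] with zero conditional mean under the null, the bet fraction is fixed (practical schemes typically adapt it), and the function name `betting_test` is hypothetical. Under these assumptions the wealth is a non-negative martingale under the null, so by Ville's inequality rejecting when it reaches 1/alpha keeps the false alarm rate at or below alpha at any stopping time.

```python
def betting_test(payoffs, alpha=0.05, bet_fraction=0.5):
    """Return the 1-indexed step at which the null is rejected, or None.

    payoffs: iterable of values in [-1, 1], assumed to have zero
             conditional mean under the null hypothesis.
    bet_fraction: fixed fraction of wealth staked each round, in [0, 1),
                  so the wealth can never become negative.
    """
    if not 0.0 <= bet_fraction < 1.0:
        raise ValueError("bet_fraction must lie in [0, 1)")
    wealth = 1.0
    for t, g in enumerate(payoffs, start=1):
        if not -1.0 <= g <= 1.0:
            raise ValueError("payoffs must lie in [-1, 1]")
        wealth *= 1.0 + bet_fraction * g   # multiplicative wealth update
        if wealth >= 1.0 / alpha:          # Ville threshold
            return t
    return None


# Strong evidence against the null: every payoff is +1.
print(betting_test([1.0] * 20))
# No evidence: payoffs are all zero, so wealth stays at 1.
print(betting_test([0.0] * 100))
```

    The test can be stopped as soon as the threshold is crossed, which is what allows early stopping on easy problems without a multiple-testing correction.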
    Automated speech- and text-based classification of neuropsychiatric conditions in a multidiagnostic setting. (arXiv:2301.06916v2 [cs.CL] UPDATED)
    Speech patterns have been identified as potential diagnostic markers for neuropsychiatric conditions. However, most studies only compare a single clinical group to healthy controls, whereas clinical practice often requires differentiating between multiple potential diagnoses (multiclass settings). To address this, we assembled a dataset of repeated recordings from 420 participants (67 with major depressive disorder, 106 with schizophrenia and 46 with autism, as well as matched controls), and tested the performance of a range of conventional machine learning models and advanced Transformer models on both binary and multiclass classification, based on voice and text features. While binary models performed comparably to previous research (F1 scores between 0.54-0.75 for autism spectrum disorder, ASD; 0.67-0.92 for major depressive disorder, MDD; and 0.71-0.83 for schizophrenia), performance decreased markedly when differentiating between multiple diagnostic groups (F1 scores between 0.35-0.44 for ASD, 0.57-0.75 for MDD, 0.15-0.66 for schizophrenia, and 0.38-0.52 macro F1). Combining voice and text-based models yielded increased performance, suggesting that they capture complementary diagnostic information. Our results indicate that models trained on binary classification may learn to rely on markers of generic differences between clinical and non-clinical populations, or markers of clinical features that overlap across conditions, rather than identifying markers specific to individual conditions. We provide recommendations for future research in the field, suggesting increased focus on developing larger transdiagnostic datasets that include more fine-grained clinical features, and that can support the development of models that better capture the complexity of neuropsychiatric conditions and naturalistic diagnostic assessment.
    Low Complexity Adaptive Machine Learning Approaches for End-to-End Latency Prediction. (arXiv:2301.13536v1 [cs.NI])
    Software Defined Networks have opened the door to statistical and AI-based techniques to improve the efficiency of networking, especially to ensure a certain Quality of Service (QoS) for specific applications by routing packets with awareness of the nature of the content (VoIP, video, files, etc.) and its needs (latency, bandwidth, etc.) so that network resources are used efficiently. Monitoring and predicting various Key Performance Indicators (KPIs) at any level may handle such problems while preserving network bandwidth. The question addressed in this work is the design of efficient, low-cost adaptive algorithms for KPI estimation, monitoring and prediction. We focus on end-to-end latency prediction, for which we illustrate our approaches and results on data obtained from a public generator provided after the recent international challenge on GNN [12]. In this paper, we improve our previously proposed low-cost estimators [6] by adding the adaptive dimension, and show that performance is minimally affected while gaining the ability to track varying networks.
    Inference Time Evidences of Adversarial Attacks for Forensic on Transformers. (arXiv:2301.13356v1 [cs.CV])
    Vision Transformers (ViTs) are becoming a very popular paradigm for vision tasks as they achieve state-of-the-art performance on image classification. However, although early works implied that this network structure had increased robustness against adversarial attacks, some works argue ViTs are still vulnerable. This paper presents our first attempt toward detecting adversarial attacks during inference time using the network's input and outputs as well as latent features. We design four quantifications (or derivatives) of input, output, and latent vectors of ViT-based models that provide a signature of the inference, which could be beneficial for the attack detection, and empirically study their behavior over clean samples and adversarial samples. The results demonstrate that the quantifications from input (images) and output (posterior probabilities) are promising for distinguishing clean and adversarial samples, while latent vectors offer less discriminative power, though they give some insights on how adversarial perturbations work.
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.
    Unconstrained Dynamic Regret via Sparse Coding. (arXiv:2301.13349v1 [cs.LG])
    Motivated by time series forecasting, we study Online Linear Optimization (OLO) under the coupling of two problem structures: the domain is unbounded, and the performance of an algorithm is measured by its dynamic regret. Handling either of them requires the regret bound to depend on a certain complexity measure of the comparator sequence -- specifically, the comparator norm in unconstrained OLO, and the path length in dynamic regret. In contrast to a recent work (Jacobsen & Cutkosky, 2022) that adapts to the combination of these two complexity measures, we propose an alternative complexity measure by recasting the problem into sparse coding. Adaptivity can be achieved by a simple modular framework, which naturally exploits more intricate prior knowledge of the environment. Along the way, we also present a new gradient adaptive algorithm for static unconstrained OLO, designed using novel continuous time machinery. This could be of independent interest.
    Sequential Strategic Screening. (arXiv:2301.13397v1 [cs.LG])
    We initiate the study of strategic behavior in screening processes with multiple classifiers. We focus on two contrasting settings: a conjunctive setting in which an individual must satisfy all classifiers simultaneously, and a sequential setting in which an individual must satisfy the classifiers one at a time to succeed. In other words, we introduce the combination of strategic classification with screening processes. We show that sequential screening pipelines exhibit new and surprising behavior where individuals can exploit the sequential ordering of the tests to zig-zag between classifiers without having to simultaneously satisfy all of them. We demonstrate that an individual can obtain a positive outcome using a limited manipulation budget even when far from the intersection of the positive regions of every classifier. Finally, we consider a learner whose goal is to design a sequential screening process that is robust to such manipulations, and provide a construction for the learner that optimizes a natural objective.
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.
    An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation. (arXiv:2301.13383v1 [cs.SD])
    Pitch and meter are two fundamental music features for symbolic music generation tasks, where researchers usually choose different encoding methods depending on specific goals. However, the advantages and drawbacks of different encoding methods have not been frequently discussed. This paper presents an integrated analysis of the influence of two low-level features, pitch and meter, on the performance of a token-based sequential music generation model. First, the commonly used MIDI number encoding and a less used class-octave encoding are compared. Second, a dense intra-bar metric grid is imposed on the encoded sequence as auxiliary features. Different complexities and resolutions of the metric grid are compared. For complexity, the single token approach and the multiple token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level), 12 (8th-triplet-level), up to 64 (64th-note-grid-level) are compared; for duration resolution, 4, 8, 12 and 16 subdivisions per beat are compared. All different encodings are tested on separately trained Transformer-XL models for a melody generation task. Regarding distribution similarity of several objective evaluation metrics to the test dataset, results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics; finer grids and multiple-token grids improve the rhythmic quality, but also suffer from over-fitting at an early training stage. Results display a general phenomenon of over-fitting from two aspects, the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we both demonstrate the feasibility and raise the concern of the easy over-fitting problem of using smaller networks and lower embedding dimensions on the generation task. The findings can also contribute to future models in terms of feature engineering.
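    To make the contrast between MIDI number encoding and class-octave encoding concrete, here is a minimal sketch. The abstract does not give the paper's exact tokenization, so the function names and the octave convention (MIDI 60 = C4) are assumptions for illustration only.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_class_octave(midi_number):
    """Split a MIDI note number (0-127) into (pitch_class, octave).

    Assumes the common convention in which MIDI 60 is C4, so the
    octave index runs from -1 to 9.
    """
    if not 0 <= midi_number <= 127:
        raise ValueError("MIDI note numbers range from 0 to 127")
    pitch_class = midi_number % 12   # 0 = C, 1 = C#, ..., 11 = B
    octave = midi_number // 12 - 1
    return pitch_class, octave


def class_octave_to_midi(pitch_class, octave):
    """Inverse mapping: rebuild the MIDI note number."""
    return 12 * (octave + 1) + pitch_class


# One token with 128 possible values becomes two tokens with 12 and 11 values.
pc, octv = midi_to_class_octave(69)   # concert A
print(PITCH_CLASSES[pc], octv)
```

    Under this split, notes an octave apart share a pitch-class token, which a flat MIDI number vocabulary does not expose directly.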
    Classified as unknown: A novel Bayesian neural network. (arXiv:2301.13401v1 [cs.LG])
    We establish estimations for the parameters of the output distribution for the softmax activation function using the probit function. As an application, we develop a new efficient Bayesian learning algorithm for fully connected neural networks, where training and predictions are performed within the Bayesian inference framework in closed-form. This approach allows sequential learning and requires no computationally expensive gradient calculation and Monte Carlo sampling. Our work generalizes the Bayesian algorithm for a single perceptron for binary classification in \cite{H} to multi-layer perceptrons for multi-class classification.
    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. (arXiv:2301.13688v1 [cs.AI])
    We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally-efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
    Fast Resolution Agnostic Neural Techniques to Solve Partial Differential Equations. (arXiv:2301.13331v1 [cs.AI])
    Numerical approximations of partial differential equations (PDEs) are routinely employed to formulate the solution of physics, engineering and mathematical problems involving functions of several variables, such as the propagation of heat or sound, fluid flow, elasticity, electrostatics, electrodynamics, and more. While this has led to solving many complex phenomena, there are still significant limitations. Conventional approaches such as Finite Element Methods (FEMs) and Finite Difference Methods (FDMs) require considerable time and are computationally expensive. In contrast, machine learning-based methods such as neural networks are faster once trained, but tend to be restricted to a specific discretization. This article aims to provide a comprehensive summary of conventional methods and recent machine learning-based methods to approximate PDEs numerically. Furthermore, we highlight several key architectures centered around the neural operator, a novel and fast approach (1000x) to learning the solution operator of a PDE. We will note how these new computational approaches can bring immense advantages in tackling many problems in fundamental and applied physics.
    DNN Explanation for Safety Analysis: an Empirical Evaluation of Clustering-based Approaches. (arXiv:2301.13506v1 [cs.SE])
    The adoption of deep neural networks (DNNs) in safety-critical contexts is often prevented by the lack of effective means to explain their results, especially when they are erroneous. In our previous work, we proposed a white-box approach (HUDD) and a black-box approach (SAFE) to automatically characterize DNN failures. They both identify clusters of similar images from a potentially large set of images leading to DNN failures. However, the analysis pipelines for HUDD and SAFE were instantiated in specific ways according to common practices, deferring the analysis of other pipelines to future work. In this paper, we report on an empirical evaluation of 99 different pipelines for root cause analysis of DNN failures. They combine transfer learning, autoencoders, heatmaps of neuron relevance, dimensionality reduction techniques, and different clustering algorithms. Our results show that the best pipeline combines transfer learning, DBSCAN, and UMAP. It leads to clusters almost exclusively capturing images of the same failure scenario, thus facilitating root cause analysis. Further, it generates distinct clusters for each root cause of failure, thus enabling engineers to detect all the unsafe scenarios. Interestingly, these results hold even for failure scenarios that are only observed in a small percentage of the failing images.
    Holistic Graph-based Motion Prediction. (arXiv:2301.13545v1 [cs.RO])
    Motion prediction for automated vehicles in complex environments is a difficult task that must be mastered if automated vehicles are to be used in arbitrary situations. Many factors influence the future motion of traffic participants, ranging from traffic rules and the interactions among participants to the personal habits of human drivers. We therefore present a novel approach to graph-based prediction built on a heterogeneous holistic graph representation that combines temporal information, properties and relations between traffic participants as well as relations with static elements like the road network. The information is encoded through different types of nodes and edges, both of which are enriched with arbitrary features. We evaluated the approach on the INTERACTION and Argoverse datasets and conducted an informative ablation study to demonstrate the benefit of different types of information for the motion prediction quality.
    Contrast and Clustering: Learning Neighborhood Pair Representation for Source-free Domain Adaptation. (arXiv:2301.13428v1 [cs.CV])
    Domain adaptation has attracted a great deal of attention in the machine learning community, but it requires access to source data, which often raises concerns about data privacy. We are thus motivated to address these issues and propose a simple yet efficient method. This work treats domain adaptation as an unsupervised clustering problem and trains the target model without access to the source data. Specifically, we propose a loss function called contrast and clustering (CaC), where a positive pair term pulls neighbors belonging to the same class together in the feature space to form clusters, while a negative pair term pushes samples of different classes apart. In addition, extended neighbors are taken into account by querying the nearest neighbor indexes in the memory bank to mine for more valuable negative pairs. Extensive experiments on three common benchmarks, VisDA, Office-Home and Office-31, demonstrate that our method achieves state-of-the-art performance. The code will be made publicly available at https://github.com/yukilulu/CaC.
    DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models. (arXiv:2301.13629v1 [cs.LG])
    Spatio-temporal graph neural networks (STGNN) have emerged as the dominant model for spatio-temporal graph (STG) forecasting. Despite their success, they fail to model intrinsic uncertainties within STG data, which cripples their practicality in downstream tasks for decision-making. To this end, this paper focuses on probabilistic STG forecasting, which is challenging due to the difficulty in modeling uncertainties and complex ST dependencies. In this study, we present the first attempt to generalize the popular denoising diffusion probabilistic models to STGs, leading to a novel non-autoregressive framework called DiffSTG, along with the first denoising network UGnet for STG in the framework. Our approach combines the spatio-temporal learning capabilities of STGNNs with the uncertainty measurements of diffusion models. Extensive experiments validate that DiffSTG reduces the Continuous Ranked Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7% over existing methods on three real-world datasets.
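    For readers unfamiliar with the CRPS metric cited above, the following is a minimal sketch of the standard sample-based estimator, CRPS ≈ E|X − y| − ½·E|X − X′|, for a single scalar observation. The function name `crps_from_samples` is hypothetical, and this is not taken from the DiffSTG code; how the authors aggregate the score over nodes and time steps is not stated in the abstract.

```python
def crps_from_samples(samples, observation):
    """Sample-based CRPS estimate for one scalar observation (lower is better).

    Uses the identity CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|,
    with both expectations replaced by averages over the forecast samples.
    """
    n = len(samples)
    # Average distance between forecast samples and the observed value.
    term_obs = sum(abs(x - observation) for x in samples) / n
    # Average pairwise distance between forecast samples (spread of the forecast).
    term_spread = sum(abs(xi - xj) for xi in samples for xj in samples) / (n * n)
    return term_obs - 0.5 * term_spread


if __name__ == "__main__":
    wide = crps_from_samples([0.0, 2.0], 1.0)
    sharp = crps_from_samples([0.9, 1.1], 1.0)
    print(f"wide forecast: {wide:.3f}, sharp forecast: {sharp:.3f}")
```

    A forecast that is both centered on the observation and tightly concentrated receives a lower score, which is why the metric rewards calibrated uncertainty rather than point accuracy alone.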
    Support Exploration Algorithm for Sparse Support Recovery. (arXiv:2301.13584v1 [cs.LG])
    We introduce a new sparsity-promoting algorithm called the Support Exploration Algorithm (SEA) and analyze it in the context of support recovery/model selection problems. The algorithm can be interpreted as an instance of the straight-through estimator (STE) applied to the resolution of a sparse linear inverse problem. SEA uses a non-sparse exploratory vector and makes it evolve in the input space to select the sparse support. We exhibit an oracle update rule for the exploratory vector and consider the STE update. The theoretical analysis establishes general sufficient conditions for support recovery. The general conditions are specialized to the case where the matrix $A$ performing the linear measurements satisfies the Restricted Isometry Property (RIP). Experiments show that SEA can efficiently improve the results of any algorithm. Because of its exploratory nature, SEA also performs remarkably well when the columns of $A$ are strongly coherent.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee a stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNets. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNets over vanilla feedforward networks. Empirically, with such a factor $\tau$, one can train deep ResNets without normalization layers. Moreover, for ResNets with normalization layers, adding such a factor $\tau$ also stabilizes the training and yields a significant performance gain for deep ResNets.
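    The effect of the scaling factor can be illustrated with a small numerical sketch. The code below assumes a generic residual update h ← h + τ·ReLU(W h) with freshly drawn Gaussian weights at every block; the function `forward_norm`, the width, and the initialization are hypothetical choices for illustration and are not claimed to match the paper's exact architecture or analysis.

```python
import math
import random


def forward_norm(depth, tau, width=32, seed=0):
    """Norm of the activation after `depth` residual blocks with branch scale `tau`.

    Assumed block (illustrative only): h <- h + tau * relu(W h),
    where W has i.i.d. N(0, 2/width) entries drawn fresh for each block.
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm = math.sqrt(sum(x * x for x in h))
    h = [x / norm for x in h]  # start from a unit-norm input
    std = math.sqrt(2.0 / width)
    for _ in range(depth):
        branch = []
        for _ in range(width):
            # One row of W applied to h, followed by ReLU.
            pre = sum(rng.gauss(0.0, std) * x for x in h)
            branch.append(max(pre, 0.0))
        # Residual connection with the parametric branch scaled by tau.
        h = [x + tau * b for x, b in zip(h, branch)]
    return math.sqrt(sum(x * x for x in h))


if __name__ == "__main__":
    L = 200
    print("tau = 1/sqrt(L):", forward_norm(L, tau=1.0 / math.sqrt(L)))
    print("tau = 1:        ", forward_norm(L, tau=1.0))
```

    In this toy setting the unscaled network's activation norm grows geometrically with depth, while the τ = 1/√L variant grows far more slowly; the sketch only illustrates the qualitative gap and does not reproduce the paper's bounds.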
    Superhuman Fairness. (arXiv:2301.13420v1 [cs.LG])
    The fairness of machine learning-based decisions has become an increasingly important focus in the design of supervised machine learning methods. Most fairness approaches optimize a specified trade-off between performance measure(s) (e.g., accuracy, log loss, or AUC) and fairness metric(s) (e.g., demographic parity, equalized odds). This begs the question: are the right performance-fairness trade-offs being specified? We instead re-cast fair machine learning as an imitation learning task by introducing superhuman fairness, which seeks to simultaneously outperform human decisions on multiple predictive performance and fairness measures. We demonstrate the benefits of this approach given suboptimal decisions.
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Collision-aware In-hand 6D Object Pose Estimation using Multiple Vision-based Tactile Sensors. (arXiv:2301.13667v1 [cs.RO])
    In this paper, we address the problem of estimating the in-hand 6D pose of an object in contact with multiple vision-based tactile sensors. We reason on the possible spatial configurations of the sensors along the object surface. Specifically, we filter contact hypotheses using geometric reasoning and a Convolutional Neural Network (CNN), trained on simulated object-agnostic images, to promote those that better comply with the actual tactile images from the sensors. We use the selected sensor configurations to optimize over the space of 6D poses using a Gradient Descent-based approach. We finally rank the obtained poses by penalizing those that are in collision with the sensors. We carry out experiments in simulation using the DIGIT vision-based sensor with several objects from the standard YCB model set. The results demonstrate that our approach estimates object poses that are compatible with actual object-sensor contacts in $87.5\%$ of cases while reaching an average positional error on the order of $2$ centimeters. Our analysis also includes qualitative results of experiments with a real DIGIT sensor.
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.
    Learning, Fast and Slow: A Goal-Directed Memory-Based Approach for Dynamic Environments. (arXiv:2301.13758v1 [cs.AI])
    Model-based next state prediction and state value prediction are slow to converge. To address these challenges, we do the following: i) Instead of a neural network, we do model-based planning using a parallel memory retrieval system (which we term the slow mechanism); ii) Instead of learning state values, we guide the agent's actions using goal-directed exploration, by using a neural network to choose the next action given the current state and the goal state (which we term the fast mechanism). The goal-directed exploration is trained online using hippocampal replay of visited states and future imagined states every single time step, leading to fast and efficient training. Empirical studies show that our proposed method has a 92% solve rate across 100 episodes in a dynamically changing grid world, significantly outperforming state-of-the-art actor critic mechanisms such as PPO (54%), TRPO (50%) and A2C (24%). Ablation studies demonstrate that both mechanisms are crucial. We posit that the future of Reinforcement Learning (RL) will be to model goals and sub-goals for various tasks, and plan it out in a goal-directed memory-based approach.
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables high flexibility in model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.
    Review of methods for automatic cerebral microbleeds detection. (arXiv:2301.13549v1 [cs.CV])
    Detecting cerebral microbleeds (CMBs) is an important and challenging task. With the growing popularity of MRI, the ability to detect cerebral microbleeds is also rising. Unfortunately, for radiologists, it is a time-consuming and laborious procedure. For this reason, various solutions to automate this process have been proposed over the past several years, but none of them is currently used in medical practice. In this context, the need to systematize existing knowledge and best practices has been recognized as a factor that would facilitate the near-term development of a CMB detection system practically applicable in medicine. To the best of our knowledge, all available publications on automatic cerebral microbleed detection have been gathered, described, and assessed in this paper in order to characterize the current state of research and provide a starting point for future studies.
    A Mathematical Model for Curriculum Learning. (arXiv:2301.13833v1 [cs.LG])
    Curriculum learning (CL) - training using samples that are generated and presented in a meaningful order - was introduced in the machine learning context around a decade ago. While CL has been extensively used and analysed empirically, there has been very little mathematical justification for its advantages. We introduce a CL model for learning the class of k-parities on d bits of a binary string with a neural network trained by stochastic gradient descent (SGD). We show that a wise choice of training examples, involving two or more product distributions, makes it possible to significantly reduce the computational cost of learning this class of functions, compared to learning under the uniform distribution. We conduct experiments to support our analysis. Furthermore, we show that for another class of functions - namely the 'Hamming mixtures' - CL strategies involving a bounded number of product distributions are not beneficial, while we conjecture that CL with unboundedly many curriculum steps can learn this class efficiently.
    What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods. (arXiv:2112.04417v3 [cs.CV] UPDATED)
    A multitude of explainability methods and associated fidelity performance metrics have been proposed to help better understand how modern AI systems make decisions. However, much of the current work has remained theoretical -- without much consideration for the human end-user. In particular, it is not yet known (1) how useful current explainability methods are in practice for more real-world scenarios and (2) how well associated performance metrics accurately predict how much knowledge individual explanations contribute to a human end-user trying to understand the inner-workings of the system. To fill this gap, we conducted psychophysics experiments at scale to evaluate the ability of human participants to leverage representative attribution methods for understanding the behavior of different image classifiers representing three real-world scenarios: identifying bias in an AI system, characterizing the visual strategy it uses for tasks that are too difficult for an untrained non-expert human observer as well as understanding its failure cases. Our results demonstrate that the degree to which individual attribution methods help human participants better understand an AI system varied widely across these scenarios. This suggests a critical need for the field to move past quantitative improvements of current attribution methods towards the development of complementary approaches that provide qualitatively different sources of information to human end-users.
    Towards a Defense Against Federated Backdoor Attacks Under Continuous Training. (arXiv:2205.11736v4 [cs.LG] UPDATED)
    Backdoor attacks are dangerous and difficult to prevent in federated learning (FL), where training data is sourced from untrusted clients over long periods of time. These difficulties arise because: (a) defenders in FL do not have access to raw training data, and (b) a new phenomenon we identify called backdoor leakage causes models trained continuously to eventually suffer from backdoors due to cumulative errors in defense mechanisms. We propose shadow learning, a framework for defending against backdoor attacks in the FL setting under long-range training. Shadow learning trains two models in parallel: a backbone model and a shadow model. The backbone is trained without any defense mechanism to obtain good performance on the main task. The shadow model combines filtering of malicious clients with early-stopping to control the attack success rate even as the data distribution changes. We theoretically motivate our design and show experimentally that our framework significantly improves upon existing defenses against backdoor attacks.
    Company-as-Tribe: Company Financial Risk Assessment on Tribe-Style Graph with Hierarchical Graph Neural Networks. (arXiv:2301.13492v1 [cs.LG])
    Company financial risk is ubiquitous, and early risk assessment for listed companies can avoid considerable losses. Traditional methods mainly focus on the financial statements of companies and overlook the complex relationships among them. However, the financial statements are often biased and lagged, making it difficult to identify risks accurately and in a timely manner. To address these challenges, we redefine the problem as company financial risk assessment on tribe-style graph by taking each listed company and its shareholders as a tribe and leveraging financial news to build inter-tribe connections. Such tribe-style graphs present different patterns to distinguish risky companies from normal ones. However, most nodes in the tribe-style graph lack attributes, making it difficult to directly adopt existing graph learning methods (e.g., Graph Neural Networks (GNNs)). In this paper, we propose a novel Hierarchical Graph Neural Network (TH-GNN) for tribe-style graphs that operates at two levels: the first level encodes the structure pattern of the tribes with contrastive learning, and the second level diffuses information based on the inter-tribe relations, achieving effective and efficient risk assessment. Extensive experiments on the real-world company dataset show that our method achieves significant improvements on financial risk assessment over previous competing methods. Also, the extensive ablation studies and visualization comprehensively show the effectiveness of our method.
    Image Shortcut Squeezing: Countering Perturbative Availability Poisons with Compression. (arXiv:2301.13838v1 [cs.CR])
    Perturbative availability poisoning (PAP) adds small changes to images to prevent their use for model training. Current research adopts the belief that practical and effective approaches to countering such poisons do not exist. In this paper, we argue that it is time to abandon this belief. We present extensive experiments showing that 12 state-of-the-art PAP methods are vulnerable to Image Shortcut Squeezing (ISS), which is based on simple compression. For example, on average, ISS restores the CIFAR-10 model accuracy to $81.73\%$, surpassing the previous best preprocessing-based countermeasures by $37.97\%$ absolute. ISS also (slightly) outperforms adversarial training and has higher generalizability to unseen perturbation norms and also higher efficiency. Our investigation reveals that the property of PAP perturbations depends on the type of surrogate model used for poison generation, and it explains why a specific ISS compression yields the best performance for a specific type of PAP perturbation. We further test stronger, adaptive poisoning, and show it falls short of being an ideal defense against ISS. Overall, our results demonstrate the importance of considering various (simple) countermeasures to ensure the meaningfulness of analysis carried out during the development of availability poisons.
    LogAI: A Library for Log Analytics and Intelligence. (arXiv:2301.13415v1 [cs.AI])
    Software and system logs record runtime information about processes executing within a system. These logs have become the most critical and ubiquitous forms of observability data that help developers understand system behavior, monitor system health and resolve issues. However, the volume of logs generated can be enormous (on the order of petabytes per day), especially for complex distributed systems such as cloud platforms, search engines and social media. This has propelled a lot of research on developing AI-based log analytics and intelligence solutions that can process huge volumes of raw logs and generate insights. In order to enable users to perform multiple types of AI-based log analysis tasks in a uniform manner, we introduce LogAI (https://github.com/salesforce/logai), a one-stop open source library for log analytics and intelligence. LogAI supports tasks such as log summarization, log clustering and log anomaly detection. It adopts the OpenTelemetry data model to enable compatibility with different log management platforms. LogAI provides a unified model interface and popular time-series, statistical learning and deep learning models. Alongside this, LogAI also provides an out-of-the-box GUI for users to conduct interactive analysis. With LogAI, we can also easily benchmark popular deep learning algorithms for log anomaly detection without putting in redundant effort to process the logs. We have open-sourced LogAI to cater to a wide range of applications benefiting both academic research and industrial prototyping.
    Fourier Sensitivity and Regularization of Computer Vision Models. (arXiv:2301.13514v1 [cs.CV])
    Recent work has empirically shown that deep neural networks latch on to the Fourier statistics of training data and show increased sensitivity to Fourier-basis directions in the input. Understanding and modifying this Fourier-sensitivity of computer vision models may help improve their robustness. Hence, in this paper we study the frequency sensitivity characteristics of deep neural networks using a principled approach. We first propose a basis trick, proving that unitary transformations of the input-gradient of a function can be used to compute its gradient in the basis induced by the transformation. Using this result, we propose a general measure of any differentiable model's Fourier-sensitivity using the unitary Fourier-transform of its input-gradient. When applied to deep neural networks, we find that computer vision models are consistently sensitive to particular frequencies dependent on the dataset, training method and architecture. Based on this measure, we further propose a Fourier-regularization framework to modify the Fourier-sensitivities and frequency bias of models. Using our proposed regularizer-family, we demonstrate that deep neural networks obtain improved classification accuracy on robustness evaluations.
    Multicalibration as Boosting for Regression. (arXiv:2301.13767v1 [cs.LG])
    We study the connection between multicalibration and boosting for squared error regression. First we prove a useful characterization of multicalibration in terms of a "swap regret"-like condition on squared error. Using this characterization, we give an exceedingly simple algorithm that can be analyzed both as a boosting algorithm for regression and as a multicalibration algorithm for a class H that makes use only of a standard squared error regression oracle for H. We give a weak learning assumption on H that ensures convergence to Bayes optimality without the need to make any realizability assumptions -- giving us an agnostic boosting algorithm for regression. We then show that our weak learning assumption on H is both necessary and sufficient for multicalibration with respect to H to imply Bayes optimality. We also show that if H satisfies our weak learning condition relative to another class C then multicalibration with respect to H implies multicalibration with respect to C. Finally we investigate the empirical performance of our algorithm experimentally using an open source implementation that we make available. Our code repository can be found at https://github.com/Declancharrison/Level-Set-Boosting.
    CMLCompiler: A Unified Compiler for Classical Machine Learning. (arXiv:2301.13441v1 [cs.LG])
    Classical machine learning (CML) occupies nearly half of machine learning pipelines in production applications. Unfortunately, it fails to utilize the state-of-the-practice devices fully and performs poorly. Without a unified framework, the hybrid deployments of deep learning (DL) and CML also suffer from severe performance and portability issues. This paper presents the design of a unified compiler, called CMLCompiler, for CML inference. We propose two unified abstractions: operator representations and extended computational graphs. The CMLCompiler framework performs the conversion and graph optimization based on two unified abstractions, then outputs an optimized computational graph to DL compilers or frameworks. We implement CMLCompiler on TVM. The evaluation shows CMLCompiler's portability and superior performance. It achieves up to 4.38x speedup on CPU, 3.31x speedup on GPU, and 5.09x speedup on IoT devices, compared to the state-of-the-art solutions -- scikit-learn, intel sklearn, and hummingbird. Our performance of CML and DL mixed pipelines achieves up to 3.04x speedup compared with cross-framework implementations.
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of systems in which rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods have been developed to date, their deployment is limited to the operating conditions present during model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development, resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. An online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors while providing dynamic process limits for normal operation. The benefits of the proposed method are ease of use, fast computation, and deployability, as shown in two case studies of real microgrid operation data.
    Retiring $\Delta$DP: New Distribution-Level Metrics for Demographic Parity. (arXiv:2301.13443v1 [cs.LG])
    Demographic parity is the most widely recognized measure of group fairness in machine learning, which ensures equal treatment of different demographic groups. Numerous works aim to achieve demographic parity by pursuing the commonly used metric $\Delta DP$. Unfortunately, in this paper, we reveal that the fairness metric $\Delta DP$ cannot precisely measure the violation of demographic parity, because it inherently has the following drawbacks: i) zero-value $\Delta DP$ does not guarantee zero violation of demographic parity, ii) $\Delta DP$ values can vary with different classification thresholds. To this end, we propose two new fairness metrics, Area Between Probability density function Curves (ABPC) and Area Between Cumulative density function Curves (ABCC), to precisely measure the violation of demographic parity at the distribution level. The new fairness metrics directly measure the difference between the distributions of the prediction probability for different demographic groups. Thus our proposed new metrics enjoy the following properties: i) zero-value ABCC/ABPC guarantees zero violation of demographic parity; ii) ABCC/ABPC guarantees demographic parity when the classification threshold is adjusted. We further re-evaluate the existing fair models with our proposed fairness metrics and observe different fairness behaviors of those models under the new metrics.
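    The two drawbacks of $\Delta DP$ listed above can be seen on a toy example. The sketch below assumes that ABCC is the area between the two groups' empirical cumulative distribution functions of predicted probabilities over [0, 1], and that $\Delta DP$ is the absolute gap in positive-prediction rates at a fixed threshold; the function names and the four scores are made up for illustration and are not taken from the authors' code.

```python
from bisect import bisect_right


def delta_dp(scores_a, scores_b, threshold=0.5):
    """Absolute gap in positive-prediction rates between two groups at one threshold."""
    rate_a = sum(s >= threshold for s in scores_a) / len(scores_a)
    rate_b = sum(s >= threshold for s in scores_b) / len(scores_b)
    return abs(rate_a - rate_b)


def abcc(scores_a, scores_b):
    """Area between the two empirical CDFs of scores, integrated over [0, 1].

    Assumes every score lies in [0, 1]. Empirical CDFs are step functions,
    so the integral is computed exactly as a sum over the intervals
    between consecutive breakpoints.
    """
    a, b = sorted(scores_a), sorted(scores_b)
    pts = sorted(set([0.0, 1.0] + a + b))
    area = 0.0
    for left, right in zip(pts, pts[1:]):
        cdf_a = bisect_right(a, left) / len(a)  # fraction of group A scores <= left
        cdf_b = bisect_right(b, left) / len(b)  # fraction of group B scores <= left
        area += abs(cdf_a - cdf_b) * (right - left)
    return area


if __name__ == "__main__":
    group_a = [0.1, 0.9]  # hypothetical predicted probabilities
    group_b = [0.4, 0.6]
    print("Delta DP @0.5:", delta_dp(group_a, group_b, 0.5))
    print("Delta DP @0.3:", delta_dp(group_a, group_b, 0.3))
    print("ABCC:", abcc(group_a, group_b))
```

    In this example $\Delta DP$ is zero at threshold 0.5 yet jumps to 0.5 at threshold 0.3, while the distribution-level area stays at about 0.3 regardless of any threshold, which is the behavior the abstract attributes to the new metrics.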
    Quantum contextual bandits and recommender systems for quantum data. (arXiv:2301.13524v1 [quant-ph])
    We study a recommender system for quantum data using the linear contextual bandit framework. In each round, a learner receives an observable (the context) and has to recommend from a finite set of unknown quantum states (the actions) which one to measure. The learner has the goal of maximizing the reward in each round, that is the outcome of the measurement on the unknown state. Using this model we formulate the low energy quantum state recommendation problem where the context is a Hamiltonian and the goal is to recommend the state with the lowest energy. For this task, we study two families of contexts: the Ising model and a generalized cluster model. We observe that if we interpret the actions as different phases of the models then the recommendation is done by classifying the correct phase of the given Hamiltonian and the strategy can be interpreted as an online quantum phase classifier.
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Domain-Generalizable Multiple-Domain Clustering. (arXiv:2301.13530v1 [cs.LG])
    Accurately clustering high-dimensional measurements is vital for adequately analyzing scientific data. Deep learning machinery has remarkably improved clustering capabilities in recent years due to its ability to extract meaningful representations. In this work, we are given unlabeled samples from multiple source domains, and we aim to learn a shared classifier that assigns the examples to various clusters. Evaluation is done by using the classifier for predicting cluster assignments in a previously unseen domain. This setting generalizes the problem of unsupervised domain generalization to the case in which no supervised learning samples are given (completely unsupervised). Towards this goal, we present an end-to-end model and evaluate its capabilities on several multi-domain image datasets. Specifically, we demonstrate that our model is more accurate than schemes that require fine-tuning using samples from the target domain or some level of supervision.
    BALANCE: Bayesian Linear Attribution for Root Cause Localization. (arXiv:2301.13572v1 [cs.LG])
    Root Cause Analysis (RCA) plays an indispensable role in distributed data system maintenance and operations, as it bridges the gap between fault detection and system recovery. Existing works mainly study multidimensional localization or graph-based root cause localization. This paper opens up the possibilities of exploiting the recently developed framework of explainable AI (XAI) for the purpose of RCA. In particular, we propose BALANCE (BAyesian Linear AttributioN for root CausE localization), which formulates the problem of RCA through the lens of attribution in XAI and seeks to explain the anomalies in the target KPIs by the behavior of the candidate root causes. BALANCE consists of three innovative components. First, we propose a Bayesian multicollinear feature selection (BMFS) model to predict the target KPIs given the candidate root causes in a forward manner while promoting sparsity and concurrently paying attention to the correlation between the candidate root causes. Second, we introduce attribution analysis to compute the attribution score for each candidate in a backward manner. Third, we merge the estimated root causes related to each KPI if there are multiple KPIs. We extensively evaluate the proposed BALANCE method on one synthetic dataset as well as three real-world RCA tasks, that is, bad SQL localization, container fault localization, and fault type diagnosis for Exathlon. Results show that BALANCE outperforms the state-of-the-art (SOTA) methods in terms of accuracy with the least amount of running time, and achieves at least $6\%$ higher accuracy than SOTA methods on the real tasks. BALANCE has been deployed to production to tackle real-world RCA problems, and the online results further support its usage for real-time diagnosis in distributed data systems.
    Affinity Uncertainty-based Hard Negative Mining in Graph Contrastive Learning. (arXiv:2301.13340v1 [cs.LG])
    Hard negative mining has been shown to be effective in enhancing self-supervised contrastive learning (CL) on diverse data types, including graph contrastive learning (GCL). Existing hardness-aware CL methods typically treat negative instances that are most similar to the anchor instance as hard negatives, which helps improve the CL performance, especially on image data. However, on graph data this approach often fails to identify the hard negatives and instead leads to many false negatives. This is mainly because the learned graph representations are not sufficiently discriminative, owing to over-smooth representations and/or non-i.i.d. issues in graph data. To tackle this problem, this paper proposes a novel approach that builds a discriminative model on collective affinity information (i.e., two sets of pairwise affinities between the negative instances and the anchor instance) to mine hard negatives in GCL. In particular, the proposed approach evaluates how confident/uncertain the discriminative model is about the affinity of each negative instance to an anchor instance to determine its hardness weight relative to the anchor instance. This uncertainty information is then incorporated into existing GCL loss functions via a weighting term to enhance their performance. The enhanced GCL is theoretically grounded in that the resulting GCL loss is equivalent to a triplet loss with an adaptive margin being exponentially proportional to the learned uncertainty of each negative instance. Extensive experiments on 10 graph datasets show that our approach i) consistently enhances different state-of-the-art GCL methods in both graph and node classification tasks, and ii) significantly improves their robustness against adversarial attacks.
    An investigation of challenges encountered when specifying training data and runtime monitors for safety critical ML applications. (arXiv:2301.13476v1 [cs.SE])
    Context and motivation: The development and operation of critical software that contains machine learning (ML) models requires diligence and established processes. In particular, the training data used during the development of ML models have a major influence on the later behaviour of the system. Runtime monitors are used to provide guarantees for that behaviour. Question / problem: We see major uncertainty in how to specify training data and runtime monitoring for critical ML models and, by this, how to specify the final functionality of the system. In this interview-based study we investigate the underlying challenges behind these difficulties. Principal ideas/results: Based on ten interviews with practitioners who develop ML models for critical applications in the automotive and telecommunication sector, we identified 17 underlying challenges in 6 challenge groups that relate to the challenge of specifying training data and runtime monitoring. Contribution: The article provides a list of the identified underlying challenges related to the difficulties practitioners experience when specifying training data and runtime monitoring for ML models. Furthermore, interconnections between the challenges were found, and based on these connections recommendations are proposed to overcome the root causes of the challenges.
    Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning. (arXiv:2301.13618v1 [cs.LG])
    Many real-time applications (e.g., Augmented/Virtual Reality, cognitive assistance) rely on Deep Neural Networks (DNNs) to process inference tasks. Edge computing is considered a key infrastructure to deploy such applications, as moving computation close to the data sources enables us to meet stringent latency and throughput requirements. However, the constrained nature of edge networks poses several additional challenges to the management of inference workloads: edge clusters cannot provide unlimited processing power to DNN models, and often a trade-off between network and processing time should be considered when it comes to end-to-end delay requirements. In this paper, we focus on the problem of scheduling inference queries on DNN models in edge networks at short timescales (i.e., a few milliseconds). By means of simulations, we analyze several policies in the realistic network settings and workloads of a large ISP, highlighting the need for a dynamic scheduling policy that can adapt to network conditions and workloads. We therefore design ASET, a Reinforcement Learning based scheduling algorithm able to adapt its decisions according to the system conditions. Our results show that ASET effectively provides the best performance compared to static policies when scheduling over a distributed pool of edge resources.
    Faster Predict-and-Optimize with Three-Operator Splitting. (arXiv:2301.13395v1 [cs.LG])
    In many practical settings, a combinatorial problem must be repeatedly solved with similar, but distinct parameters w. Yet, w is not directly observed; only contextual data d that correlates with w is available. It is tempting to use a neural network to predict w given d, but training such a model requires reconciling the discrete nature of combinatorial optimization with the gradient-based frameworks used to train neural networks. One approach to overcoming this issue is to consider a continuous relaxation of the combinatorial problem. While existing approaches of this kind have been shown to be highly effective on small problems (10-100 variables), they do not scale well to large problems. In this work, we show how recent results in operator splitting can be used to design such a system which is easy to train and scales effortlessly to problems with thousands of variables.
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to applying feature-dependent learning rates, and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To address this, we design a simple and tractable estimator whose adversarial risk is optimal up to a multiplicative constant of 1.1124 in the population regime, and which works for any norm.
    Fine Robotic Manipulation without Force/Torque Sensor. (arXiv:2301.13413v1 [cs.RO])
    Force Sensing and Force Control are essential to many industrial applications. Typically, a 6-axis Force/Torque (F/T) sensor is mounted between the robot's wrist and the end-effector in order to measure the forces and torques exerted by the environment onto the robot (the external wrench). Although a typical 6-axis F/T sensor can provide highly accurate measurements, it is expensive and vulnerable to drift and external impacts. Existing methods aiming at estimating the external wrench using only the robot's internal signals are limited in scope: for example, wrench estimation accuracy was mostly validated in free-space motions and simple contacts as opposed to tasks like assembly that require high-precision force control. Here we present a Neural Network based method and argue that by devoting particular attention to the training data structure, it is possible to accurately estimate the external wrench in a wide range of scenarios based solely on internal signals. As an illustration, we demonstrate a pin insertion experiment with 100-micron clearance and a hand-guiding experiment, both performed without external F/T sensors or joint torque sensors. Our result opens the possibility of equipping the existing 2.7 million industrial robots with Force Sensing and Force Control capabilities without any additional hardware.
    Training with Mixed-Precision Floating-Point Assignments. (arXiv:2301.13464v1 [cs.LG])
    When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment -- a mapping from all tensors (arising in training) to precision levels (high or low) -- that keeps most of the tensors in low precision and leads to sufficiently accurate models. We provide a technique that explores this memory-accuracy tradeoff by generating precision assignments that (i) use less memory and (ii) lead to more accurate models at the same time, compared to the precision assignments considered by prior work in low-precision floating-point training. Our method typically provides > 2x memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.
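    To make the notion of a precision assignment concrete, here is a small sketch that treats it as a mapping from tensor names to a level ("high" or "low") and computes the resulting memory footprint. The tensor names, element counts, and the choice of 16-bit for "high" and 8-bit for "low" are illustrative assumptions taken from the examples in the abstract, not the paper's method for choosing an assignment.

```python
# Bits per element for each precision level (assumed: 16-bit high, 8-bit low).
BITS = {"high": 16, "low": 8}

def memory_bytes(tensor_sizes, assignment):
    """Total bytes needed when each tensor is stored at its assigned precision."""
    return sum(n * BITS[assignment[name]] // 8 for name, n in tensor_sizes.items())

# Hypothetical tensors arising in training, with their element counts.
tensor_sizes = {"weights": 1000, "activations": 4000, "gradients": 1000}

all_high = {name: "high" for name in tensor_sizes}
mixed = {"weights": "high", "activations": "low", "gradients": "high"}

baseline = memory_bytes(tensor_sizes, all_high)
reduced = memory_bytes(tensor_sizes, mixed)
print(baseline, reduced, baseline / reduced)
```

    Moving only the largest tensor to low precision already accounts for most of the saving here; which tensors can be lowered without hurting accuracy is the question the paper addresses.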
    Incorporating Recurrent Reinforcement Learning into Model Predictive Control for Adaptive Control in Autonomous Driving. (arXiv:2301.13313v1 [cs.LG])
    Model Predictive Control (MPC) is attracting tremendous attention in the autonomous driving task as a powerful control technique. The success of an MPC controller strongly depends on an accurate internal dynamics model. However, the static parameters, usually learned by system identification, often fail to adapt to both internal and external perturbations in real-world scenarios. In this paper, we (1) reformulate the problem as a Partially Observed Markov Decision Process (POMDP) that absorbs the uncertainties into observations and maintains the Markov property in hidden states; (2) learn a recurrent policy that continually adapts the parameters of the dynamics model via Recurrent Reinforcement Learning (RRL) for optimal and adaptive control; and (3) evaluate the proposed algorithm (referred to as $\textit{MPC-RRL}$) in the CARLA simulator, where it leads to robust behaviours under a wide range of perturbations.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical folds is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins, termed sulcal graphs, makes it possible to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of the multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us to benchmark several state-of-the-art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques in obtaining a population-wise consistent labeling of cortical folds at the sulcal basin level.
    Causality-based CTR Prediction using Graph Neural Networks. (arXiv:2301.12762v1 [cs.IR] CROSS LISTED)
    As a prevalent problem in online advertising, CTR prediction has attracted considerable attention from both academia and industry. Recent studies have established CTR prediction models in the graph neural network (GNN) framework. However, most GNN-based models handle feature interactions in a complete graph while ignoring causal relationships among features, which results in a large drop in performance on out-of-distribution data. This paper is dedicated to developing a causality-based CTR prediction model in the GNN framework (Causal-GNN) that integrates representations of the feature graph, user graph and ad graph in the context of online advertising. In our model, a structured representation learning method (GraphFwFM) is designed to capture high-order representations on the feature graph based on causal discovery among field features in gated graph neural networks (GGNNs), and GraphSAGE is employed to obtain graph representations of users and ads. Experiments conducted on three public datasets demonstrate the superiority of Causal-GNN in AUC and Logloss and the effectiveness of GraphFwFM in capturing high-order representations on the causal feature graph.
    Proxy-based Zero-Shot Entity Linking by Effective Candidate Retrieval. (arXiv:2301.13318v1 [cs.LG])
    A recent advancement in the domain of biomedical Entity Linking is the development of powerful two-stage algorithms: an initial candidate retrieval stage that generates a shortlist of entities for each mention, followed by a candidate ranking stage. However, the effectiveness of both stages is inextricably dependent on computationally expensive components. Specifically, in candidate retrieval via dense representation retrieval it is important to have hard negative samples, which require repeated forward passes and nearest neighbour searches across the entire entity label set throughout training. In this work, we show that pairing a proxy-based metric learning loss with an adversarial regularizer provides an efficient alternative to hard negative sampling in the candidate retrieval stage. In particular, we show competitive performance on the recall@1 metric, thereby providing the option to leave out the expensive candidate ranking step. Finally, we demonstrate how the model can be used in a zero-shot setting to discover out-of-knowledge-base biomedical entities.
    A Framework for Adapting Offline Algorithms to Solve Combinatorial Multi-Armed Bandit Problems with Bandit Feedback. (arXiv:2301.13326v1 [cs.LG])
    We investigate the problem of stochastic, combinatorial multi-armed bandits where the learner only has access to bandit feedback and the reward function can be non-linear. We provide a general framework for adapting discrete offline approximation algorithms into sublinear $\alpha$-regret methods that only require bandit feedback, achieving $\mathcal{O}\left(T^\frac{2}{3}\log(T)^\frac{1}{3}\right)$ expected cumulative $\alpha$-regret dependence on the horizon $T$. The framework only requires the offline algorithms to be robust to small errors in function evaluation. The adaptation procedure does not even require explicit knowledge of the offline approximation algorithm -- the offline algorithm can be used as a black-box subroutine. To demonstrate its utility, the proposed framework is applied to multiple problems in submodular maximization, adapting approximation algorithms for cardinality and for knapsack constraints. The new CMAB algorithms for knapsack constraints outperform a full-bandit method developed for the adversarial setting in experiments with real-world data.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never signals non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a \enquote{teacher} model and then using its predictions to train a \enquote{student} model with the \textit{same} architecture. Specifically, the student's objective function is $\big(\xi*\ell(\text{teacher's predictions}, \text{ student's predictions}) + (1-\xi)*\ell(\text{given labels}, \text{ student's predictions})\big)$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with \textit{noisy labels}. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50\% or 30\% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.
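    The student objective quoted in this abstract can be written out directly. The sketch below assumes a mean squared loss for $\ell$ and toy vectors; the function names and values are illustrative only. It also shows what happens to the weights when $\xi > 1$: the term on the given labels receives a negative coefficient.

```python
import numpy as np

def squared_loss(target, pred):
    """Mean squared error, used here as one possible choice of the loss l."""
    return float(np.mean((np.asarray(target) - np.asarray(pred)) ** 2))

def sd_objective(teacher_pred, labels, student_pred, xi, loss=squared_loss):
    """xi * l(teacher, student) + (1 - xi) * l(labels, student)."""
    return xi * loss(teacher_pred, student_pred) + (1.0 - xi) * loss(labels, student_pred)

# Toy values (hypothetical): the student matches the teacher exactly,
# while the given (possibly noisy) labels are off by one in each coordinate.
teacher_pred = np.array([1.0, 1.0])
labels = np.array([0.0, 2.0])
student_pred = np.array([1.0, 1.0])

print(sd_objective(teacher_pred, labels, student_pred, xi=0.5))  # convex combination
print(sd_objective(teacher_pred, labels, student_pred, xi=1.5))  # label term weighted by -0.5
```

    With $\xi = 1.5$ the objective rewards moving away from the given labels, which is one way to read the abstract's finding that $\xi > 1$ can help when the labels are very noisy.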
    Continuous Spatiotemporal Transformers. (arXiv:2301.13338v1 [cs.LG])
    Modeling spatiotemporal dynamical systems is a fundamental challenge in machine learning. Transformer models have been very successful in NLP and computer vision where they provide interpretable representations of data. However, a limitation of transformers in modeling continuous dynamical systems is that they are fundamentally discrete time and space models and thus have no guarantees regarding continuous sampling. To address this challenge, we present the Continuous Spatiotemporal Transformer (CST), a new transformer architecture that is designed for the modeling of continuous systems. This new framework guarantees a continuous and smooth output via optimization in Sobolev space. We benchmark CST against traditional transformers as well as other spatiotemporal dynamics modeling methods and achieve superior performance in a number of tasks on synthetic and real systems, including learning brain dynamics from calcium imaging data.
    Deep Learning for Reference-Free Geolocation for Poplar Trees. (arXiv:2301.13387v1 [q-bio.GN])
    A core task in precision agriculture is the identification of climatic and ecological conditions that are advantageous for a given crop. The most succinct approach is geolocation, which is concerned with locating the native region of a given sample based on its genetic makeup. Here, we investigate genomic geolocation of Populus trichocarpa, or poplar, which has been identified by the US Department of Energy as a fast-rotation biofuel crop to be harvested nationwide. In particular, we approach geolocation from a reference-free perspective, circumventing the need for compute-intensive processes such as variant calling and alignment. Our model, MashNet, predicts latitude and longitude for poplar trees from randomly-sampled, unaligned sequence fragments. We show that our model performs comparably to Locator, a state-of-the-art method based on aligned whole-genome sequence data. MashNet achieves an error of 34.0 km^2 compared to Locator's 22.1 km^2. MashNet allows growers to quickly and efficiently identify natural varieties that will be most productive in their growth environment based on genotype. This paper explores geolocation for precision agriculture while providing a framework and data source for further development by the machine learning community.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as that of the state-of-the-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. Finally, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.
    GDOD: Effective Gradient Descent using Orthogonal Decomposition for Multi-Task Learning. (arXiv:2301.13465v1 [cs.LG])
    Multi-task learning (MTL) aims at solving multiple related tasks simultaneously and has experienced rapid growth in recent years. However, MTL models often suffer from performance degeneration with negative transfer due to learning several tasks simultaneously. Some related work attributes the problem to conflicting gradients. In this case, useful gradient updates must be selected carefully for all tasks. To this end, we propose a novel optimization approach for MTL, named GDOD, which manipulates gradients of each task using an orthogonal basis decomposed from the span of all task gradients. GDOD decomposes gradients into task-shared and task-conflict components explicitly and adopts a general update rule for avoiding interference across all task gradients. This allows guiding the update directions depending on the task-shared components. Moreover, we prove the convergence of GDOD theoretically under both convex and non-convex assumptions. Experimental results on several multi-task datasets not only demonstrate that GDOD significantly improves existing MTL models but also show that our algorithm outperforms state-of-the-art optimization methods in terms of AUC and Logloss metrics.
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is key to tackling this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have led to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domain follow the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.
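    As a small illustration of why disagreement is useful under distribution shift, the sketch below computes it for a classification setting as the fraction of inputs on which two models predict different labels. This is an assumed, simplified definition (the paper's analysis is in random features regression and may define disagreement differently), and the prediction arrays are made up. Note that no ground-truth labels are needed.

```python
import numpy as np

def disagreement(pred_a, pred_b):
    """Fraction of inputs on which two models' predicted labels differ."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    return float(np.mean(pred_a != pred_b))

# Hypothetical label predictions of two models trained with different random seeds,
# evaluated on the same unlabeled inputs from a source and a target domain.
source_a, source_b = [0, 1, 1, 0, 1, 0, 0, 1], [0, 1, 1, 0, 1, 0, 1, 1]
target_a, target_b = [0, 1, 1, 0, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1, 0, 0]

print(disagreement(source_a, source_b))  # small on the source domain
print(disagreement(target_a, target_b))  # larger after the shift
```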
    Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance. (arXiv:2301.13376v1 [cs.LG])
    We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training using accumulator bit width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average 98.2% sparsity with an estimated compression rate of 46.5x all while maintaining 99.2% of the floating-point performance.
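    The link between weight magnitude and accumulator width can be illustrated with a standard worst-case argument, which is not necessarily the bound derived in the paper. Assuming integer weights and unsigned `input_bits`-bit inputs, the dot product is bounded in magnitude by the L1 norm of the weights times the largest input value, so a signed accumulator is overflow-free once its positive range covers that product. The function name and the example weights are hypothetical.

```python
def min_accumulator_bits(weights, input_bits):
    """Smallest signed accumulator width that cannot overflow for any unsigned input.

    Worst case: |sum_i w_i * x_i| <= ||w||_1 * (2**input_bits - 1).
    A signed P-bit register holds values up to 2**(P - 1) - 1 in magnitude
    (using the positive range as the conservative limit).
    """
    l1_norm = sum(abs(int(w)) for w in weights)
    worst_case = l1_norm * (2 ** input_bits - 1)
    bits = 1
    while 2 ** (bits - 1) - 1 < worst_case:
        bits += 1
    return bits

# Hypothetical integer weights of one output channel, with 8-bit unsigned inputs.
print(min_accumulator_bits([3, -2, 1], input_bits=8))
```

    Read in reverse, the same inequality says that fixing the accumulator width caps the L1 norm of the weights, which is consistent with the abstract's observation that the constraint also increases weight sparsity.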
    CRISP: Curriculum based Sequential Neural Decoders for Polar Code Family. (arXiv:2210.00313v2 [cs.IT] UPDATED)
    Polar codes are widely used state-of-the-art codes for reliable communication that have recently been included in the 5th generation wireless standards (5G). However, there remains room for the design of polar decoders that are both efficient and reliable in the short blocklength regime. Motivated by recent successes of data-driven channel decoders, we introduce a novel $\textbf{C}$ur$\textbf{RI}$culum based $\textbf{S}$equential neural decoder for $\textbf{P}$olar codes (CRISP). We design a principled curriculum, guided by information-theoretic insights, to train CRISP and show that it outperforms the successive-cancellation (SC) decoder and attains near-optimal reliability performance on the Polar(32,16) and Polar(64,22) codes. The choice of the proposed curriculum is critical in achieving the accuracy gains of CRISP, as we show by comparing against other curricula. More notably, CRISP can be readily extended to Polarization-Adjusted-Convolutional (PAC) codes, where existing SC decoders are significantly less reliable. To the best of our knowledge, CRISP constructs the first data-driven decoder for PAC codes and attains near-optimal performance on the PAC(32,16) code.
    Autobidders with Budget and ROI Constraints: Efficiency, Regret, and Pacing Dynamics. (arXiv:2301.13306v1 [cs.GT])
    We study a game between autobidding algorithms that compete in an online advertising platform. Each autobidder is tasked with maximizing its advertiser's total value over multiple rounds of a repeated auction, subject to budget and/or return-on-investment constraints. We propose a gradient-based learning algorithm that is guaranteed to satisfy all constraints and achieves vanishing individual regret. Our algorithm uses only bandit feedback and can be used with the first- or second-price auction, as well as with any "intermediate" auction format. Our main result is that when these autobidders play against each other, the resulting expected liquid welfare over all rounds is at least half of the expected optimal liquid welfare achieved by any allocation. This holds whether or not the bidding dynamics converge to an equilibrium and regardless of the correlation structure between advertiser valuations.
    GeneFormer: Learned Gene Compression using Transformer-based Context Modeling. (arXiv:2212.08379v3 [cs.LG] UPDATED)
    With the development of gene sequencing technology, gene data has grown explosively, and its storage has become an important issue. Traditional gene data compression methods rely on general-purpose software such as gzip, which fails to exploit the interrelations within nucleotide sequences. Recently, many researchers have begun to investigate deep-learning-based gene data compression methods. In this paper, we propose a transformer-based gene compression method named GeneFormer. Specifically, we first introduce a modified transformer structure to fully explore the nucleotide sequence dependency. Then, we propose fixed-length parallel grouping to accelerate the decoding speed of our autoregressive model. Experimental results on real-world datasets show that our method saves 29.7% bit rate compared with the state-of-the-art method, and the decoding speed is significantly faster than all existing learning-based gene compression methods.
    Time Series Forecasting via Semi-Asymmetric Convolutional Architecture with Global Atrous Sliding Window. (arXiv:2301.13691v1 [cs.AI])
    This paper addresses the problem of time series forecasting. Although some carefully designed models achieve excellent prediction performance, how to extract more useful information and make accurate predictions is still an open issue. Most modern models focus only on a short range of information, which is a serious limitation for time series forecasting, where long-term characteristics must be captured. The main concern of this work is therefore to further mine the relationship between the local and global information contained in time series to produce more precise predictions. To this end, we make three main contributions that are experimentally verified to have performance advantages. First, the original time series is transformed into a difference sequence, which serves as input to the proposed model. Second, we introduce the global atrous sliding window into the forecasting model, which draws on the concept of fuzzy time series to associate relevant global information with temporal data within a time period and uses a central-bidirectional atrous algorithm to capture underlying related features, ensuring the validity and consistency of the captured data. Third, a variation of the widely used asymmetric convolution, called semi-asymmetric convolution, is devised to more flexibly extract relationships between adjacent elements and the corresponding associated global features, with adjustable convolution ranges in the vertical and horizontal directions. The proposed model achieves state-of-the-art results on most of the time series datasets considered, compared with competitive modern models.
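    The first contribution, feeding a difference sequence to the model instead of the raw series, can be sketched as follows. This assumes plain first-order differencing; the abstract does not state the order or any scaling, and the helper names are hypothetical.

```python
def difference(series):
    """First-order difference: d[t] = x[t + 1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(first_value, diffs):
    """Invert first-order differencing given the first original value."""
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


raw = [3, 5, 4, 8]  # hypothetical observations
diffs = difference(raw)
print(diffs)
print(undifference(raw[0], diffs))
```

    Forecasts made in the differenced space can be mapped back to the original scale with the inverse step, as long as the last observed value is kept.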
    Grounding Language Models to Images for Multimodal Generation. (arXiv:2301.13823v1 [cs.CL])
    We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
    Active Learning-based Domain Adaptive Localized Polynomial Chaos Expansion. (arXiv:2301.13635v1 [cs.LG])
    The paper presents a novel methodology to build surrogate models of complicated functions by an active learning-based sequential decomposition of the input random space and construction of localized polynomial chaos expansions, referred to as domain adaptive localized polynomial chaos expansion (DAL-PCE). The approach utilizes sequential decomposition of the input random space into smaller sub-domains approximated by low-order polynomial expansions. This allows approximation of functions with strong nonlinearities, discontinuities, and/or singularities. Decomposition of the input random space and local approximations alleviate the Gibbs phenomenon for these types of problems and confine error to a very small vicinity near the non-linearity. The global behavior of the surrogate model is therefore significantly better than existing methods as shown in numerical examples. The whole process is driven by an active learning routine that uses the recently proposed $\Theta$ criterion to assess local variance contributions. The proposed approach balances both \emph{exploitation} of the surrogate model and \emph{exploration} of the input random space and thus leads to efficient and accurate approximation of the original mathematical model. The numerical results show the superiority of the DAL-PCE in comparison to (i) a single global polynomial chaos expansion and (ii) the recently proposed stochastic spectral embedding (SSE) method, which was developed as an accurate surrogate model and is based on a similar domain decomposition process. This method represents a general framework upon which further extensions and refinements can be based, and which can be combined with any technique for non-intrusive polynomial chaos expansion construction.
    Enhancing Hyper-To-Real Space Projections Through Euclidean Norm Meta-Heuristic Optimization. (arXiv:2301.13671v1 [cs.LG])
    The continuous growth in computational power over the last decades has made solving several optimization problems significant to humankind a tractable task; however, tackling some of them remains a challenge due to the overwhelming number of candidate solutions to be evaluated, even when using sophisticated algorithms. In such a context, a set of nature-inspired stochastic methods, called meta-heuristic optimization, can provide robust approximate solutions to different kinds of problems with a small computational burden, such as derivative-free real function optimization. Nevertheless, these methods may converge to inadequate solutions if the function landscape is too harsh, e.g., enclosing too many local optima. Previous works addressed this issue by employing a hypercomplex representation of the search space, like quaternions, where the landscape becomes smoother and supposedly easier to optimize. Under this approach, meta-heuristic computations happen in the hypercomplex space, whereas variables are mapped back to the real domain before function evaluation. Despite this latter operation being performed by the Euclidean norm, we have found that after the optimization procedure has finished, it is usually possible to obtain even better solutions by employing the Minkowski $p$-norm instead and fine-tuning $p$ through an auxiliary sub-problem with negligible additional cost and no hyperparameters. Such behavior was observed in eight well-established benchmarking functions, thus fostering a new research direction for hypercomplex meta-heuristic optimization.
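    The projection step described in this abstract maps each hypercomplex decision variable to a single real value through a norm. A minimal sketch of the Minkowski $p$-norm, of which the Euclidean norm is the $p = 2$ case, is given below; the quaternion components are hypothetical, and the auxiliary procedure for tuning $p$ is not reproduced.

```python
def minkowski_norm(components, p):
    """Minkowski p-norm: (sum_i |c_i|**p) ** (1 / p), for p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(c) ** p for c in components) ** (1.0 / p)


# One decision variable represented by a hypothetical quaternion (4 components).
quaternion = [3.0, 4.0, 0.0, 0.0]

euclidean = minkowski_norm(quaternion, p=2)  # the usual projection
manhattan = minkowski_norm(quaternion, p=1)  # one alternative value of p
print(euclidean, manhattan)
```

    Varying `p` changes the real value that the same hypercomplex point maps to, which is the degree of freedom the abstract proposes to fine-tune after optimization.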
    Complete Neural Networks for Euclidean Graphs. (arXiv:2301.13821v1 [cs.LG])
    We propose a 2-WL-like geometric graph isomorphism test and prove it is complete when applied to Euclidean Graphs in $\mathbb{R}^3$. We then use recent results on multiset embeddings to devise an efficient geometric GNN model with equivalent separation power. We verify empirically that our GNN model is able to separate particularly challenging synthetic examples, and demonstrate its usefulness for a chemical property prediction problem.
    Archetypal Analysis++: Rethinking the Initialization Strategy. (arXiv:2301.13748v1 [cs.LG])
    Archetypal analysis is a matrix factorization method with convexity constraints. Due to local minima, a good initialization is essential. Frequently used initialization methods either yield sub-optimal starting points or are prone to getting stuck in poor local minima. In this paper, we propose archetypal analysis++ (AA++), a probabilistic initialization strategy for archetypal analysis that sequentially samples points based on their influence on the objective, similar to $k$-means++. In fact, we argue that $k$-means++ already approximates the proposed initialization method. Furthermore, we suggest adapting an efficient Monte Carlo approximation of $k$-means++ to AA++. In an extensive empirical evaluation of 13 real-world data sets of varying sizes and dimensionalities and considering two pre-processing strategies, we show that AA++ almost consistently outperforms all baselines, including the most frequently used ones.
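    Since the abstract positions AA++ as analogous to $k$-means++, a sketch of standard $k$-means++ seeding may help as a reference point. This is plain $k$-means++ (sampling proportional to squared distance to the nearest chosen seed), not the AA++ sampling rule, whose exact weights are not given in the abstract. Function and variable names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Standard k-means++ seeding over a list of coordinate tuples."""
    seeds = [rng.choice(points)]  # first seed: uniform at random
    while len(seeds) < k:
        # Weight each point by its squared distance to the closest seed so far.
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed; fall back to uniform sampling.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(points, k=2, rng=random.Random(0))
print(seeds)
```

    Points already chosen have zero weight, so distinct data points are never selected twice unless all remaining weights vanish.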
    Deep learning-based lung segmentation and automatic regional template in chest X-ray images for pediatric tuberculosis. (arXiv:2301.13786v1 [eess.IV])
    Tuberculosis (TB) is still considered a leading cause of death and a substantial threat to global child health. Both TB infection and disease are curable using antibiotics. However, most children who die of TB are never diagnosed or treated. In clinical practice, experienced physicians assess TB by examining chest X-rays (CXR). Pediatric CXR has specific challenges compared to adult CXR, which makes TB diagnosis in children more difficult. Computer-aided diagnosis systems supported by Artificial Intelligence have shown performance comparable to experienced radiologist TB readings, which could ease mass TB screening and reduce clinical burden. We propose a multi-view deep learning-based solution which, by following a proposed template, aims to automatically regionalize and extract lung and mediastinal regions of interest from pediatric CXR images where key TB findings may be present. Experimental results have shown accurate region extraction, which can be used for further analysis to confirm TB finding presence and severity assessment. Code publicly available at https://github.com/dani-capellan/pTB_LungRegionExtractor.
    An Efficient Solution to s-Rectangular Robust Markov Decision Processes. (arXiv:2301.13642v1 [cs.LG])
    We present an efficient robust value iteration for \texttt{s}-rectangular robust Markov Decision Processes (MDPs) with a time complexity comparable to standard (non-robust) MDPs which is significantly faster than any existing method. We do so by deriving the optimal robust Bellman operator in concrete forms using our $L_p$ water filling lemma. We unveil the exact form of the optimal policies, which turn out to be novel threshold policies with the probability of playing an action proportional to its advantage.
    Improved distinct bone segmentation in upper-body CT through multi-resolution networks. (arXiv:2301.13674v1 [eess.IV])
    Purpose: Automated distinct bone segmentation from CT scans is widely used in planning and navigation workflows. U-Net variants are known to provide excellent results in supervised semantic segmentation. However, in distinct bone segmentation from upper body CTs a large field of view and a computationally taxing 3D architecture are required. This leads to low-resolution results lacking detail or localisation errors due to missing spatial context when using high-resolution inputs. Methods: We propose to solve this problem by using end-to-end trainable segmentation networks that combine several 3D U-Nets working at different resolutions. Our approach, which extends and generalizes HookNet and MRN, captures spatial information at a lower resolution and skips the encoded information to the target network, which operates on smaller high-resolution inputs. We evaluated our proposed architecture against single resolution networks and performed an ablation study on information concatenation and the number of context networks. Results: Our proposed best network achieves a median DSC of 0.86 taken over all 125 segmented bone classes and reduces the confusion among similar-looking bones in different locations. These results outperform our previously published 3D U-Net baseline results on the task and distinct-bone segmentation results reported by other groups. Conclusion: The presented multi-resolution 3D U-Nets address current shortcomings in bone segmentation from upper-body CT scans by allowing for capturing a larger field of view while avoiding the cubic growth of the input pixels and intermediate computations that quickly outgrow the computational capacities in 3D. The approach thus improves the accuracy and efficiency of distinct bone segmentation from upper-body CT.
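    The reported metric, the Dice similarity coefficient (DSC), is defined for a predicted mask $A$ and a reference mask $B$ as $2|A \cap B| / (|A| + |B|)$. A minimal sketch for flattened binary masks follows; the masks are toy values, and returning 1.0 when both masks are empty is a convention assumed here rather than something the abstract specifies.

```python
def dice_coefficient(pred_mask, target_mask):
    """Dice similarity coefficient for two equally sized binary masks."""
    if len(pred_mask) != len(target_mask):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred_mask, target_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in target_mask if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical flattened masks for one bone class.
pred = [1, 1, 0, 0]
target = [1, 0, 1, 0]
print(dice_coefficient(pred, target))
```

    For a multi-class task such as the 125 bone classes mentioned above, this score would be computed per class and then summarized, for example by the median.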
    Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. (arXiv:2301.13826v1 [cs.CV])
    Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.
    Anti-Exploration by Random Network Distillation. (arXiv:2301.13616v1 [cs.LG])
    Despite the success of Random Network Distillation (RND) in various domains, it was shown to be not discriminative enough to be used as an uncertainty estimator for penalizing out-of-distribution actions in offline reinforcement learning. In this paper, we revisit these results and show that, with a naive choice of conditioning for the RND prior, it becomes infeasible for the actor to effectively minimize the anti-exploration bonus and discriminativity is not an issue. We show that this limitation can be avoided with conditioning based on Feature-wise Linear Modulation (FiLM), resulting in a simple and efficient ensemble-free algorithm based on Soft Actor-Critic. We evaluate it on the D4RL benchmark, showing that it is capable of achieving performance comparable to ensemble-based methods and outperforming ensemble-free approaches by a wide margin.
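    FiLM conditioning applies a per-feature scale and shift to hidden activations, with both produced from a conditioning input. The sketch below shows only that core operation; how the scale and shift are generated, and where the layer sits inside the RND networks, are not described in the abstract, so the values used here are hypothetical.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: out_i = gamma_i * features_i + beta_i."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


hidden = [1.0, 2.0]  # hypothetical hidden activations
gamma = [2.0, 0.5]   # scale, normally predicted from the conditioning input
beta = [0.0, 1.0]    # shift, normally predicted from the conditioning input
print(film(hidden, gamma, beta))
```

    In contrast to concatenating the conditioning input to the features, this form lets the conditioning signal rescale every feature multiplicatively.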
    UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. (arXiv:2301.13741v1 [cs.CV])
    Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have attracted the attention of researchers to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes the \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; 2) progressively searching and retraining the subnet, which maintains convergence between the search and retrain to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Caption, Visual Question Answer, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
    Semi-Supervised Classification with Graph Convolutional Kernel Machines. (arXiv:2301.13764v1 [cs.LG])
    We present a deep Graph Convolutional Kernel Machine (GCKM) for semi-supervised node classification in graphs. First, we introduce an unsupervised kernel machine propagating the node features in a one-hop neighbourhood. Then, we specify a semi-supervised classification kernel machine through the lens of the Fenchel-Young inequality. The deep graph convolutional kernel machine is obtained by stacking multiple shallow kernel machines. After showing that the unsupervised and semi-supervised layers correspond to an eigenvalue problem and a linear system on the aggregated node features, respectively, we derive an efficient end-to-end training algorithm in the dual variables. Numerical experiments demonstrate that our approach is competitive with state-of-the-art graph neural networks for homophilous and heterophilous benchmark datasets. Notably, GCKM achieves superior performance when very few labels are available.
    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning. (arXiv:2301.13703v1 [cs.LG])
    Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls if the network is `lazy' ($\alpha\gg 1$) or instead learns features ($\alpha\ll 1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that key dynamical quantities (including the total variations of weights during training) depend on both $T$ and $P$ as power laws, and the characteristic temperature $T_c$, where the noise of SGD starts affecting performance, is a power law of $P$. These observations indicate that a key effect of SGD noise occurs late in training, by affecting the stopping process whereby all data are fitted. We argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. The same effect occurs at larger training set $P$. We confirm this view in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
    Alternating Updates for Efficient Transformers. (arXiv:2301.13310v1 [cs.LG])
    It is well established that increasing scale in deep transformer networks leads to improved quality and performance. This increase in scale often comes with an increase in compute cost and inference latency. Consequently, research into methods which help realize the benefits of increased scale without leading to an increase in the compute cost becomes important. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation without increasing the computation time by working on a subblock of the representation at each layer. Our experiments on various transformer models and language tasks demonstrate the consistent effectiveness of alternating updates on a diverse set of benchmarks. Finally, we present extensions of AltUp to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Improved Algorithms for Multi-period Multi-class Packing Problems with Bandit Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem (LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new, more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon $T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A Data-Driven Modeling and Control Framework for Physics-Based Building Emulators. (arXiv:2301.13447v1 [eess.SY])
    We present a data-driven modeling and control framework for physics-based building emulators. Our approach comprises: (a) Offline training of differentiable surrogate models that speed up model evaluations, provide cheap gradients, and have good predictive accuracy for the receding horizon in Model Predictive Control (MPC) and (b) Formulating and solving nonlinear building HVAC MPC problems. We extensively verify the modeling and control performance using multiple surrogate models and optimization frameworks for different available test cases in the Building Optimization Testing Framework (BOPTEST). The framework is compatible with other modeling techniques and customizable with different control formulations. The modularity makes the approach future-proof for test cases currently in development for physics-based building emulators and provides a path toward prototyping predictive controllers in large buildings.
    Sport Task: Fine Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2022. (arXiv:2301.13576v1 [cs.AI])
    Sports video analysis is a widespread research topic. Its applications are very diverse, such as event detection during a match, video summarization, or fine-grained movement analysis of athletes. As part of the MediaEval 2022 benchmarking initiative, this task aims at detecting and classifying subtle movements from sport videos. We focus on recordings of table tennis matches. Conducted since 2019, this task provides a classification challenge from untrimmed videos recorded under natural conditions with known temporal boundaries for each stroke. Since 2021, the task also provides a stroke detection challenge from unannotated, untrimmed videos. This year, the training, validation, and test sets are enhanced to ensure that all strokes are represented in each dataset. The dataset is now similar to the one used in [1, 2]. This research is intended to build tools for coaches and athletes who want to further evaluate their sport performances.
    Execution-based Code Generation using Deep Reinforcement Learning. (arXiv:2301.13816v1 [cs.LG])
    The utilization of programming language (PL) models, pretrained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting specific sequence-level features of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that combines pretrained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning and employs execution feedback as the external source of knowledge into the model optimization. PPOCoder is transferable across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, improving the success rate of compilation and functional correctness over different PLs. Our code can be found at https://github.com/reddy-lab-code-research/PPOCoder .
    Causal-Discovery Performance of ChatGPT in the context of Neuropathic Pain Diagnosis. (arXiv:2301.13819v1 [cs.CL])
    ChatGPT has demonstrated exceptional proficiency in natural language conversation, e.g., it can answer a wide range of questions while no previous large language models can. Thus, we would like to push its limit and explore its ability to answer causal discovery questions by using a medical benchmark (Tu et al. 2019) in causal discovery.
    A Bayesian Generative Adversarial Network (GAN) to Generate Synthetic Time-Series Data, Application in Combined Sewer Flow Prediction. (arXiv:2301.13733v1 [cs.LG])
    Despite various breakthroughs in machine learning and data analysis techniques for improving smart operation and management of urban water infrastructures, some key limitations obstruct this progress. Among these shortcomings, the absence of freely available data due to data privacy or high costs of data gathering and the lack of adequate rare or extreme events in the available data play a crucial role. Here, Generative Adversarial Networks (GANs) can help overcome these challenges. In machine learning, generative models are a class of methods capable of learning a data distribution to generate artificial data. In this study, we developed a GAN model to generate synthetic time series to balance our limited recorded time series data and improve the accuracy of a data-driven model for combined sewer flow prediction. We considered the sewer system of a small town in Germany as the test case. Precipitation and inflow to the storage tanks are used for the data-driven model development. The aim is to predict the flow using precipitation data and examine the impact of data augmentation using synthetic data on model performance. Results show that the GAN can successfully generate synthetic time series from the real data distribution, which helps produce more accurate peak flow predictions. However, the model without data augmentation works better for dry weather prediction. Therefore, an ensemble model is suggested to combine the advantages of both models.
    Improving Monte Carlo Evaluation with Offline Data. (arXiv:2301.13734v1 [cs.LG])
    Monte Carlo (MC) methods are the most widely used methods to estimate the performance of a policy. Given a policy of interest, MC methods give estimates by repeatedly running this policy to collect samples and taking the average of the outcomes. Samples collected during this process are called online samples. To get an accurate estimate, MC methods consume a large number of online samples. When online samples are expensive, e.g., in online recommendations and inventory management, we want to reduce the number of online samples while achieving the same estimate accuracy. To this end, we use off-policy MC methods that evaluate the policy of interest by running a different policy called the behavior policy. We design a tailored behavior policy such that the variance of the off-policy MC estimator is provably smaller than that of the ordinary MC estimator. Importantly, this tailored behavior policy can be efficiently learned from existing offline data, i.e., previously logged data, which are much cheaper than online samples. With reduced variance, our off-policy MC method requires fewer online samples to evaluate the performance of a policy compared with the ordinary MC method. Moreover, our off-policy MC estimator is always unbiased.
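The off-policy idea in the abstract above can be illustrated with textbook importance sampling on a toy one-step problem. This is only a sketch under stated assumptions, not the paper's method: the three-action bandit, its deterministic rewards, and the behavior policy proportional to `target[a] * rewards[a]` are all hypothetical choices made for illustration (the paper learns its behavior policy from offline data).

```python
import random

def off_policy_mc(target, behavior, rewards, n, rng):
    """Estimate the target policy's expected reward from behavior-policy samples."""
    actions = list(range(len(target)))
    total = 0.0
    for _ in range(n):
        a = rng.choices(actions, weights=behavior)[0]
        # Importance weight corrects for sampling from the wrong policy.
        total += (target[a] / behavior[a]) * rewards[a]
    return total / n

def per_sample_variance(target, behavior, rewards):
    """Exact variance of one importance-weighted sample (deterministic rewards)."""
    mean = sum(p * r for p, r in zip(target, rewards))
    second = sum(b * (p / b * r) ** 2 for p, b, r in zip(target, behavior, rewards))
    return second - mean ** 2

# Hypothetical toy problem: three actions with fixed positive rewards.
target = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 5.0]
true_value = sum(p * r for p, r in zip(target, rewards))

# Assumed behavior policy: proportional to target[a] * rewards[a].
tailored = [p * r / true_value for p, r in zip(target, rewards)]

var_on_policy = per_sample_variance(target, target, rewards)
var_tailored = per_sample_variance(target, tailored, rewards)
estimate = off_policy_mc(target, tailored, rewards, n=100, rng=random.Random(0))
```

In this toy case every importance-weighted sample equals the true value, so the tailored estimator stays unbiased while its variance drops to zero. That only happens because the rewards are deterministic and positive; in general the reduction is partial.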
    A Survey of Explainable AI in Deep Visual Modeling: Methods and Metrics. (arXiv:2301.13445v1 [cs.CV])
    Deep visual models have widespread applications in high-stakes domains. Hence, their black-box nature is currently attracting considerable interest from the research community. We present the first survey in Explainable AI that focuses on the methods and metrics for interpreting deep visual models. Covering the landmark contributions along with the state of the art, we not only provide a taxonomic organization of the existing techniques, but also excavate a range of evaluation metrics and collate them as measures of different properties of model explanations. Alongside a discussion of current trends, we also discuss the challenges and future avenues for this research direction.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques have received attention, as the learned distance representations are useful for capturing the similarity relationship among samples and further improve the performance of various supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to previous metric learning methods, which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn a better distance representation with a hybrid approach. To achieve this, we propose a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Convolutional autoencoder for the spatiotemporal latent representation of turbulence. (arXiv:2301.13728v1 [physics.flu-dyn])
    Turbulence is characterised by chaotic dynamics and a high-dimensional state space, which make the phenomenon challenging to predict. However, turbulent flows are often characterised by coherent spatiotemporal structures, such as vortices or large-scale modes, which can help obtain a latent description of turbulent flows. Current approaches, however, are often limited by either the need to use some form of thresholding on quantities defining the isosurfaces to which the flow structures are associated or the linearity of traditional modal flow decomposition approaches, such as those based on proper orthogonal decomposition. This problem is exacerbated in flows that exhibit extreme events, which are rare and sudden changes in a turbulent state. The goal of this paper is to obtain an efficient and accurate reduced-order latent representation of a turbulent flow that exhibits extreme events. Specifically, we employ a three-dimensional multiscale convolutional autoencoder (CAE) to obtain such a latent representation. We apply it to a three-dimensional turbulent flow. We show that the multiscale CAE is efficient, requiring less than 10% of the degrees of freedom needed by proper orthogonal decomposition to compress the data, and is able to accurately reconstruct flow states related to extreme events. The proposed deep learning architecture opens opportunities for nonlinear reduced-order modeling of turbulent flows from data.
    Skill Decision Transformer. (arXiv:2301.13573v1 [cs.LG])
    Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem (Chen et al., 2021; Janner et al., 2021). However, many of these methods only optimize for high returns, and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) (Furuta et al., 2021) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling (Andrychowicz et al., 2017) and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can also discover descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark. The code and videos can be found on our project page: https://github.com/shyamsn97/skill-dt
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to the low-rank setting and are shown to follow from multiplicative scaling. The principal second-order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank-to-width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.
    Mathematical Capabilities of ChatGPT. (arXiv:2301.13867v1 [cs.LG])
    We investigate the mathematical capabilities of ChatGPT by testing it on publicly available datasets, as well as hand-crafted ones, and measuring its performance against other models trained on a mathematical corpus, such as Minerva. We also test whether ChatGPT can be a useful assistant to professional mathematicians by emulating various use cases that come up in the daily professional activities of mathematicians (question answering, theorem searching). In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, only cover elementary mathematics. We address this issue by introducing a new dataset: GHOSTS. It is the first natural-language dataset made and curated by working researchers in mathematics that (1) aims to cover graduate-level mathematics and (2) provides a holistic overview of the mathematical capabilities of language models. We benchmark ChatGPT on GHOSTS and evaluate performance against fine-grained criteria. We make this new dataset publicly available to assist a community-driven comparison of ChatGPT with (future) large language models in terms of advanced mathematical comprehension. We conclude that contrary to many positive reports in the media (a potential case of selection bias), ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Our results show that ChatGPT often understands the question but fails to provide correct solutions. Hence, if your goal is to use it to pass a university exam, you would be better off copying from your average peer!
    Large Music Recommendation Studies for Small Teams. (arXiv:2301.13388v1 [cs.HC])
    Running live music recommendation studies without direct industry partnerships can be a prohibitively daunting task, especially for small teams. In order to help future researchers interested in such evaluations, we present a number of struggles we faced in the process of generating our own such evaluation system alongside potential solutions. These problems span the topics of users, data, computation, and application architecture.
    Few-Shot Image-to-Semantics Translation for Policy Transfer in Reinforcement Learning. (arXiv:2301.13343v1 [cs.LG])
    We investigate policy transfer using image-to-semantics translation to mitigate learning difficulties in vision-based robotics control agents. This problem assumes two environments: a simulator environment with semantics, that is, low-dimensional and essential information, as the state space, and a real-world environment with images as the state space. By learning mapping from images to semantics, we can transfer a policy, pre-trained in the simulator, to the real world, thereby eliminating real-world on-policy agent interactions to learn, which are costly and risky. In addition, using image-to-semantics mapping is advantageous in terms of the computational efficiency to train the policy and the interpretability of the obtained policy over other types of sim-to-real transfer strategies. To tackle the main difficulty in learning image-to-semantics mapping, namely the human annotation cost for producing a training dataset, we propose two techniques: pair augmentation with the transition function in the simulator environment and active learning. We observed a reduction in the annotation cost without a decline in the performance of the transfer, and the proposed approach outperformed the existing approach without annotation.
    A Scalable, Interpretable, Verifiable & Differentiable Logic Gate Convolutional Neural Network Architecture From Truth Tables. (arXiv:2208.08609v2 [cs.AI] UPDATED)
    We propose $\mathcal{T}$ruth $\mathcal{T}$able net ($\mathcal{TT}$net), a novel Convolutional Neural Network (CNN) architecture that addresses, by design, the open challenges of interpretability, formal verification, and logic gate conversion. $\mathcal{TT}$net is built using CNNs' filters that are equivalent to tractable truth tables and that we call Learning Truth Table (LTT) blocks. The dual form of LTT blocks allows the truth tables to be easily trained with gradient descent and makes these CNNs easy to interpret, verify and infer. Specifically, $\mathcal{TT}$net is a deep CNN model that can be automatically represented, after post-training transformation, as a sum of Boolean decision trees, or as a sum of Disjunctive/Conjunctive Normal Form (DNF/CNF) formulas, or as a compact Boolean logic circuit. We demonstrate the effectiveness and scalability of $\mathcal{TT}$net on multiple datasets, showing comparable interpretability to decision trees, fast complete/sound formal verification, and scalable logic gate representation, all compared to state-of-the-art methods. We believe this work represents a step towards making CNNs more transparent and trustworthy for real-world critical applications.
    Self-Compressing Neural Networks. (arXiv:2301.13142v2 [cs.LG] UPDATED)
    This work focuses on reducing neural network size, which is a major driver of neural network execution time, power consumption, bandwidth, and memory footprint. A key challenge is to reduce size in a manner that can be exploited readily for efficient training and inference without the need for specialized hardware. We propose Self-Compression: a simple, general method that simultaneously achieves two goals: (1) removing redundant weights, and (2) reducing the number of bits required to represent the remaining weights. This is achieved using a generalized loss function to minimize overall network size. In our experiments we demonstrate floating point accuracy with as few as 3% of the bits and 18% of the weights remaining in the network.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Clustering the Sketch: A Novel Approach to Embedding Table Compression. (arXiv:2210.05974v2 [cs.LG] UPDATED)
    Embedding tables are used by machine learning systems to work with categorical features. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. The best previous methods for table compression are so-called "post training" quantization schemes such as "product" and "residual" quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords". Unfortunately, clustering requires prior knowledge of the table to be compressed, which limits the memory savings to inference time and not training time. Hence, recent work, like the QR method (Shi et al., 2020), has used random references (linear sketching), which can be computed with hash functions before training. Unfortunately, the compression achieved is inferior to that achieved by post-training quantization. The new algorithm, CQR, shows how to get the best of both worlds by combining clustering and sketching: First, IDs are randomly assigned to a codebook and codewords are trained (end to end) for an epoch. Next, we expand the codebook and apply clustering to reduce the size again. Finally, we add new random references and continue training. We show experimentally compression rates close to those of post-training quantization with the training time memory reductions of sketch-based methods, and we prove that our method always converges to the optimal embedding table for least-squares training.
    The Fair Value of Data Under Heterogeneous Privacy Constraints. (arXiv:2301.13336v1 [cs.LG])
    Modern data aggregation often takes the form of a platform collecting data from a network of users. More than ever, these users are now requesting that the data they provide is protected with a guarantee of privacy. This has led to the study of optimal data acquisition frameworks, where the optimality criterion is typically the maximization of utility for the agent trying to acquire the data. This involves determining how to allocate payments to users for the purchase of their data at various privacy levels. The main goal of this paper is to characterize a fair amount to pay users for their data at a given privacy level. We propose an axiomatic definition of fairness, analogous to the celebrated Shapley value. Two concepts for fairness are introduced. The first treats the platform and users as members of a common coalition and provides a complete description of how to divide the utility among the platform and users. In the second concept, fairness is defined only among users, leading to a potential fairness-constrained mechanism design problem for the platform. We consider explicit examples involving private heterogeneous data and show how these notions of fairness can be applied. To the best of our knowledge, these are the first fairness concepts for data that explicitly consider privacy constraints.
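For readers unfamiliar with the Shapley value that the abstract above uses as its analogy, the classical definition averages each player's marginal contribution over all orderings of the players. The sketch below computes only that classical value, not the paper's privacy-aware fairness notions; the user names, the `quality` weights, and the square-root utility are hypothetical.

```python
from itertools import permutations
from math import sqrt

def shapley_values(players, utility):
    """Classical Shapley value: average marginal contribution over all orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}

# Hypothetical data-quality weights for three users.
quality = {"a": 1.0, "b": 1.0, "c": 4.0}

def utility(coalition):
    """Assumed utility with diminishing returns in the pooled data quality."""
    return sqrt(sum(quality[p] for p in coalition))

shares = shapley_values(list(quality), utility)
```

Two standard axioms are visible in the result: the shares sum to the utility of the full coalition (efficiency), and the interchangeable users `a` and `b` receive equal shares (symmetry). Enumerating all orderings costs factorial time, so this form is practical only for a handful of players.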
    Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference. (arXiv:2301.13330v1 [cs.LG])
    For effective and efficient deep neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. It is generally desirable to quantize as aggressively as possible without incurring significant accuracy degradation. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers of a network to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50 and ResNet-101 classification networks, demonstrating improved performance across the entire accuracy-throughput frontier, and equivalent performance for the PSPNet segmentation network in our own commensurate comparison over leading mixed precision layer selection techniques, while requiring orders of magnitude less compute time to reach a solution.
    Automated Sentiment and Hate Speech Analysis of Facebook Data by Employing Multilingual Transformer Models. (arXiv:2301.13668v1 [cs.CL])
    In recent years, there has been a heightened consensus within academia and in the public discourse that Social Media Platforms (SMPs) amplify the spread of hateful and negative sentiment content. Researchers have identified how hateful content, political propaganda, and targeted messaging contributed to real-world harms including insurrections against democratically elected governments, genocide, and breakdown of social cohesion due to heightened negative discourse towards certain communities in parts of the world. To counter these issues, SMPs have created semi-automated systems that can help identify toxic speech. In this paper we analyse the statistical distribution of hateful and negative sentiment contents within a representative Facebook dataset (n = 604,703) scraped from 648 public Facebook pages which identify themselves as proponents (and followers) of far-right Hindutva actors. These pages were identified manually using keyword searches on Facebook and on CrowdTangle and classified as far-right Hindutva pages based on page names, page descriptions, and discourses shared on these pages. We employ state-of-the-art, open-source XLM-T multilingual transformer-based language models to perform sentiment and hate speech analysis of the textual contents shared on these pages over a period of 5.5 years. The results show the statistical distributions of the predicted sentiment and hate speech labels, top actors, and top page categories. We further discuss the benchmark performances and limitations of these pre-trained language models.
    Quantifying and Managing Impacts of Concept Drifts on IoT Traffic Inference in Residential ISP Networks. (arXiv:2301.06695v2 [cs.LG] UPDATED)
    Millions of vulnerable consumer IoT devices in home networks are the enabler for cyber crimes putting user privacy and Internet security at risk. Internet service providers (ISPs) are best poised to play key roles in mitigating risks by automatically inferring active IoT devices per household and notifying users of vulnerable ones. Developing a scalable inference method that can perform robustly across thousands of home networks is a non-trivial task. This paper focuses on the challenges of developing and applying data-driven inference models when labeled data of device behaviors is limited and the distribution of data changes (concept drift) across time and space domains. Our contributions are three-fold: (1) We collect and analyze network traffic of 24 types of consumer IoT devices from 12 real homes over six weeks to highlight the challenge of temporal and spatial concept drifts in network behavior of IoT devices; (2) We analyze the performance of two inference strategies, namely "global inference" (a model trained on a combined set of all labeled data from training homes) and "contextualized inference" (several models each trained on the labeled data from a training home) in the presence of concept drifts; and (3) To manage concept drifts, we develop a method that dynamically applies the "closest" model (from a set) to network traffic of unseen homes during the testing phase, yielding better performance in 20% of scenarios.
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There is a wide range of proposals for mitigating this problem by learning representations that are "invariant" in some sense. However, these methods generally contradict each other, and none of them consistently improves performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of "invariance" relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance. With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.
    Large Language Models Are Implicitly Topic Models: Explaining and Finding Good Demonstrations for In-Context Learning. (arXiv:2301.11916v1 [cs.CL] CROSS LISTED)
    In recent years, pre-trained large language models have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. The underlying mechanisms by which this capability arises from regular language model pretraining objectives remain poorly understood. In this study, we aim to examine the in-context learning phenomenon through a Bayesian lens, viewing large language models as topic models that implicitly infer task-related information from demonstrations. On this premise, we propose an algorithm for selecting optimal demonstrations from a set of annotated data and demonstrate a significant 12.5% improvement relative to the random selection baseline, averaged over eight GPT2 and GPT3 models on eight different real-world text classification datasets. Our empirical findings support our hypothesis that large language models implicitly infer a latent concept variable.
    Single-Loop Switching Subgradient Methods for Non-Smooth Weakly Convex Optimization with Non-Smooth Convex Constraints. (arXiv:2301.13314v1 [math.OC])
    In this paper, we consider a general non-convex constrained optimization problem, where the objective function is weakly convex and the constraint function is convex, while both can be non-smooth. This class of problems arises from many applications in machine learning such as fairness-aware supervised learning. To solve this problem, we consider the classical switching subgradient method by Polyak (1965), which is an intuitive and easily implementable first-order method. Before this work, its iteration complexity was only known for convex optimization. We prove its oracle complexity for finding a nearly stationary point when the objective function is non-convex. The analysis is derived separately for the cases where the constraint function is deterministic and stochastic. Compared to existing methods, especially the double-loop methods, the switching subgradient method can be applied to non-smooth problems and only has a single loop, which saves the effort of tuning the number of inner iterations.
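The single-loop structure mentioned in the abstract above can be sketched in a few lines: at each iterate, step along a subgradient of the objective when the constraint is nearly satisfied, and along a subgradient of the constraint otherwise. This is an illustrative sketch under assumptions, not the paper's analyzed variant: the one-dimensional problem is convex (the paper treats weakly convex objectives), and the constant step size, tolerance `eps`, and best-feasible-iterate output are hypothetical choices.

```python
def sign(v):
    """Return -1, 0, or 1; used as a valid subgradient of abs()."""
    return (v > 0) - (v < 0)

def switching_subgradient(f, f_sub, g, g_sub, x0, step, eps, iters):
    """Minimize f(x) subject to g(x) <= 0 with one loop and one switch per step."""
    x = x0
    best_x, best_f = None, float("inf")
    for _ in range(iters):
        if g(x) <= eps:
            # Nearly feasible: record the iterate, then decrease the objective.
            if f(x) < best_f:
                best_x, best_f = x, f(x)
            x = x - step * f_sub(x)
        else:
            # Infeasible: decrease the constraint violation instead.
            x = x - step * g_sub(x)
    return best_x

# Toy non-smooth problem: minimize |x - 3| subject to |x| - 1 <= 0.
f = lambda x: abs(x - 3.0)
f_sub = lambda x: float(sign(x - 3.0))
g = lambda x: abs(x) - 1.0
g_sub = lambda x: float(sign(x))

x_hat = switching_subgradient(f, f_sub, g, g_sub, x0=0.0, step=0.01, eps=0.01, iters=1000)
```

On this toy problem the iterates climb toward the constraint boundary at x = 1 and then oscillate around it, so the returned point is feasible up to `eps` and close to the constrained minimizer. There is no inner loop whose length needs tuning.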
    Conversational Automated Program Repair. (arXiv:2301.13246v1 [cs.SE])
    Automated Program Repair (APR) can help developers automatically generate patches for bugs. Due to the impressive performance obtained using Large Pre-Trained Language Models (LLMs) on many code-related tasks, researchers have started to directly use LLMs for APR. However, prior approaches simply repeatedly sample the LLM given the same constructed input/prompt created from the original buggy code, which not only leads to generating the same incorrect patches repeatedly but also misses the critical information in test cases. To address these limitations, we propose conversational APR, a new paradigm for program repair that alternates between patch generation and validation in a conversational manner. In conversational APR, we iteratively build the input to the model by combining previously generated patches with validation feedback. As such, we leverage the long-term context window of LLMs to not only avoid generating previously incorrect patches but also incorporate validation feedback to help the model understand the semantic meaning of the program under test. We evaluate 10 different LLMs, including the newly developed ChatGPT model, to demonstrate the improvement of conversational APR over the prior LLM-for-APR approach.
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.
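The contrast between direct estimation and TD described in the abstract above can be seen on a classic two-state textbook dataset (it is not taken from the paper): state A is visited once, is followed by B, and that episode earns zero reward, while B on its own usually earns reward 1. The sketch below is a minimal illustration; the episode encoding, step size, and sweep count are assumptions.

```python
from collections import defaultdict

# Each episode is a list of (state, reward received on leaving that state).
EPISODES = [[("A", 0.0), ("B", 0.0)]] + [[("B", 1.0)]] * 6 + [[("B", 0.0)]]

def direct_estimate(episodes):
    """Fit values by averaging the observed (undiscounted) returns per state."""
    returns = defaultdict(list)
    for ep in episodes:
        rewards = [r for _, r in ep]
        for t, (s, _) in enumerate(ep):
            returns[s].append(sum(rewards[t:]))
    return {s: sum(g) / len(g) for s, g in returns.items()}

def batch_td0(episodes, alpha=0.05, sweeps=1000):
    """Batch TD(0): repeatedly apply the summed one-step TD errors over the data."""
    values = defaultdict(float)
    for _ in range(sweeps):
        increments = defaultdict(float)
        for ep in episodes:
            for t, (s, r) in enumerate(ep):
                next_v = values[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
                increments[s] += r + next_v - values[s]
        for s, inc in increments.items():
            values[s] += alpha * inc
    return dict(values)

direct = direct_estimate(EPISODES)
td = batch_td0(EPISODES)
```

Direct estimation assigns A the single zero return it observed, whereas TD pools the seven other visits to B and assigns A the same value as B (0.75). This pooling of trajectories that pass through shared states is the kind of effect the abstract's theory quantifies.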
    MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning. (arXiv:2301.13287v1 [cs.LG])
    Training deep networks and tuning hyperparameters on large datasets is computationally intensive. One of the primary research directions for efficient training is to reduce training costs by selecting well-generalizable subsets of training data. Compared to simple adaptive random subset selection baselines, existing intelligent subset selection approaches are not competitive due to the time-consuming subset selection step, which involves computing model-dependent gradients and feature embeddings and applies greedy maximization of submodular objectives. Our key insight is that removing the reliance on downstream model parameters enables subset selection as a pre-processing step and enables one to train multiple models at no additional cost. In this work, we propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training while enabling superior model convergence and performance by using an easy-to-hard curriculum. Our empirical results indicate that MILO can train models $3\times - 10 \times$ faster and tune hyperparameters $20\times - 75 \times$ faster than full-dataset training or tuning without compromising performance.
    Temporal Consistency Loss for Physics-Informed Neural Networks. (arXiv:2301.13262v1 [physics.flu-dyn])
    Physics-informed neural networks (PINNs) have been widely used to solve partial differential equations in a forward and inverse manner using deep neural networks. However, training these networks can be challenging for multiscale problems. While statistical methods can be employed to scale the regression loss on data, it is generally challenging to scale the loss terms for equations. This paper proposes a method for scaling the mean squared loss terms in the objective function used to train PINNs. Instead of using automatic differentiation to calculate the temporal derivative, we use backward Euler discretization. This provides us with a scaling term for the equations. In this work, we consider the two and three-dimensional Navier-Stokes equations and determine the kinematic viscosity using the spatio-temporal data on the velocity and pressure fields. We first consider numerical datasets to test our method. We test the sensitivity of our method to the time step size, the number of timesteps, noise in the data, and spatial resolution. Finally, we use the velocity field obtained using Particle Image Velocimetry (PIV) experiments to generate a reference pressure field. We then test our framework using the velocity and reference pressure field.
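The backward Euler step mentioned in the abstract above replaces the time derivative at step n with (u_n - u_{n-1}) / dt and evaluates the right-hand side at step n. The sketch below applies that idea to a toy inverse problem, recovering a decay rate k in du/dt = -k u from sampled data by least squares on the discretized residual. It is a stand-in under assumptions: no neural network is involved, and the ODE, `true_k`, and `dt` are hypothetical substitutes for the Navier-Stokes setting and kinematic viscosity in the paper.

```python
import math

def backward_euler_derivative(u, dt):
    """Approximate du/dt at steps 1..N-1 using (u[n] - u[n-1]) / dt."""
    return [(u[n] - u[n - 1]) / dt for n in range(1, len(u))]

def fit_decay_rate(u, dt):
    """Least-squares k minimizing the residual (u[n] - u[n-1]) / dt + k * u[n]."""
    dudt = backward_euler_derivative(u, dt)
    u_now = u[1:]  # backward Euler evaluates the right-hand side at step n
    num = sum(d * v for d, v in zip(dudt, u_now))
    den = sum(v * v for v in u_now)
    return -num / den

# Hypothetical noise-free data from u(t) = exp(-k t).
dt = 1e-3
true_k = 2.0
samples = [math.exp(-true_k * n * dt) for n in range(1001)]

k_hat = fit_decay_rate(samples, dt)
```

The recovered value differs from `true_k` by a discretization error that shrinks with `dt`, which is one reason sensitivity to the time step size matters for this kind of scheme.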
    Online Loss Function Learning. (arXiv:2301.13247v1 [cs.LG])
    Loss function learning is a new meta-learning paradigm that aims to automate the essential task of designing a loss function for a machine learning model. Existing techniques for loss function learning have shown promising results, often improving a model's training dynamics and final inference performance. However, a significant limitation of these techniques is that the loss functions are meta-learned in an offline fashion, where the meta-objective only considers the very first few steps of training, which is a significantly shorter time horizon than the one typically used for training deep neural networks. This causes significant bias towards loss functions that perform well at the very start of training but perform poorly at the end of training. To address this issue we propose a new loss function learning technique for adaptively updating the loss function online after each update to the base model parameters. The experimental results show that our proposed method consistently outperforms the cross-entropy loss and offline loss function learning techniques on a diverse range of neural network architectures and datasets.
    SoftTreeMax: Exponential Variance Reduction in Policy Gradient via Tree Search. (arXiv:2301.13236v1 [cs.LG])
    Despite the popularity of policy gradient methods, they are known to suffer from large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that takes planning into account. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We consider two variants of SoftTreeMax, one for cumulative reward and one for exponentiated reward. For both, we analyze the gradient variance and reveal for the first time the role of a tree expansion policy in mitigating this variance. We prove that the resulting variance decays exponentially with the planning horizon as a function of the expansion policy. Specifically, we show that the closer the resulting state transitions are to uniform, the faster the decay. In a practical implementation, we utilize a parallelized GPU-based simulator for fast and efficient tree search. Our differentiable tree-based policy leverages all gradients at the tree leaves in each environment step instead of the traditional single-sample-based gradient. We then show in simulation how the variance of the gradient is reduced by three orders of magnitude, leading to better sample complexity compared to the standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance in a faster run time compared to distributed PPO. Lastly, we demonstrate that high reward correlates with lower variance.
    Probabilistic Neural Data Fusion for Learning from an Arbitrary Number of Multi-fidelity Data Sets. (arXiv:2301.13271v1 [cs.LG])
    In many applications in engineering and sciences, analysts have simultaneous access to multiple data sources. In such cases, the overall cost of acquiring information can be reduced via data fusion or multi-fidelity (MF) modeling where one leverages inexpensive low-fidelity (LF) sources to reduce the reliance on expensive high-fidelity (HF) data. In this paper, we employ neural networks (NNs) for data fusion in scenarios where data is very scarce and obtained from an arbitrary number of sources with varying levels of fidelity and cost. We introduce a unique NN architecture that converts MF modeling into a nonlinear manifold learning problem. Our NN architecture inversely learns non-trivial (e.g., non-additive and non-hierarchical) biases of the LF sources in an interpretable and visualizable manifold where each data source is encoded via a low-dimensional distribution. This probabilistic manifold quantifies model form uncertainties such that LF sources with small bias are encoded close to the HF source. Additionally, we endow the output of our NN with a parametric distribution not only to quantify aleatoric uncertainties, but also to reformulate the network's loss function based on strictly proper scoring rules which improve robustness and accuracy on unseen HF data. Through a set of analytic and engineering examples, we demonstrate that our approach provides high predictive power while quantifying various sources of uncertainty.
    Retrosynthetic Planning with Dual Value Networks. (arXiv:2301.13755v1 [cs.AI])
    Retrosynthesis, which aims to find a route to synthesize a target molecule from commercially available starting materials, is a critical task in drug discovery and materials design. Recently, the combination of ML-based single-step reaction predictors with multi-step planners has led to promising results. However, the single-step predictors are mostly trained offline to optimize the single-step accuracy, without considering complete routes. Here, we leverage reinforcement learning (RL) to improve the single-step predictor, by using a tree-shaped MDP to optimize complete routes while retaining single-step accuracy. Desirable routes should be both synthesizable and of low cost. We propose an online training algorithm, called Planning with Dual Value Networks (PDVN), in which two value networks predict the synthesizability and cost of molecules, respectively. To maintain the single-step accuracy, we design a two-branch network structure for the single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm improves the search success rate of existing multi-step planners (e.g., increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the number of model calls by half while solving 99.47% of molecules for RetroGraph). Furthermore, PDVN finds shorter synthesis routes (e.g., reducing the average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for RetroGraph).
    Emergence of Maps in the Memories of Blind Navigation Agents. (arXiv:2301.13261v1 [cs.AI])
    Animal navigation research posits that organisms build and maintain internal spatial representations, or maps, of their environment. We ask if machines -- specifically, artificial intelligence (AI) navigation agents -- also build implicit (or 'mental') maps. A positive answer to this question would (a) explain the surprising phenomenon in recent literature of ostensibly map-free neural-networks achieving strong performance, and (b) strengthen the evidence of mapping as a fundamental mechanism for navigation by intelligent embodied agents, whether they be biological or artificial. Unlike animal navigation, we can judiciously design the agent's perceptual system and control the learning paradigm to nullify alternative navigation mechanisms. Specifically, we train 'blind' agents -- with sensing limited to only egomotion and no other sensing of any kind -- to perform PointGoal navigation ('go to $\Delta$ x, $\Delta$ y') via reinforcement learning. Our agents are composed of navigation-agnostic components (fully-connected and recurrent neural networks), and our experimental setup provides no inductive bias towards mapping. Despite these harsh conditions, we find that blind agents are (1) surprisingly effective navigators in new environments (~95% success); (2) they utilize memory over long horizons (remembering ~1,000 steps of past experience in an episode); (3) this memory enables them to exhibit intelligent behavior (following walls, detecting collisions, taking shortcuts); (4) there is emergence of maps and collision detection neurons in the representations of the environment built by a blind agent as it navigates; and (5) the emergent maps are selective and task dependent (e.g. the agent 'forgets' exploratory detours). Overall, this paper presents no new techniques for the AI audience, but a surprising finding, an insight, and an explanation.
    Interpreting Robustness Proofs of Deep Neural Networks. (arXiv:2301.13845v1 [cs.LG])
    In recent years, numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Though the proposed techniques are effective in providing mathematical guarantees about the DNNs' behavior, it is not clear whether the proofs generated by these methods are human-interpretable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human-understandable interpretations of the proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely on spurious input features, while the proofs of DNNs trained to be provably robust filter out even the semantically meaningful features. The proofs for the DNNs combining adversarial and provably robust training are the most effective at selectively filtering out spurious features as well as relying on human-understandable input features.
    Contextual Pandora's Box. (arXiv:2205.13114v2 [cs.LG] UPDATED)
    Pandora's Box is a fundamental stochastic optimization problem, where the decision-maker must find a good alternative while minimizing the search cost of exploring the value of each alternative. In the original formulation, it is assumed that accurate distributions are given for the values of all the alternatives, while recent work studies the online variant of Pandora's Box where the distributions are originally unknown. In this work, we study Pandora's Box in the online setting, while incorporating context. At every round, we are presented with a number of alternatives each having a context, an exploration cost and an unknown value drawn from an unknown distribution that may change at every round. Our main result is a no-regret algorithm that performs comparably well to the optimal algorithm which knows all prior distributions exactly. Our algorithm works even in the bandit setting where the algorithm never learns the values of the alternatives that were not explored. The key technique that enables our result is a novel modification of the realizability condition in contextual bandits that connects a context to a sufficient statistic of each alternative's distribution (its "reservation value") rather than its mean.
    Multi-fidelity covariance estimation in the log-Euclidean geometry. (arXiv:2301.13749v1 [stat.CO])
    We introduce a multi-fidelity estimator of covariance matrices that employs the log-Euclidean geometry of the symmetric positive-definite manifold. The estimator fuses samples from a hierarchy of data sources of differing fidelities and costs for variance reduction while guaranteeing definiteness, in contrast with previous approaches. The new estimator makes covariance estimation tractable in applications where simulation or data collection is expensive; to that end, we develop an optimal sample allocation scheme that minimizes the mean-squared error of the estimator given a fixed budget. Guaranteed definiteness is crucial to metric learning, data assimilation, and other downstream tasks. Evaluations of our approach using data from physical applications (heat conduction, fluid dynamics) demonstrate more accurate metric learning and speedups of more than one order of magnitude compared to benchmarks.
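    The definiteness guarantee mentioned in this abstract follows from a general property of the log-Euclidean geometry: any linear combination formed in the matrix-logarithm domain maps back to a symmetric positive-definite (SPD) matrix under the matrix exponential. The sketch below contrasts that with a plain Euclidean combination, which can lose definiteness. It is an illustration only and not the paper's estimator or allocation scheme; the 2x2 closed-form eigendecomposition and the names `sym_fun`, `combine`, and `is_spd` are assumptions made to keep the example dependency-free.

```python
import math

def sym_fun(M, f):
    """Apply scalar function f to the eigenvalues of a symmetric 2x2 matrix."""
    (a, b), (_, c) = M
    m = 0.5 * (a + c)
    r = math.hypot(0.5 * (a - c), b)
    theta = 0.5 * math.atan2(2.0 * b, a - c)   # rotation that diagonalises M
    cs, sn = math.cos(theta), math.sin(theta)
    f1, f2 = f(m + r), f(m - r)                # f applied to both eigenvalues
    off = (f1 - f2) * cs * sn
    return [[f1 * cs * cs + f2 * sn * sn, off],
            [off, f1 * sn * sn + f2 * cs * cs]]

def combine(mats, weights, log_domain):
    """Weighted sum of matrices, taken directly or in the log domain."""
    terms = [sym_fun(M, math.log) if log_domain else M for M in mats]
    total = [[sum(w * T[i][j] for w, T in zip(weights, terms))
              for j in range(2)] for i in range(2)]
    return sym_fun(total, math.exp) if log_domain else total

def is_spd(M):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return M[0][0] > 0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0

# Hypothetical covariance estimates and control-variate-style weights.
A = [[2.0, 0.5], [0.5, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
C = [[3.0, 0.0], [0.0, 3.0]]
weights = [1.0, 1.0, -1.0]

euclid = combine([A, B, C], weights, log_domain=False)
log_euclid = combine([A, B, C], weights, log_domain=True)
print("Euclidean SPD:", is_spd(euclid), "| log-Euclidean SPD:", is_spd(log_euclid))
```

    With equal positive weights that sum to one, the same `combine` call in the log domain yields the log-Euclidean mean of the input matrices.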
    Personalized Subgraph Federated Learning. (arXiv:2206.10206v2 [cs.LG] UPDATED)
    Subgraphs of a larger global graph may be distributed across multiple devices, and only locally accessible due to privacy restrictions, although there may be links between subgraphs. Recently proposed subgraph Federated Learning (FL) methods deal with those missing links across local subgraphs while distributively training Graph Neural Networks (GNNs) on them. However, they have overlooked the inevitable heterogeneity between subgraphs comprising different communities of a global graph, consequently collapsing the incompatible knowledge from local GNN models. To this end, we introduce a new subgraph FL problem, personalized subgraph FL, which focuses on the joint improvement of the interrelated local GNNs rather than learning a single global model, and propose a novel framework, FEDerated Personalized sUBgraph learning (FED-PUB), to tackle it. Since the server cannot access the subgraph in each client, FED-PUB utilizes functional embeddings of the local GNNs using random graphs as inputs to compute similarities between them, and uses the similarities to perform weighted averaging for server-side aggregation. Further, it learns a personalized sparse mask at each client to select and update only the subgraph-relevant subset of the aggregated parameters. We validate our FED-PUB for its subgraph FL performance on six datasets, considering both non-overlapping and overlapping subgraphs, on which it significantly outperforms relevant baselines.
    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifacts caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real-world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (e.g., GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.
    Are Defenses for Graph Neural Networks Robust?. (arXiv:2301.13694v1 [cs.LG])
    A cursory reading of the literature suggests that we have made a lot of progress in designing effective adversarial defenses for Graph Neural Networks (GNNs). Yet, the standard methodology has a serious flaw - virtually all of the defenses are evaluated against non-adaptive attacks leading to overly optimistic robustness estimates. We perform a thorough robustness analysis of 7 of the most popular defenses spanning the entire spectrum of strategies, i.e., aimed at improving the graph, the architecture, or the training. The results are sobering - most defenses show no or only marginal improvement compared to an undefended baseline. We advocate using custom adaptive attacks as a gold standard and we outline the lessons we learned from successfully designing such attacks. Moreover, our diverse collection of perturbed graphs forms a (black-box) unit test offering a first glance at a model's robustness.
    Identifying the Hazard Boundary of ML-enabled Autonomous Systems Using Cooperative Co-Evolutionary Search. (arXiv:2301.13807v1 [cs.SE])
    In Machine Learning (ML)-enabled autonomous systems (MLASs), it is essential to identify the hazard boundary of ML Components (MLCs) in the MLAS under analysis. Given that such boundary captures the conditions in terms of MLC behavior and system context that can lead to hazards, it can then be used to, for example, build a safety monitor that can take any predefined fallback mechanisms at runtime when reaching the hazard boundary. However, determining such hazard boundary for an ML component is challenging. This is due to the space combining system contexts (i.e., scenarios) and MLC behaviors (i.e., inputs and outputs) being far too large for exhaustive exploration and even to handle using conventional metaheuristics, such as genetic algorithms. Additionally, the high computational cost of simulations required to determine any MLAS safety violations makes the problem even more challenging. Furthermore, it is unrealistic to consider a region in the problem space deterministically safe or unsafe due to the uncontrollable parameters in simulations and the non-linear behaviors of ML models (e.g., deep neural networks) in the MLAS under analysis. To address the challenges, we propose MLCSHE (ML Component Safety Hazard Envelope), a novel method based on a Cooperative Co-Evolutionary Algorithm (CCEA), which aims to tackle a high-dimensional problem by decomposing it into two lower-dimensional search subproblems. Moreover, we take a probabilistic view of safe and unsafe regions and define a novel fitness function to measure the distance from the probabilistic hazard boundary and thus drive the search effectively. We evaluate the effectiveness and efficiency of MLCSHE on a complex Autonomous Vehicle (AV) case study. Our evaluation results show that MLCSHE is significantly more effective and efficient compared to a standard genetic algorithm and random search.
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.
    Video Influencers: Unboxing the Mystique. (arXiv:2012.12311v2 [cs.LG] UPDATED)
    Influencer marketing has become a very popular tool to reach customers. Despite the rapid growth in influencer videos, there has been little research on the effectiveness of their constituent elements in explaining video engagement. We study YouTube influencers and analyze their unstructured video data across text, audio and images using a novel "interpretable deep learning" framework that accomplishes both goals of prediction and interpretation. Our prediction-based approach analyzes unstructured data and finds that "what is said" in words (text) is more influential than "how it is said" in imagery (images) followed by acoustics (audio). Our interpretation-based approach is implemented after completion of model prediction by analyzing the same source of unstructured data to measure importance attributed to the video elements. We eliminate several spurious and confounded relationships, and identify a smaller subset of theory-based relationships. We uncover novel findings that establish distinct effects for measures of shallow and deep engagement which are based on the dual-system framework of human thinking. Our approach is validated using simulated data, and we discuss the learnings from our findings for influencers and brands.
    Unsupervised Music Source Separation Using Differentiable Parametric Source Models. (arXiv:2201.09592v2 [cs.SD] UPDATED)
    Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep learning based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
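    The abstract states that soft masks are obtained from the synthesized source signals at test time but does not give the mask formula. A common choice, assumed here purely for illustration, is a ratio mask: each source's magnitude divided by the sum of all source magnitudes in the same time-frequency bin. The function names and the tiny three-bin example are hypothetical.

```python
def soft_masks(source_mags, eps=1e-8):
    """Ratio masks from source magnitudes; source_mags[k][f] is source k, bin f."""
    n_bins = len(source_mags[0])
    # Per-bin total magnitude; eps avoids division by zero in silent bins.
    totals = [sum(src[f] for src in source_mags) + eps for f in range(n_bins)]
    return [[src[f] / totals[f] for f in range(n_bins)] for src in source_mags]

def apply_masks(mixture_mags, masks):
    """Scale the observed mixture by each source's mask, bin by bin."""
    return [[m * x for m, x in zip(mask, mixture_mags)] for mask in masks]

# Two synthesized sources and one observed mixture, three bins each.
synth = [[1.0, 0.0, 2.0],
         [1.0, 3.0, 2.0]]
mixture = [2.0, 3.0, 5.0]

masks = soft_masks(synth)
estimates = apply_masks(mixture, masks)
for k, est in enumerate(estimates):
    print(f"source {k}:", [round(v, 3) for v in est])
```

    Because the masks sum to approximately one in every bin, the source estimates add back up to the observed mixture, even where the synthesized signals do not match it exactly.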
    Auxiliary Learning as an Asymmetric Bargaining Game. (arXiv:2301.13501v1 [cs.LG])
    Auxiliary learning is an effective method for enhancing the generalization capabilities of trained models, particularly when dealing with small datasets. However, this approach may present several difficulties: (i) optimizing multiple objectives can be more challenging, and (ii) how to balance the auxiliary tasks to best assist the main task is unclear. In this work, we propose a novel approach, named AuxiNash, for balancing tasks in auxiliary learning by formalizing the problem as a generalized bargaining game with asymmetric task bargaining power. Furthermore, we describe an efficient procedure for learning the bargaining power of tasks based on their contribution to the performance of the main task and derive theoretical guarantees for its convergence. Finally, we evaluate AuxiNash on multiple multi-task benchmarks and find that it consistently outperforms competing methods.
    An Analysis of Classification Approaches for Hit Song Prediction using Engineered Metadata Features with Lyrics and Audio Features. (arXiv:2301.13507v1 [cs.IR])
    Hit song prediction, one of the emerging fields in music information retrieval (MIR), remains a considerable challenge. Being able to understand what makes a given song a hit is clearly beneficial to the whole music industry. Previous approaches to hit song prediction have focused on using audio features of a record. This study aims to improve the prediction result of the top 10 hits among Billboard Hot 100 songs using more alternative metadata, including song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) with all features (including novel features, song audio features and lyrics features) outperform other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
    NP-Match: Towards a New Probabilistic Model for Semi-Supervised Learning. (arXiv:2301.13569v1 [cs.CV])
    Semi-supervised learning (SSL) has been widely explored in recent years, and it is an effective way of leveraging unlabeled data to reduce the reliance on labeled data. In this work, we adjust neural processes (NPs) to the semi-supervised image classification task, resulting in a new method named NP-Match. NP-Match is suited to this task for two reasons. Firstly, NP-Match implicitly compares data points when making predictions, and as a result, the prediction of each unlabeled data point is affected by the labeled data points that are similar to it, which improves the quality of pseudo-labels. Secondly, NP-Match is able to estimate uncertainty that can be used as a tool for selecting unlabeled samples with reliable pseudo-labels. Compared with uncertainty-based SSL methods implemented with Monte-Carlo (MC) dropout, NP-Match estimates uncertainty with much less computational overhead, which can save time at both the training and the testing phases. We conducted extensive experiments on five public datasets under three semi-supervised image classification settings, namely, the standard semi-supervised image classification, the imbalanced semi-supervised image classification, and the multi-label semi-supervised image classification, and NP-Match outperforms state-of-the-art (SOTA) approaches or achieves competitive results on them, which shows the effectiveness of NP-Match and its potential for SSL. The codes are at https://github.com/Jianf-Wang/NP-Match
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur an undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated with a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this the probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm PASCombUCB that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show that under both the problem-dependent and problem-independent paradigms, PASCombUCB is almost asymptotically optimal. Our problem setup, the proposed PASCombUCB algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Recurrences reveal shared causal drivers of complex time series. (arXiv:2301.13516v1 [cs.LG])
    Many experimental time series measurements share an unobserved causal driver. Examples include genes targeted by transcription factors, ocean flows influenced by large-scale atmospheric currents, and motor circuits steered by descending neurons. Reliably inferring this unseen driving force is necessary to understand the intermittent nature of top-down control schemes in diverse biological and engineered systems. Here, we introduce a new unsupervised learning algorithm that uses recurrences in time series measurements to gradually reconstruct an unobserved driving signal. Drawing on the mathematical theory of skew-product dynamical systems, we identify recurrence events shared across response time series, which implicitly define a recurrence graph with glass-like structure. As the amount or quality of observed data improves, this recurrence graph undergoes a percolation transition manifesting as weak ergodicity breaking for random walks on the induced landscape -- revealing the shared driver's dynamics, even in the presence of strongly corrupted or noisy measurements. Across several thousand random dynamical systems, we empirically quantify the dependence of reconstruction accuracy on the rate of information transfer from a chaotic driver to the response systems, and we find that effective reconstruction proceeds through gradual approximation of the driver's dominant unstable periodic orbits. Through extensive benchmarks against classical and neural-network-based signal processing techniques, we demonstrate our method's strong ability to extract causal driving signals from diverse real-world datasets spanning neuroscience, genomics, fluid dynamics, and physiology.
    Tricking AI chips into Simulating the Human Brain: A Detailed Performance Analysis. (arXiv:2301.13637v1 [cs.LG])
    Challenging the Nvidia monopoly, dedicated AI-accelerator chips have begun emerging for tackling the computational challenge that the inference and, especially, the training of modern deep neural networks (DNNs) poses to modern computers. The field has been rife with studies assessing the performance of these contestants across various DNN model types. However, AI-experts are aware of the limitations of current DNNs and have been working towards the fourth AI wave which will, arguably, rely on more biologically inspired models, predominantly on spiking neural networks (SNNs). At the same time, GPUs have been heavily used for simulating such models in the field of computational neuroscience, yet AI-chips have not been tested on such workloads. The current paper aims at filling this important gap by evaluating multiple, cutting-edge AI-chips (Graphcore IPU, GroqChip, Nvidia GPU with Tensor Cores and Google TPU) on simulating a highly biologically detailed model of a brain region, the inferior olive (IO). This IO application stress-tests the different AI-platforms for highlighting architectural tradeoffs by varying its compute density, memory requirements and floating-point numerical accuracy. Our performance analysis reveals that the simulation problem maps extremely well onto the GPU and TPU architectures, which for networks of 125,000 cells leads to 28x and 1,208x speedups, respectively, over CPU runtimes. At this speed, the TPU sets a new record for largest real-time IO simulation. The GroqChip outperforms both platforms for small networks but, due to implementing some floating-point operations at reduced accuracy, is found not yet usable for brain simulation.
    V2N Service Scaling with Deep Reinforcement Learning. (arXiv:2301.13324v1 [cs.LG])
    The fifth generation (5G) of wireless networks is set out to meet the stringent requirements of vehicular use cases. Edge computing resources can aid in this direction by moving processing closer to end-users, reducing latency. However, given the stochastic nature of traffic loads and availability of physical resources, appropriate auto-scaling mechanisms need to be employed to support cost-efficient and performant services. To this end, we employ Deep Reinforcement Learning (DRL) for vertical scaling in Edge computing to support vehicular-to-network communications. We address the problem using Deep Deterministic Policy Gradient (DDPG). As DDPG is a model-free off-policy algorithm for learning continuous actions, we introduce a discretization approach to support discrete scaling actions. Thus we address scalability problems inherent to high-dimensional discrete action spaces. Employing a real-world vehicular trace data set, we show that DDPG outperforms existing solutions, reducing (at minimum) the average number of active CPUs by 23% while increasing the long-term reward by 24%.
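    The abstract mentions a discretization approach that lets DDPG, which outputs continuous actions, drive discrete scaling decisions, without describing the mapping. One simple possibility, assumed here and not taken from the paper, is to snap the actor's output in [-1, 1] to the nearest of a few evenly spaced scaling steps and clamp the resulting CPU count. The names `discretize_action` and `apply_scaling`, the step set, and the CPU limits are all hypothetical.

```python
def discretize_action(a, choices=(-1, 0, 1)):
    """Map a continuous action in [-1, 1] to the nearest evenly spaced choice."""
    a = max(-1.0, min(1.0, a))                        # clip out-of-range outputs
    idx = round((a + 1.0) / 2.0 * (len(choices) - 1)) # nearest index in choices
    return choices[idx]

def apply_scaling(cpus, delta, min_cpus=1, max_cpus=8):
    """Apply a scaling step and keep the CPU count within the allowed range."""
    return max(min_cpus, min(max_cpus, cpus + delta))

# Hypothetical actor outputs over four decision steps.
cpus = 2
for raw_action in (0.8, 0.1, -0.9, -0.9):
    cpus = apply_scaling(cpus, discretize_action(raw_action))
    print(f"raw={raw_action:+.1f} -> cpus={cpus}")
```

    Keeping the actor continuous and discretizing only at the environment boundary avoids enumerating every discrete action inside the network, at the cost of a non-differentiable rounding step between the actor and the environment.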
    When Source-Free Domain Adaptation Meets Learning with Noisy Labels. (arXiv:2301.13381v1 [cs.LG])
    Recent state-of-the-art source-free domain adaptation (SFDA) methods have focused on learning meaningful cluster structures in the feature space, which have succeeded in adapting the knowledge from source domain to unlabeled target domain without accessing the private source data. However, existing methods rely on the pseudo-labels generated by source models that can be noisy due to domain shift. In this paper, we study SFDA from the perspective of learning with label noise (LLN). Unlike the label noise in the conventional LLN scenario, we prove that the label noise in SFDA follows a different distribution assumption. We also prove that such a difference makes existing LLN methods that rely on their distribution assumptions unable to address the label noise in SFDA. Empirical evidence suggests that only marginal improvements are achieved when applying the existing LLN methods to solve the SFDA problem. On the other hand, although there exists a fundamental difference between the label noise in the two scenarios, we demonstrate theoretically that the early-time training phenomenon (ETP), which has been previously observed in conventional label noise settings, can also be observed in the SFDA problem. Extensive experiments demonstrate significant improvements to existing SFDA algorithms by leveraging ETP to address the label noise in SFDA.
    Automated Time-frequency Domain Audio Crossfades using Graph Cuts. (arXiv:2301.13380v1 [cs.SD])
    The problem of transitioning smoothly from one audio clip to another arises in many music consumption scenarios, especially as music consumption has moved from professionally curated and live-streamed radios to personal playback devices and services. We present the first steps toward a new method of automatically transitioning from one audio clip to another by discretizing the frequency spectrum into bins and then finding transition times for each bin. We phrase the problem as one of graph flow optimization, specifically min-cut/max-flow.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Optimizing DDPM Sampling with Shortcut Fine-Tuning. (arXiv:2301.13362v1 [cs.LG])
    In this study, we propose Shortcut Fine-tuning (SFT), a new approach for addressing the challenge of fast sampling of pretrained Denoising Diffusion Probabilistic Models (DDPMs). SFT advocates for the fine-tuning of DDPM samplers through the direct minimization of Integral Probability Metrics (IPM), instead of learning the backward diffusion process. This enables samplers to discover an alternative and more efficient sampling shortcut, deviating from the backward diffusion process. We also propose a new algorithm that is similar to the policy gradient method for fine-tuning DDPMs by proving that under certain assumptions, the gradient descent of diffusion models is equivalent to the policy gradient approach. Through empirical evaluation, we demonstrate that our fine-tuning method can further enhance existing fast DDPM samplers, resulting in sample quality comparable to or even surpassing that of the full-step model across various datasets.
    Misspecification-robust Sequential Neural Likelihood. (arXiv:2301.13368v1 [stat.ME])
    Simulation-based inference (SBI) techniques are now an essential tool for the parameter estimation of mechanistic and simulatable models with intractable likelihoods. Statistical approaches to SBI such as approximate Bayesian computation and Bayesian synthetic likelihood have been well studied in the well specified and misspecified settings. However, most implementations are inefficient in that many model simulations are wasted. Neural approaches such as sequential neural likelihood (SNL) have been developed that exploit all model simulations to build a surrogate of the likelihood function. However, SNL approaches have been shown to perform poorly under model misspecification. In this paper, we develop a new method for SNL that is robust to model misspecification and can identify areas where the model is deficient. We demonstrate the usefulness of the new approach on several illustrative examples.
    Self-Consistent Velocity Matching of Probability Flows. (arXiv:2301.13737v1 [cs.LG])
    We present a discretization-free scalable framework for solving a large class of mass-conserving partial differential equations (PDEs), including the time-dependent Fokker-Planck equation and the Wasserstein gradient flow. The main observation is that the time-varying velocity field of the PDE solution needs to be self-consistent: it must satisfy a fixed-point equation involving the flow characterized by the same velocity field. By parameterizing the flow as a time-dependent neural network, we propose an end-to-end iterative optimization framework called self-consistent velocity matching to solve this class of PDEs. Compared to existing approaches, our method does not suffer from temporal or spatial discretization, covers a wide range of PDEs, and scales to high dimensions. Experimentally, our method recovers analytical solutions accurately when they are available and achieves comparable or better performance in high dimensions with less training time compared to recent large-scale JKO-based methods that are designed for solving a more restrictive family of PDEs.
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d. examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.
    Learning Coordination Policies over Heterogeneous Graphs for Human-Robot Teams via Recurrent Neural Schedule Propagation. (arXiv:2301.13279v1 [cs.AI])
    As human-robot collaboration increases in the workforce, it becomes essential for human-robot teams to coordinate efficiently and intuitively. Traditional approaches for human-robot scheduling either utilize exact methods that are intractable for large-scale problems and struggle to account for stochastic, time varying human task performance, or application-specific heuristics that require expert domain knowledge to develop. We propose a deep learning-based framework, called HybridNet, combining a heterogeneous graph-based encoder with a recurrent schedule propagator for scheduling stochastic human-robot teams under upper- and lower-bound temporal constraints. The HybridNet's encoder leverages Heterogeneous Graph Attention Networks to model the initial environment and team dynamics while accounting for the constraints. By formulating task scheduling as a sequential decision-making process, the HybridNet's recurrent neural schedule propagator leverages Long Short-Term Memory (LSTM) models to propagate forward consequences of actions to carry out fast schedule generation, removing the need to interact with the environment between every task-agent pair selection. The resulting scheduling policy network provides a computationally lightweight yet highly expressive model that is end-to-end trainable via Reinforcement Learning algorithms. We develop a virtual task scheduling environment for mixed human-robot teams in a multi-round setting, capable of modeling the stochastic learning behaviors of human workers. Experimental results showed that HybridNet outperformed other human-robot scheduling solutions across problem sizes for both deterministic and stochastic human performance, with faster runtime compared to pure-GNN-based schedulers.  ( 2 min )
    Sifer: Overcoming simplicity bias in deep networks using a feature sieve. (arXiv:2301.13293v1 [cs.LG])
    Simplicity bias is the concerning tendency of deep networks to over-depend on simple, weakly predictive features, to the exclusion of stronger, more complex features. This causes biased, incorrect model predictions in many real-world applications, exacerbated by incomplete training data containing spurious feature-label correlations. We propose a direct, interventional method for addressing simplicity bias in DNNs, which we call the feature sieve. We aim to automatically identify and suppress easily-computable spurious features in lower layers of the network, thereby allowing the higher network levels to extract and utilize richer, more meaningful representations. We provide concrete evidence of this differential suppression & enhancement of relevant features on both controlled datasets and real-world images, and report substantial gains on many real-world debiasing benchmarks (11.4% relative gain on Imagenet-A; 3.2% on BAR, etc). Crucially, we outperform many baselines that incorporate knowledge about known spurious or biased attributes, despite our method not using any such information. We believe that our feature sieve work opens up exciting new research directions in automated adversarial feature extraction & representation learning for deep networks.  ( 2 min )
  • Open

    Preserving local densities in low-dimensional embeddings. (arXiv:2301.13732v1 [cs.LG])
    Low-dimensional embeddings and visualizations are an indispensable tool for analysis of high-dimensional data. State-of-the-art methods, such as tSNE and UMAP, excel in unveiling local structures hidden in high-dimensional data and are therefore routinely applied in standard analysis pipelines in biology. We show, however, that these methods fail to reconstruct local properties, such as relative differences in densities (Fig. 1) and that apparent differences in cluster size can arise from computational artifact caused by differing sample sizes (Fig. 2). Providing a theoretical analysis of this issue, we then suggest dtSNE, which approximately conserves local densities. In an extensive study on synthetic benchmark and real world data comparing against five state-of-the-art methods, we empirically show that dtSNE provides similar global reconstruction, but yields much more accurate depictions of local distances and relative densities.  ( 2 min )
    Bayesian Bilinear Neural Network for Predicting the Mid-price Dynamics in Limit-Order Book Markets. (arXiv:2203.03613v2 [econ.EM] UPDATED)
    The prediction of financial markets is a challenging yet important task. In modern electronically-driven markets, traditional time-series econometric methods often appear incapable of capturing the true complexity of the multi-level interactions driving the price dynamics. While recent research has established the effectiveness of traditional machine learning (ML) models in financial applications, their intrinsic inability to deal with uncertainties, which is a great concern in econometrics research and real business applications, constitutes a major drawback. Bayesian methods naturally appear as a suitable remedy conveying the predictive ability of ML methods with the probabilistically-oriented practice of econometric research. By adopting a state-of-the-art second-order optimization algorithm, we train a Bayesian bilinear neural network with temporal attention, suitable for the challenging time-series task of predicting mid-price movements in ultra-high-frequency limit-order book markets. We thoroughly compare our Bayesian model with traditional ML alternatives by addressing the use of predictive distributions to analyze errors and uncertainties associated with the estimated parameters and model forecasts. Our results underline the feasibility of the Bayesian deep-learning approach and its predictive and decisional advantages in complex econometric tasks, prompting future research in this direction.
    Learning from many trajectories. (arXiv:2203.17193v2 [cs.LG] UPDATED)
    We initiate a study of supervised learning from many independent sequences ("trajectories") of non-independent covariates, reflecting tasks in sequence modeling, control, and reinforcement learning. Conceptually, our multi-trajectory setup sits between two traditional settings in statistical learning theory: learning from independent examples and learning from a single auto-correlated sequence. Our conditions for efficient learning generalize the former setting--trajectories must be non-degenerate in ways that extend standard requirements for independent examples. Notably, we do not require that trajectories be ergodic, long, nor strictly stable. For linear least-squares regression, given $n$-dimensional examples produced by $m$ trajectories, each of length $T$, we observe a notable change in statistical efficiency as the number of trajectories increases from a few (namely $m \lesssim n$) to many (namely $m \gtrsim n$). Specifically, we establish that the worst-case error rate of this problem is $\Theta(n / m T)$ whenever $m \gtrsim n$. Meanwhile, when $m \lesssim n$, we establish a (sharp) lower bound of $\Omega(n^2 / m^2 T)$ on the worst-case error rate, realized by a simple, marginally unstable linear dynamical system. A key upshot is that, in domains where trajectories regularly reset, the error rate eventually behaves as if all of the examples were independent, drawn from their marginals. As a corollary of our analysis, we also improve guarantees for the linear system identification problem.
    Learning Generalized Hybrid Proximity Representation for Image Recognition. (arXiv:2301.13459v1 [cs.CV])
    Recently, deep metric learning techniques have received attention, as the learned distance representations are useful for capturing the similarity relationship among samples and further improving the performance of various supervised or unsupervised learning tasks. We propose a novel supervised metric learning method that can learn the distance metrics in both geometric and probabilistic space for image recognition. In contrast to previous metric learning methods, which usually focus on learning the distance metrics in Euclidean space, our proposed method is able to learn a better distance representation in a hybrid approach. To achieve this, we propose a Generalized Hybrid Metric Loss (GHM-Loss) to learn the general hybrid proximity features from the image data by controlling the trade-off between geometric proximity and probabilistic proximity. To evaluate the effectiveness of our method, we first provide theoretical derivations and proofs of the proposed loss function, then we perform extensive experiments on two public datasets to show the advantage of our method compared to other state-of-the-art metric learning methods.
    Improved Algorithms for Multi-period Multi-class Packing Problems with Bandit Feedback. (arXiv:2301.13791v1 [stat.ML])
    We consider the linear contextual multi-class multi-period packing problem (LMMP) where the goal is to pack items such that the total vector of consumption is below a given budget vector and the total value is as large as possible. We consider the setting where the reward and the consumption vector associated with each action is a class-dependent linear function of the context, and the decision-maker receives bandit feedback. LMMP includes linear contextual bandits with knapsacks and online revenue management as special cases. We establish a new more efficient estimator which guarantees a faster convergence rate, and consequently, a lower regret in such problems. We propose a bandit policy that is a closed-form function of said estimated parameters. When the contexts are non-degenerate, the regret of the proposed policy is sublinear in the context dimension, the number of classes, and the time horizon $T$ when the budget grows at least as $\sqrt{T}$. We also resolve an open problem posed in Agrawal & Devanur (2016), and extend the result to a multi-class setting. Our numerical experiments clearly demonstrate that the performance of our policy is superior to other benchmarks in the literature.
    A relaxed proximal gradient descent algorithm for convergent plug-and-play with proximal denoiser. (arXiv:2301.13731v1 [stat.ML])
    This paper presents a new convergent Plug-and-Play (PnP) algorithm. PnP methods are efficient iterative algorithms for solving image inverse problems formulated as the minimization of the sum of a data-fidelity term and a regularization term. PnP methods perform regularization by plugging a pre-trained denoiser in a proximal algorithm, such as Proximal Gradient Descent (PGD). To ensure convergence of PnP schemes, many works study specific parametrizations of deep denoisers. However, existing results require either unverifiable or suboptimal hypotheses on the denoiser, or assume restrictive conditions on the parameters of the inverse problem. Observing that these limitations can be due to the proximal algorithm in use, we study a relaxed version of the PGD algorithm for minimizing the sum of a convex function and a weakly convex one. When plugged with a relaxed proximal denoiser, we show that the proposed PnP-$\alpha$PGD algorithm converges for a wider range of regularization parameters, thus allowing more accurate image restoration.
    Stabilize Deep ResNet with A Sharp Scaling Factor $\tau$. (arXiv:1903.07120v5 [cs.LG] UPDATED)
    We study the stability and convergence of training deep ResNets with gradient descent. Specifically, we show that the parametric branch in the residual block should be scaled down by a factor $\tau =O(1/\sqrt{L})$ to guarantee stable forward/backward process, where $L$ is the number of residual blocks. Moreover, we establish a converse result that the forward process is unbounded when $\tau>L^{-\frac{1}{2}+c}$, for any positive constant $c$. The above two results together establish a sharp value of the scaling factor in determining the stability of deep ResNet. Based on the stability result, we further show that gradient descent finds the global minima if the ResNet is properly over-parameterized, which significantly improves over the previous work with a much larger range of $\tau$ that admits global convergence. Moreover, we show that the convergence rate is independent of the depth, theoretically justifying the advantage of ResNet over vanilla feedforward network. Empirically, with such a factor $\tau$, one can train deep ResNet without normalization layer. Moreover, for ResNets with normalization layer, adding such a factor $\tau$ also stabilizes the training and obtains significant performance gain for deep ResNet.
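    The effect of the scaling factor can be illustrated with a toy experiment. The sketch below is not the paper's model or proof: it assumes a simplified residual update h <- h + tau * W relu(h) with fresh Gaussian weights in each block, and the function name `forward_norm_ratio` and all parameter values are hypothetical. It only compares how the forward norm behaves for tau = 1 versus tau = 1/sqrt(L).

```python
import math
import random


def forward_norm_ratio(depth, width, tau, seed=0):
    """Return ||h_L|| / ||h_0|| for a toy ReLU ResNet: h <- h + tau * W relu(h).

    Assumption for illustration only: each block draws fresh weights with
    i.i.d. N(0, 1/width) entries; this is not the paper's exact setting.
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    norm0 = math.sqrt(sum(v * v for v in h))
    std = 1.0 / math.sqrt(width)
    for _ in range(depth):
        act = [max(v, 0.0) for v in h]
        # Parametric branch: one random linear map applied to the activations.
        branch = [sum(rng.gauss(0.0, std) * a for a in act) for _ in range(width)]
        # Residual connection with the branch scaled down by tau.
        h = [v + tau * b for v, b in zip(h, branch)]
    return math.sqrt(sum(v * v for v in h)) / norm0


if __name__ == "__main__":
    L = 64
    print("tau = 1/sqrt(L):", forward_norm_ratio(L, 16, 1.0 / math.sqrt(L)))
    print("tau = 1        :", forward_norm_ratio(L, 16, 1.0))
```

    In this toy setting the squared norm grows by roughly a constant factor per block when tau = 1, so it blows up exponentially with depth, whereas with tau = 1/sqrt(L) the per-block growth is of order 1/L and the total growth stays bounded.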
    Real-Time Outlier Detection with Dynamic Process Limits. (arXiv:2301.13527v1 [cs.LG])
    Anomaly detection methods are part of the systems where rare events may endanger an operation's profitability, safety, and environmental aspects. Although many state-of-the-art anomaly detection methods have been developed to date, their deployment is limited to the operation conditions present during the model training. Online anomaly detection brings the capability to adapt to data drifts and change points that may not be represented during model development, resulting in prolonged service life. This paper proposes an online anomaly detection algorithm for existing real-time infrastructures where low-latency detection is required and novel patterns in data occur unpredictably. The online inverse cumulative distribution-based approach is introduced to eliminate common problems of offline anomaly detectors, while providing dynamic process limits to normal operation. The benefit of the proposed method is the ease of use, fast computation, and deployability as shown in two case studies of real microgrid operation data.
    Discovery of Single Independent Latent Variable. (arXiv:2110.05887v2 [stat.ML] UPDATED)
    Latent variable discovery is a central problem in data analysis with a broad range of applications in applied science. In this work, we consider data given as an invertible mixture of two statistically independent components, and assume that one of the components is observed while the other is hidden. Our goal is to recover the hidden component. For this purpose, we propose an autoencoder equipped with a discriminator. Unlike the standard nonlinear ICA problem, which was shown to be non-identifiable, in the special case of ICA we consider here, we show that our approach can recover the component of interest up to entropy-preserving transformation. We demonstrate the performance of the proposed approach on several datasets, including image synthesis, voice cloning, and fetal ECG extraction.
    A Reinforcement Learning Framework for Dynamic Mediation Analysis. (arXiv:2301.13348v1 [stat.ML])
    Mediation analysis learns the causal effect transmitted via mediator variables between treatments and outcomes and receives increasing attention in various scientific domains to elucidate causal relations. Most existing works focus on point-exposure studies where each subject only receives one treatment at a single time point. However, there are a number of applications (e.g., mobile health) where the treatments are sequentially assigned over time and the dynamic mediation effects are of primary interest. Proposing a reinforcement learning (RL) framework, we are the first to evaluate dynamic mediation effects in settings with infinite horizons. We decompose the average treatment effect into an immediate direct effect, an immediate mediation effect, a delayed direct effect, and a delayed mediation effect. Upon the identification of each effect component, we further develop robust and semi-parametrically efficient estimators under the RL framework to infer these causal effects. The superior performance of the proposed method is demonstrated through extensive numerical studies, theoretical results, and an analysis of a mobile health dataset.
    Physics-constrained 3D Convolutional Neural Networks for Electrodynamics. (arXiv:2301.13715v1 [physics.acc-ph])
    We present a physics-constrained neural network (PCNN) approach to solving Maxwell's equations for the electromagnetic fields of intense relativistic charged particle beams. We create a 3D convolutional PCNN to map time-varying current and charge densities J(r,t) and p(r,t) to vector and scalar potentials A(r,t) and V(r,t) from which we generate electromagnetic fields according to Maxwell's equations: B=curl(A), E=-grad(V)-dA/dt. Our PCNNs satisfy hard constraints, such as div(B)=0, by construction. Soft constraints push A and V towards satisfying the Lorenz gauge.
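    For reference, the standard potential formulation behind these constraints can be written out as follows. This is textbook electrodynamics in SI units, not a formula taken from the paper; it shows why div(B)=0 holds automatically once B is defined as a curl, and what the Lorenz gauge condition on A and V is.

```latex
\begin{align}
\mathbf{B} &= \nabla \times \mathbf{A}, \\
\mathbf{E} &= -\nabla V - \frac{\partial \mathbf{A}}{\partial t}, \\
\nabla \cdot \mathbf{B} &= \nabla \cdot (\nabla \times \mathbf{A}) = 0
  \quad \text{(identity: the divergence of a curl vanishes)}, \\
\nabla \cdot \mathbf{A} + \frac{1}{c^{2}} \frac{\partial V}{\partial t} &= 0
  \quad \text{(Lorenz gauge)}.
\end{align}
```

    The third line is an identity, which is why it can be enforced as a hard constraint by construction, while the fourth is a condition on the learned potentials and so is naturally imposed as a soft penalty.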
    What can be learnt with wide convolutional neural networks?. (arXiv:2208.01003v4 [stat.ML] UPDATED)
    Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional functions remains a fundamental challenge. A popular belief is that these models harness the local and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how such structure affects performance, e.g. the rate of decay of the generalisation error with the number of training samples. In this paper, we study deep CNNs in the kernel regime. First, we show that the spectrum of the corresponding kernel inherits the hierarchical structure of the network, and we characterise its asymptotics. Then, we use this result together with generalisation bounds to prove that deep CNNs adapt to the spatial scale of the target function. In particular, we find that if the target function depends on low-dimensional subsets of adjacent input variables, then the rate of decay of the error is controlled by the effective dimensionality of these subsets. Conversely, if the target function depends on the full set of input variables, then the error rate is inversely proportional to the input dimension. We conclude by computing the rate when a deep CNN is trained on the output of another deep CNN with randomly-initialised parameters. Interestingly, we find that, despite their hierarchical structure, the functions generated by deep CNNs are too rich to be efficiently learnable in high dimension.
    A Bias-Variance-Privacy Trilemma for Statistical Estimation. (arXiv:2301.13334v1 [math.ST])
    The canonical algorithm for differentially private mean estimation is to first clip the samples to a bounded range and then add noise to their empirical mean. Clipping controls the sensitivity and, hence, the variance of the noise that we add for privacy. But clipping also introduces statistical bias. We prove that this tradeoff is inherent: no algorithm can simultaneously have low bias, low variance, and low privacy loss for arbitrary distributions. On the positive side, we show that unbiased mean estimation is possible under approximate differential privacy if we assume that the distribution is symmetric. Furthermore, we show that, even if we assume that the data is sampled from a Gaussian, unbiased mean estimation is impossible under pure or concentrated differential privacy.
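    The clip-then-add-noise recipe described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's construction: it uses the Laplace mechanism for pure epsilon-DP, a replace-one neighbouring relation with a public sample size n, and a hypothetical function name `dp_clipped_mean`.

```python
import random


def dp_clipped_mean(samples, lo, hi, epsilon, rng):
    """Clip samples to [lo, hi], average, and add Laplace noise (epsilon-DP).

    Assumptions for illustration: n = len(samples) is public and neighbouring
    datasets differ by replacing one sample.
    """
    n = len(samples)
    # Clipping bounds how much any single sample can move the mean.
    clipped = [min(max(x, lo), hi) for x in samples]
    mean = sum(clipped) / n
    # Replacing one clipped sample changes the mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / n
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return mean + noise


if __name__ == "__main__":
    rng = random.Random(0)
    data = [0.2, 0.4, 0.6, 10.0]  # one sample lies far outside [0, 1]
    print(dp_clipped_mean(data, 0.0, 1.0, epsilon=1.0, rng=rng))
```

    A narrower clipping range lowers the noise scale but pulls the estimate further from the true mean whenever samples fall outside the range, which is the bias-variance-privacy tension the abstract refers to.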
    Learning Data Representations with Joint Diffusion Models. (arXiv:2301.13622v1 [cs.LG])
    We introduce a joint diffusion model that simultaneously learns meaningful internal representations fit for both generative and predictive tasks. Joint machine learning models that allow synthesizing and classifying data often offer uneven performance between those tasks or are unstable to train. In this work, we depart from a set of empirical observations that indicate the usefulness of internal representations built by contemporary deep diffusion-based generative models in both generative and predictive settings. We then introduce an extension of the vanilla diffusion model with a classifier that allows for stable joint training with shared parametrization between those objectives. The resulting joint diffusion model offers superior performance across various tasks, including generative modeling, semi-supervised classification, and domain adaptation.
    Kernel Stein Discrepancy thinning: a theoretical perspective of pathologies and a practical fix with regularization. (arXiv:2301.13528v1 [math.ST])
    Stein thinning is a promising algorithm proposed by Riabiz et al. (2022) for post-processing outputs of Markov chain Monte Carlo (MCMC). The main principle is to greedily minimize the kernelized Stein discrepancy (KSD), which only requires the gradient of the log-target distribution, and is thus well-suited for Bayesian inference. The main advantages of Stein thinning are the automatic removal of the burn-in period, the correction of the bias introduced by recent MCMC algorithms, and the asymptotic properties of convergence towards the target distribution. Nevertheless, Stein thinning suffers from several empirical pathologies, which may result in poor approximations, as observed in the literature. In this article, we conduct a theoretical analysis of these pathologies, to clearly identify the mechanisms at stake, and suggest improved strategies. Then, we introduce the regularized Stein thinning algorithm to alleviate the identified pathologies. Finally, theoretical guarantees and extensive experiments show the high efficiency of the proposed algorithm.
    Differentially Private Distributed Bayesian Linear Regression with MCMC. (arXiv:2301.13778v1 [stat.ML])
    We propose a novel Bayesian inference framework for distributed differentially private linear regression. We consider a distributed setting where multiple parties hold parts of the data and share certain summary statistics of their portions in privacy-preserving noise. We develop a novel generative statistical model for privately shared statistics, which exploits a useful distributional relation between the summary statistics of linear regression. Bayesian estimation of the regression coefficients is conducted mainly using Markov chain Monte Carlo algorithms, while we also provide a fast version to perform Bayesian estimation in one iteration. The proposed methods have computational advantages over their competitors. We provide numerical results on both real and simulated data, which demonstrate that the proposed algorithms provide well-rounded estimation and prediction.
    Combinatorial Causal Bandits without Graph Skeleton. (arXiv:2301.13392v1 [cs.LG])
    In combinatorial causal bandits (CCB), the learning agent chooses a subset of variables in each round to intervene and collects feedback from the observed variables to minimize expected regret or sample complexity. Previous works study this problem in both general causal models and binary generalized linear models (BGLMs). However, all of them require prior knowledge of causal graph structure. This paper studies the CCB problem without the graph structure on binary general causal models and BGLMs. We first provide an exponential lower bound of cumulative regrets for the CCB problem on general causal models. To overcome the exponentially large space of parameters, we then consider the CCB problem on BGLMs. We design a regret minimization algorithm for BGLMs even without the graph skeleton and show that it still achieves $O(\sqrt{T}\ln T)$ expected regret. This asymptotic regret is the same as that of the state-of-the-art algorithms relying on the graph structure. Moreover, we sacrifice the regret to $O(T^{\frac{2}{3}}\ln T)$ to remove the weight gap covered by the asymptotic notation. Finally, we give some discussions and algorithms for pure exploration of the CCB problem without the graph structure.
    On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters. (arXiv:2301.13370v1 [cs.LG])
    Recent work has shown that automatic differentiation over the reals is almost always correct in a mathematically precise sense. However, actual programs work with machine-representable numbers (e.g., floating-point numbers), not reals. In this paper, we study the correctness of automatic differentiation when the parameter space of a neural network consists solely of machine-representable numbers. For a neural network with bias parameters, we prove that automatic differentiation is correct at all parameters where the network is differentiable. In contrast, it is incorrect at all parameters where the network is non-differentiable, since it never informs non-differentiability. To better understand this non-differentiable set of parameters, we prove a tight bound on its size, which is linear in the number of non-differentiabilities in activation functions, and provide a simple necessary and sufficient condition for a parameter to be in this set. We further prove that automatic differentiation always computes a Clarke subderivative, even on the non-differentiable set. We also extend these results to neural networks possibly without bias parameters.
    Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits. (arXiv:2301.13393v1 [cs.LG])
    Motivated by concerns about making online decisions that incur an undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent is given the option to select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated with a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least $1-\delta$, over the entire horizon of time $T$, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this the probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm PASCombUCB that minimizes the regret over the horizon of time $T$. By developing accompanying information-theoretic lower bounds, we show that, under both the problem-dependent and problem-independent paradigms, PASCombUCB is almost asymptotically optimal. Our problem setup, the proposed PASCombUCB algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
    Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization. (arXiv:2212.13556v2 [cs.LG] UPDATED)
    To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
    Simplex Random Features. (arXiv:2301.13856v1 [stat.ML])
    We present Simplex Random Features (SimRFs), a new random feature (RF) mechanism for unbiased approximation of the softmax and Gaussian kernels by geometrical correlation of random projection vectors. We prove that SimRFs provide the smallest possible mean square error (MSE) on unbiased estimates of these kernels among the class of weight-independent geometrically-coupled positive random feature (PRF) mechanisms, substantially outperforming the previously most accurate Orthogonal Random Features at no observable extra cost. We present a more computationally expensive SimRFs+ variant, which we prove is asymptotically optimal in the broader family of weight-dependent geometrical coupling schemes (which permit correlations between random vector directions and norms). In extensive empirical studies, we show consistent gains provided by SimRFs in settings including pointwise kernel estimation, nonparametric classification and scalable Transformers.
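    As background for the abstract above, the following is a minimal sketch of the baseline that SimRFs improve on: positive random features with independent Gaussian projections, which give an unbiased estimate of the softmax kernel exp(x . y). It does not implement the simplex coupling of SimRFs. The function names, the feature count and the seed are illustrative assumptions.

```python
import numpy as np


def positive_random_features(x, projections):
    """Map x to strictly positive features whose inner products estimate exp(x . y).

    `projections` holds one Gaussian projection vector per row. The rows here
    are drawn independently (the uncoupled baseline), NOT with the simplex
    geometry described in the abstract.
    """
    num_features = projections.shape[0]
    return np.exp(projections @ x - 0.5 * np.dot(x, x)) / np.sqrt(num_features)


def estimate_softmax_kernel(x, y, num_features=20000, seed=0):
    """Monte Carlo estimate of exp(x . y) using i.i.d. positive random features."""
    rng = np.random.default_rng(seed)
    projections = rng.standard_normal((num_features, x.shape[0]))
    phi_x = positive_random_features(x, projections)
    phi_y = positive_random_features(y, projections)
    return float(phi_x @ phi_y)


if __name__ == "__main__":
    x = np.array([0.3, -0.2, 0.1])
    y = np.array([0.1, 0.4, -0.3])
    print("estimate:", estimate_softmax_kernel(x, y))
    print("exact:   ", float(np.exp(x @ y)))
```

    The estimator is unbiased because, for w drawn from a standard Gaussian, the expectation of exp(w . (x + y)) equals exp(|x + y|^2 / 2), and the two normalising factors exp(-|x|^2 / 2) and exp(-|y|^2 / 2) leave exactly exp(x . y).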
    Sequential Kernelized Independence Testing. (arXiv:2212.07383v2 [stat.ML] UPDATED)
    Independence testing is a fundamental and classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) allow stopping earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. It is well known that classical batch tests are not tailored for streaming data settings: valid inference after data peeking requires correcting for multiple testing but such corrections generally result in low power. Following the principle of testing by betting, we design sequential kernelized independence tests (SKITs) that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g., the Hilbert-Schmidt independence criterion. Our test is valid under non-i.i.d. time-varying settings, for which there exist no batch tests. We demonstrate the power of our approaches on both simulated and real data.
    Differentially Private Kernel Inducing Points (DP-KIP) for Privacy-preserving Data Distillation. (arXiv:2301.13389v1 [cs.LG])
    While it is tempting to believe that data distillation preserves privacy, distilled data's empirical robustness against known attacks does not imply a provable privacy guarantee. Here, we develop a provably privacy-preserving data distillation algorithm, called differentially private kernel inducing points (DP-KIP). DP-KIP is an instantiation of DP-SGD on kernel ridge regression (KRR). Following a recent work, we use neural tangent kernels and minimize the KRR loss to estimate the distilled datapoints (i.e., kernel inducing points). We provide a computationally efficient JAX implementation of DP-KIP, which we test on several popular image and tabular datasets to show its efficacy in data distillation with differential privacy guarantees.
    Population-wise Labeling of Sulcal Graphs using Multi-graph Matching. (arXiv:2301.13532v1 [stat.ML])
    Population-wise matching of the cortical fold is necessary to identify biomarkers of neurological or psychiatric disorders. The difficulty comes from the massive interindividual variations in the morphology and spatial organization of the folds. This task is challenging at both methodological and conceptual levels. In the widely used registration-based techniques, these variations are considered as noise and the matching of folds is only implicit. Alternative approaches are based on the extraction and explicit identification of the cortical folds. In particular, representing cortical folding patterns as graphs of sulcal basins, termed sulcal graphs, makes it possible to formalize the task as a graph-matching problem. In this paper, we propose to address the problem of sulcal graph matching directly at the population level using multi-graph matching techniques. First, we motivate the relevance of the multi-graph matching framework in this context. We then introduce a procedure to generate populations of artificial sulcal graphs, which allows us to benchmark several state-of-the-art multi-graph matching methods. Our results on both artificial and real data demonstrate the effectiveness of multi-graph matching techniques to obtain a population-wise consistent labeling of cortical folds at the sulcal basins level.
    Personalized Decentralized Bilevel Optimization over Random Directed Networks. (arXiv:2210.02129v2 [stat.ML] UPDATED)
    Personalization and decentralization are two major lines of studies to realize practical federated learning in the real world. The aim of this study is to establish a general and unified approach that can solve these two problems simultaneously. In this work, we first propose a bilevel problem that can adapt to various personalization scenarios by allowing an arbitrary choice of two parameters: a client-wise outer-parameter representing heterogeneity, and a shared inner-parameter representing homogeneity across client data distributions. We then present an algorithm that can solve this bilevel problem in a decentralized manner by estimating gradients of clients' outer-costs with respect to their outer-parameters. We show that the proposed algorithm can be extended to handle a random directed network, which is one of the most robust decentralized communication classes. The proposed method achieves state-of-the-art performance on a personalization benchmark across various communication settings.
    Bayesian Learning for Neural Networks: an algorithmic survey. (arXiv:2211.11865v4 [stat.ML] UPDATED)
    The last decade witnessed a growing interest in Bayesian learning. Yet, the technicality of the topic and the multitude of ingredients involved therein, besides the complexity of turning theory into practical implementations, limit the use of the Bayesian learning paradigm, preventing its widespread adoption across different fields and applications. This self-contained survey engages and introduces readers to the principles and algorithms of Bayesian Learning for Neural Networks. It provides an introduction to the topic from an accessible, practical-algorithmic perspective. Upon providing a general introduction to Bayesian Neural Networks, we discuss and present both standard and recent approaches for Bayesian inference, with an emphasis on solutions relying on Variational Inference and the use of Natural gradients. We also discuss the use of manifold optimization as a state-of-the-art approach to Bayesian learning. We examine the characteristic properties of all the discussed methods, and provide pseudo-codes for their implementation, paying attention to practical aspects, such as the computation of the gradients.
    Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments. (arXiv:2301.13446v1 [cs.LG])
    We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study on model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.  ( 2 min )
    Continuous Soft Pseudo-Labeling in ASR. (arXiv:2211.06007v2 [cs.LG] UPDATED)
    Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving training speed and the accuracy of the final model. PL shares a common theme with teacher-student models such as distillation in that a teacher model generates targets that need to be mimicked by the student model being trained. However, interestingly, PL strategies in general use hard-labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution (aka soft-labels) over sequences as the target for unlabeled data, instead of a single best pass pseudo-labeled transcript (hard-labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that the reason this does not happen with hard-labels is that training loss on hard-labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we show several experiments that support this hypothesis, and experiment with several regularization approaches that can ameliorate the degenerate collapse when using soft-labels. These approaches can bring the accuracy of soft-labels closer to that of hard-labels, and while they are unable to outperform them yet, they serve as a useful framework for further improvements.  ( 2 min )
    Scaling laws for single-agent reinforcement learning. (arXiv:2301.13442v1 [cs.LG])
    Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law. One challenge in extending these results to reinforcement learning is that the main performance objective of interest, mean episode return, need not vary smoothly. To overcome this, we introduce *intrinsic performance*, a monotonic function of the return defined as the minimum compute required to achieve the given return across a family of models of different sizes. We find that, across a range of environments, intrinsic performance scales as a power law in model size and environment interactions. Consequently, as in generative modeling, the optimal model size scales as a power law in the training compute budget. Furthermore, we study how this relationship varies with the environment and with other properties of the training setup. In particular, using a toy MNIST-based environment, we show that varying the "horizon length" of the task mostly changes the coefficient but not the exponent of this relationship.  ( 2 min )
    Causal Estimation for Text Data with (Apparent) Overlap Violations. (arXiv:2210.00079v2 [stat.ML] UPDATED)
    Consider the problem of estimating the causal effect of some attribute of a text document; for example: what effect does writing a polite vs. rude email have on response time? To estimate a causal effect from observational data, we need to adjust for confounding aspects of the text that affect both the treatment and outcome -- e.g., the topic or writing level of the text. These confounding aspects are unknown a priori, so it seems natural to adjust for the entirety of the text (e.g., using a transformer). However, causal identification and estimation procedures rely on the assumption of overlap: for all levels of the adjustment variables, there is randomness leftover so that every unit could have (not) received treatment. Since the treatment here is itself an attribute of the text, it is perfectly determined, and overlap is apparently violated. The purpose of this paper is to show how to handle causal identification and obtain robust causal estimation in the presence of apparent overlap violations. In brief, the idea is to use supervised representation learning to produce a data representation that preserves confounding information while eliminating information that is only predictive of the treatment. This representation then suffices for adjustment and can satisfy overlap. Adapting results on non-parametric estimation, we find that this procedure is robust to conditional outcome misestimation, yielding a low-bias estimator with valid uncertainty quantification under weak conditions. Empirical results show strong improvements in bias and uncertainty quantification relative to the natural baseline.  ( 2 min )
    Unifying Generative Models with GFlowNets and Beyond. (arXiv:2209.02606v2 [cs.LG] UPDATED)
    There are many frameworks for deep generative modeling, each often presented with their own specific training algorithms and inference methods. Here, we demonstrate the connections between existing deep generative models and the recently introduced GFlowNet framework, a probabilistic inference machine which treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means for unifying training and inference algorithms, and provides a route to shine a unifying light over many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.  ( 2 min )
    Understanding Self-Distillation in the Presence of Label Noise. (arXiv:2301.13304v1 [cs.LG])
    Self-distillation (SD) is the process of first training a "teacher" model and then using its predictions to train a "student" model with the same architecture. Specifically, the student's objective function is $\xi \cdot \ell(\text{teacher's predictions}, \text{student's predictions}) + (1-\xi) \cdot \ell(\text{given labels}, \text{student's predictions})$, where $\ell$ is some loss function and $\xi$ is some parameter $\in [0,1]$. Empirically, SD has been observed to provide performance gains in several settings. In this paper, we theoretically characterize the effect of SD in two supervised learning problems with noisy labels. We first analyze SD for regularized linear regression and show that in the high label noise regime, the optimal value of $\xi$ that minimizes the expected error in estimating the ground truth parameter is surprisingly greater than 1. Empirically, we show that $\xi > 1$ works better than $\xi \leq 1$ even with the cross-entropy loss for several classification datasets when 50% or 30% of the labels are corrupted. Further, we quantify when optimal SD is better than optimal regularization. Next, we analyze SD in the case of logistic regression for binary classification with random label corruption and quantify the range of label corruption in which the student outperforms the teacher in terms of accuracy. To our knowledge, this is the first result of its kind for the cross-entropy loss.  ( 2 min )
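    The student objective stated in the abstract above can be written out as a short sketch. The squared loss is an assumed stand-in for the generic loss $\ell$, and the function names are illustrative rather than taken from the paper. The weight xi is deliberately not clipped to [0, 1], since the abstract reports that values above 1 can be optimal under heavy label noise.

```python
import numpy as np


def squared_loss(target, pred):
    """Mean squared error; an assumed stand-in for the generic loss l."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.mean((target - pred) ** 2))


def self_distillation_objective(teacher_pred, labels, student_pred, xi,
                                loss=squared_loss):
    """xi * l(teacher, student) + (1 - xi) * l(labels, student).

    xi is not restricted to [0, 1]: with xi > 1 the label term receives a
    negative weight, which is the regime the abstract highlights.
    """
    return (xi * loss(teacher_pred, student_pred)
            + (1.0 - xi) * loss(labels, student_pred))


if __name__ == "__main__":
    teacher = [1.0, 0.0]   # teacher's predictions
    labels = [0.0, 0.0]    # given (possibly noisy) labels
    student = [1.0, 0.0]   # student's predictions
    for xi in (0.0, 0.5, 1.0, 1.5):
        print(xi, self_distillation_objective(teacher, labels, student, xi))
```

    Setting xi = 0 recovers ordinary training on the given labels, and xi = 1 trains purely on the teacher's predictions.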
    Hierarchically Clustered PCA, LLE, and CCA via a Convex Clustering Penalty. (arXiv:2211.16553v2 [cs.LG] UPDATED)
    We introduce an unsupervised learning approach that combines the truncated singular value decomposition with convex clustering to estimate within-cluster directions of maximum variance/covariance (in the variables) while simultaneously hierarchically clustering (on observations). In contrast to previous work on joint clustering and embedding, our approach has a straightforward formulation, is readily scalable via distributed optimization, and admits a direct interpretation as hierarchically clustered principal component analysis (PCA), hierarchically clustered locally linear embedding (LLE), or hierarchically clustered canonical correlation analysis (CCA). Through numerical experiments and real-world examples relevant to precision medicine, we show that our approach outperforms traditional and contemporary clustering methods on both underdetermined problems ($p \gg N$ with tens of observations) and on large datasets (e.g., $N=100,000$) while yielding interpretable dendrograms of hierarchical per-cluster principal components or canonical variates.  ( 2 min )
    Learning Against Distributional Uncertainty: On the Trade-off Between Robustness and Specificity. (arXiv:2301.13565v1 [cs.LG])
    Trustworthy machine learning aims at combating distributional uncertainties in training data distributions compared to population distributions. Typical treatment frameworks include the Bayesian approach, (min-max) distributionally robust optimization (DRO), and regularization. However, two issues have to be raised: 1) All these methods are biased estimators of the true optimal cost; 2) the prior distribution in the Bayesian method, the radius of the distributional ball in the DRO method, and the regularizer in the regularization method are difficult to specify. This paper studies a new framework that unifies the three approaches and that addresses the two challenges mentioned above. The asymptotic properties (e.g., consistency and asymptotic normalities), non-asymptotic properties (e.g., unbiasedness and generalization error bound), and a Monte Carlo-based solution method of the proposed model are studied. The new model reveals the trade-off between the robustness to the unseen data and the specificity to the training data.  ( 2 min )
    Exploring QSAR Models for Activity-Cliff Prediction. (arXiv:2301.13644v1 [cs.LG])
    Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivity when the actual activity of one of the compounds is given. Graph isomorphism features are found to be competitive with or superior to classical molecular representations for AC-classification and can thus be employed as baseline AC-prediction models or simple compound-optimisation tools. For general QSAR-prediction, however, extended-connectivity fingerprints still consistently deliver the best performance. Our results provide strong support for the hypothesis that indeed QSAR methods frequently fail to predict ACs. We propose twin-network training for deep learning models as a potential future pathway to increase AC-sensitivity and thus overall QSAR performance.
    Bayesian Calibration of Imperfect Computer Models using Physics-Informed Priors. (arXiv:2201.06463v4 [stat.ML] UPDATED)
    We introduce a computationally efficient data-driven framework suitable for quantifying the uncertainty in physical parameters and model formulation of computer models, represented by differential equations. We construct physics-informed priors, which are multi-output GP priors that encode the model's structure in the covariance function. This is extended into a fully Bayesian framework that quantifies the uncertainty of physical parameters and model predictions. Since physical models often are imperfect descriptions of the real process, we allow the model to deviate from the observed data by considering a discrepancy function. For inference, Hamiltonian Monte Carlo is used. Further, approximations for big data are developed that reduce the computational complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(N\cdot m^2)$, where $m \ll N$. Our approach is demonstrated in simulation and real data case studies where the physics are described by time-dependent ODEs (cardiovascular models) and space-time dependent PDEs (heat equation). In the studies, it is shown that our modelling framework can recover the true parameters of the physical models in cases where 1) the reality is more complex than our modelling choice and 2) the data acquisition process is biased, while also producing accurate predictions. Furthermore, it is demonstrated that our approach is computationally faster than traditional Bayesian calibration methods.  ( 2 min )
    An $l_1$-oracle inequality for the Lasso in high-dimensional mixtures of experts models. (arXiv:2009.10622v5 [math.ST] UPDATED)
    Mixtures of experts (MoE) models are a popular framework for modeling heterogeneity in data, for both regression and classification problems in statistics and machine learning, due to their flexibility and the abundance of available statistical estimation and model choice tools. Such flexibility comes from allowing the mixture weights (or gating functions) in the MoE model to depend on the explanatory variables, along with the experts (or component densities). This permits the modeling of data arising from more complex data generating processes when compared to the classical finite mixtures and finite mixtures of regression models, whose mixing parameters are independent of the covariates. The use of MoE models in a high-dimensional setting, when the number of explanatory variables can be much larger than the sample size, is challenging from a computational point of view, and in particular from a theoretical point of view, where the literature is still lacking results for dealing with the curse of dimensionality, for both the statistical estimation and feature selection problems. We consider the finite MoE model with soft-max gating functions and Gaussian experts for high-dimensional regression on heterogeneous data, and its $l_1$-regularized estimation via the Lasso. We focus on the Lasso estimation properties rather than its feature selection properties. We provide a lower bound on the regularization parameter of the Lasso function that ensures an $l_1$-oracle inequality satisfied by the Lasso estimator according to the Kullback--Leibler loss.  ( 2 min )
    Optimal Transport Perturbations for Safe Reinforcement Learning with Robustness Guarantees. (arXiv:2301.13375v1 [cs.LG])
    Robustness and safety are critical for the trustworthy deployment of deep reinforcement learning in real-world decision making applications. In particular, we require algorithms that can guarantee robust, safe performance in the presence of general environment disturbances, while making limited assumptions on the data collection process during training. In this work, we propose a safe reinforcement learning framework with robustness guarantees through the use of an optimal transport cost uncertainty set. We provide an efficient, theoretically supported implementation based on Optimal Transport Perturbations, which can be applied in a completely offline fashion using only data collected in a nominal training environment. We demonstrate the robust, safe performance of our approach on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    A Unified Causal View of Domain Invariant Representation Learning. (arXiv:2208.06987v3 [stat.ML] UPDATED)
    Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. There are a wide range of proposals for mitigating this problem by learning representations that are "invariant" in some sense. However, these methods generally contradict each other, and none of them consistently improve performance on real-world domain shift benchmarks. There are two main questions that must be addressed to understand when, if ever, we should use each method. First, how does each ad hoc notion of "invariance" relate to the structure of real-world problems? And, second, when does learning invariant representations actually yield robust models? To address these issues, we introduce a broad formal notion of what it means for a real-world domain shift to admit invariant structure. Then, we characterize the causal structures that are compatible with this notion of invariance. With this in hand, we find conditions under which method-specific invariance notions correspond to real-world invariant structure, and we clarify the relationship between invariant structure and robustness to domain shifts. For both questions, we find that the true underlying causal structure of the data plays a critical role.  ( 2 min )
    DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R. (arXiv:2103.09603v4 [stat.ML] UPDATED)
    The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). It provides functionalities to estimate parameters in causal models based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML makes it possible to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables high flexibility in model specification and makes it easily extendable. This paper serves as an introduction to the double machine learning framework and the R package DoubleML. In reproducible code examples with simulated and real data sets, we demonstrate how DoubleML users can perform valid inference based on machine learning methods.  ( 2 min )
    Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond. (arXiv:2301.13486v1 [stat.ML])
    In this work we study the robustness to adversarial attacks of early-stopping strategies on gradient-descent (GD) methods for linear regression. More precisely, we show that early-stopped GD is optimally robust (up to an absolute constant) against Euclidean-norm adversarial attacks. However, we show that this strategy can be arbitrarily sub-optimal in the case of general Mahalanobis attacks. This observation is compatible with recent findings in the case of classification~\cite{Vardi2022GradientMP} that show that GD provably converges to non-robust models. To alleviate this issue, we propose to apply instead a GD scheme on a transformation of the data adapted to the attack. This data transformation amounts to applying feature-dependent learning rates, and we show that this modified GD is able to handle any Mahalanobis attack, as well as more general attacks under some conditions. Unfortunately, choosing such adapted transformations can be hard for general attacks. To the rescue, we design a simple and tractable estimator whose adversarial risk is optimal up to a multiplicative constant of 1.1124 in the population regime, and which works for any norm.  ( 2 min )
    Learning in POMDPs is Sample-Efficient with Hindsight Observability. (arXiv:2301.13857v1 [cs.LG])
    POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed during some point of the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.  ( 2 min )
    Gaussian Noise is Nearly Instance Optimal for Private Unbiased Mean Estimation. (arXiv:2301.13850v1 [math.ST])
    We investigate unbiased high-dimensional mean estimators in differential privacy. We consider differentially private mechanisms whose expected output equals the mean of the input dataset, for every dataset drawn from a fixed convex domain $K$ in $\mathbb{R}^d$. In the setting of concentrated differential privacy, we show that, for every input, such an unbiased mean estimator introduces approximately at least as much error as a mechanism that adds Gaussian noise with a carefully chosen covariance. This is true when the error is measured with respect to $\ell_p$ error for any $p \ge 2$. We extend this result to local differential privacy, and to approximate differential privacy, but for the latter the error lower bound holds either for a dataset or for a neighboring dataset. We also extend our results to mechanisms that take i.i.d. samples from a distribution over $K$ and are unbiased with respect to the mean of the distribution.  ( 2 min )
    On the Statistical Benefits of Temporal Difference Learning. (arXiv:2301.13289v1 [cs.LG])
    Given a dataset on actions and resulting long-term rewards, a direct estimation approach fits value functions that minimize prediction error on the training data. Temporal difference learning (TD) methods instead fit value functions by minimizing the degree of temporal inconsistency between estimates made at successive time-steps. Focusing on finite state Markov chains, we provide a crisp asymptotic theory of the statistical advantages of this approach. First, we show that an intuitive inverse trajectory pooling coefficient completely characterizes the percent reduction in mean-squared error of value estimates. Depending on problem structure, the reduction could be enormous or nonexistent. Next, we prove that there can be dramatic improvements in estimates of the difference in value-to-go for two states: TD's errors are bounded in terms of a novel measure - the problem's trajectory crossing time - which can be much smaller than the problem's time horizon.  ( 2 min )
    Fairness and Accuracy under Domain Generalization. (arXiv:2301.13323v1 [cs.LG])
    As machine learning (ML) algorithms are increasingly used in high-stakes applications, concerns have arisen that they may be biased against certain social groups. Although many approaches have been proposed to make ML models fair, they typically rely on the assumption that data distributions in training and deployment are identical. Unfortunately, this is commonly violated in practice and a model that is fair during training may lead to an unexpected outcome during its deployment. Although the problem of designing robust ML models under dataset shifts has been widely studied, most existing works focus only on the transfer of accuracy. In this paper, we study the transfer of both fairness and accuracy under domain generalization where the data at test time may be sampled from never-before-seen domains. We first develop theoretical bounds on the unfairness and expected loss at deployment, and then derive sufficient conditions under which fairness and accuracy can be perfectly transferred via invariant representation learning. Guided by this, we design a learning algorithm such that fair ML models learned with training data still have high fairness and accuracy when deployment environments change. Experiments on real-world data validate the proposed algorithm. Model implementation is available at https://github.com/pth1993/FATDM.  ( 2 min )
    Zero-shot-Learning Cross-Modality Data Translation Through Mutual Information Guided Stochastic Diffusion. (arXiv:2301.13743v1 [cs.CV])
    Cross-modality data translation has attracted great interest in image computing. Deep generative models (\textit{e.g.}, GANs) show performance improvement in tackling those problems. Nevertheless, as a fundamental challenge in image translation, the problem of Zero-shot-Learning Cross-Modality Data Translation with fidelity remains unanswered. This paper proposes a new unsupervised zero-shot-learning method named Mutual Information guided Diffusion cross-modality data translation Model (MIDiffusion), which learns to translate the unseen source data to the target domain. The MIDiffusion leverages a score-matching-based generative model, which learns the prior knowledge in the target domain. We propose a differentiable local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling. The $LMI$ captures the identical cross-modality features in the statistical domain for the diffusion guidance; thus, our method does not require retraining when the source domain is changed, as it does not rely on any direct mapping between the source and target domains. This advantage is critical for applying cross-modality data translation methods in practice, as a reasonable amount of source domain dataset is not always available for supervised training. We empirically show the advanced performance of MIDiffusion in comparison with an influential group of generative models, including adversarial-based and other score-matching-based models.  ( 2 min )
    On the Initialisation of Wide Low-Rank Feedforward Neural Networks. (arXiv:2301.13710v1 [stat.ML])
    The edge-of-chaos dynamics of wide randomly initialized low-rank feedforward networks are analyzed. Formulae for the optimal weight and bias variances are extended from the full-rank to low-rank setting and are shown to follow from multiplicative scaling. The principal second-order effect, the variance of the input-output Jacobian, is derived and shown to increase as the rank-to-width ratio decreases. These results inform practitioners how to randomly initialize feedforward networks with a reduced number of learnable parameters while in the same ambient dimension, allowing reductions in the computational cost and memory constraints of the associated network.  ( 2 min )
    The passive symmetries of machine learning. (arXiv:2301.13724v1 [stat.ML])
    Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another. These are the passive symmetries; they include coordinate freedom, gauge symmetry and units covariance, all of which have led to important results in physics. Our goal is to understand the implications of passive symmetries for machine learning: Which passive symmetries play a role (e.g., permutation symmetry in graph neural networks)? What are dos and don'ts in machine learning practice? We assay conditions under which passive symmetries can be implemented as group equivariances. We also discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. While this paper is purely conceptual, we believe that it can have a significant impact on helping machine learning make the transition that took place for modern physics in the first half of the Twentieth century.  ( 2 min )
    Variational sparse inverse Cholesky approximation for latent Gaussian processes via double Kullback-Leibler minimization. (arXiv:2301.13303v1 [stat.ML])
    To achieve scalable and accurate inference for latent Gaussian processes, we propose a variational approximation based on a family of Gaussian distributions whose covariance matrices have sparse inverse Cholesky (SIC) factors. We combine this variational approximation of the posterior with a similar and efficient SIC-restricted Kullback-Leibler-optimal approximation of the prior. We then focus on a particular SIC ordering and nearest-neighbor-based sparsity pattern resulting in highly accurate prior and posterior approximations. For this setting, our variational approximation can be computed via stochastic gradient descent in polylogarithmic time per iteration. We provide numerical comparisons showing that the proposed double-Kullback-Leibler-optimal Gaussian-process approximation (DKLGP) can sometimes be vastly more accurate than alternative approaches such as inducing-point and mean-field approximations at similar computational complexity.  ( 2 min )
    Fast Optimal Estimation with Intractable Models using Permutation-Invariant Neural Networks. (arXiv:2208.12942v2 [stat.ME] UPDATED)
    Neural networks have recently shown promise for likelihood-free inference, providing orders-of-magnitude speed-ups over classical methods. However, current implementations are suboptimal when estimating parameters from independent replicates. In this paper, we use a decision-theoretic framework to argue that permutation-invariant neural networks are ideally placed for constructing Bayes estimators for arbitrary models, provided that simulation from these models is straightforward. We show that the resulting neural Bayes estimators can quickly and optimally estimate parameters in weakly-identified and highly-parameterised models with relative ease, and that they are highly competitive and much faster than traditional likelihood-based estimators. We apply our estimator on a spatial analysis of sea-surface temperature in the Red Sea where, after training, we obtain parameter estimates, and uncertainty quantification of the estimates via bootstrap sampling, from hundreds of spatial fields in a fraction of a second.  ( 2 min )
    Optimal precision for GANs. (arXiv:2207.10541v2 [cs.LG] UPDATED)
    Many deep generative models are defined as a push-forward of a Gaussian measure by a continuous generator, such as Generative Adversarial Networks (GANs) or Variational Auto-Encoders (VAEs). This work explores the latent space of such deep generative models. A key issue with these models is their tendency to output samples outside of the support of the target distribution when learning disconnected distributions. We investigate the relationship between the performance of these models and the geometry of their latent space. Building on recent developments in geometric measure theory, we prove a sufficient condition for optimality in the case where the dimension of the latent space is larger than the number of modes. Through experiments on GANs, we demonstrate the validity of our theoretical results and gain new insights into the latent space geometry of these models. Additionally, we propose a truncation method that enforces a simplicial cluster structure in the latent space and improves the performance of GANs.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces I: the compact case. (arXiv:2208.14960v2 [stat.ME] UPDATED)
    Gaussian processes are arguably the most important model class in spatial statistics. They encode prior information about the modeled function and can be used for exact or approximate Bayesian inference. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    Demystifying Disagreement-on-the-Line in High Dimensions. (arXiv:2301.13371v1 [stat.ML])
    Evaluating the performance of machine learning models under distribution shift is challenging, especially when we only have unlabeled data from the shifted (target) domain, along with labeled data from the original (source) domain. Recent work suggests that the notion of disagreement, the degree to which two models trained with different randomness differ on the same input, is key to tackling this problem. Experimentally, disagreement and prediction error have been shown to be strongly connected, which has been used to estimate model performance. Experiments have led to the discovery of the disagreement-on-the-line phenomenon, whereby the classification error under the target domain is often a linear function of the classification error under the source domain; and whenever this property holds, disagreement under the source and target domains follows the same linear relation. In this work, we develop a theoretical foundation for analyzing disagreement in high-dimensional random features regression; and study under what conditions the disagreement-on-the-line phenomenon occurs in our setting. Experiments on CIFAR-10-C, Tiny ImageNet-C, and Camelyon17 are consistent with our theory and support the universality of the theoretical findings.  ( 2 min )
    Structure Learning and Parameter Estimation for Graphical Models via Penalized Maximum Likelihood Methods. (arXiv:2301.13269v1 [stat.ML])
    Probabilistic graphical models (PGMs) provide a compact and flexible framework to model very complex real-life phenomena. They combine probability theory, which deals with uncertainty, with a logical structure represented by a graph, which allows one to cope with the computational complexity and also to interpret and communicate the obtained knowledge. In the thesis, we consider two different types of PGMs: Bayesian networks (BNs), which are static, and continuous time Bayesian networks (CTBNs), which, as the name suggests, have a temporal component. We are interested in recovering their true structure, which is the first step in learning any PGM. This is a challenging task, which is interesting in itself from the causal point of view, for the purposes of interpretation of the model and the decision-making process. All approaches for structure learning in the thesis are united by the same idea of maximum likelihood estimation with the LASSO penalty. The problem of structure learning is reduced to the problem of finding non-zero coefficients in the LASSO estimator for a generalized linear model. In the case of CTBNs, we consider the problem both for complete and incomplete data. We support the theoretical results with experiments.  ( 2 min )
    Near Optimal Private and Robust Linear Regression. (arXiv:2301.13273v1 [cs.LG])
    We study the canonical statistical estimation problem of linear regression from $n$ i.i.d.~examples under $(\varepsilon,\delta)$-differential privacy when some response variables are adversarially corrupted. We propose a variant of the popular differentially private stochastic gradient descent (DP-SGD) algorithm with two innovations: a full-batch gradient descent to improve sample complexity and a novel adaptive clipping to guarantee robustness. When there is no adversarial corruption, this algorithm improves upon the existing state-of-the-art approach and achieves a near optimal sample complexity. Under label-corruption, this is the first efficient linear regression algorithm to guarantee both $(\varepsilon,\delta)$-DP and robustness. Synthetic experiments confirm the superiority of our approach.  ( 2 min )

  • Open

    Reinforcement Learning to Control a 2D Quadcopter
    submitted by /u/Alyx1337 [link] [comments]  ( 40 min )
    Adding a Bonus to Q-function
    Hey, I'm trying to apply RL, precisely DQN, to recommender systems. I want to add a weighted bonus to my Q-function to favor some actions over others. This bonus is a function that depends on (s,a) just like the Q-function. Then, instead of deriving a greedy policy from the original Q, I derive a greedy policy for the bonused function Q'. For several reasons, I would really like to do this instead of modifying the reward function itself. My approach works well empirically, but I would like it to be more theoretically grounded. Ideally, I want to prove that Q' is optimal for some MDP with a slightly different reward function, and surprisingly I can't find papers doing that in the literature. There are some seemingly related things like Soft Actor Critic, which adds an entropy bonus to the value function, but not much more. So, is it ok to tweak a Q-function like this? Is there something I should be careful of? Thank you for your help! submitted by /u/xalendrio [link] [comments]  ( 42 min )
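    A minimal sketch of the setup described in the post above, assuming a small tabular Q stored as a dict keyed by (state, action), a hypothetical bonus table, and a weight `w`; none of these names come from the post. It derives the greedy action from Q'(s,a) = Q(s,a) + w * b(s,a) and shows that an action-dependent bonus can change the greedy choice. One thing that can be checked directly: a bonus that depends only on the state shifts every action equally and leaves the greedy action unchanged.

```python
def greedy_action(q, bonus, state, actions, w=1.0):
    """Return argmax over actions of Q(s, a) + w * bonus(s, a) for one state."""
    return max(actions, key=lambda a: q[(state, a)] + w * bonus.get((state, a), 0.0))


# Toy values, purely illustrative (not from the post).
actions = ["item_a", "item_b"]
q = {("s0", "item_a"): 1.0, ("s0", "item_b"): 0.8}
bonus = {("s0", "item_b"): 0.5}  # hypothetical preference for item_b

plain = greedy_action(q, {}, "s0", actions)             # greedy w.r.t. Q
shaped = greedy_action(q, bonus, "s0", actions, w=1.0)  # greedy w.r.t. Q'
print(plain, shaped)
```

    Keeping the bonus outside the learned Q, as here, means it only affects action selection and never enters the bootstrapped targets, which is a different design from changing the reward.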
    Multi-Agents Soccer Competition ⚽ (Deep Reinforcement Learning Course by Hugging Face 🤗)
    Hey there 👋 We published the ⚔️ AI vs. AI challenge⚔️, a deep reinforcement learning multi-agents competition. You’ll learn about Multi-agent Reinforcement Learning (MARL), you’ll train your agents to play soccer and you’re going to participate in AI vs. AI challenge where your trained agent will compete against other classmates’ agents every day and be ranked on a new leaderboard. You don’t need to participate in the course to be able to participate in the competition. You can start here 👉 https://huggingface.co/deep-rl-course/unit7/introduction 🏆 The leaderboard 👉 https://huggingface.co/spaces/huggingface-projects/AIvsAI-SoccerTwos 👀 Visualize your agent competing with our demo 👉https://huggingface.co/spaces/unity/SoccerTwos We also created a discord channel, ai-vs-ai-competition to exchange with others and share advice, you can join our discord server here 👉 hf.co/discord/join https://preview.redd.it/4dc01ktqnlfa1.png?width=1920&format=png&auto=webp&s=c10ff68884683373c631648725aa166364d9494d If you have questions or feedback, I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 41 min )
    Stable Baselines3 not logging success rate for rollouts?
    I am training a PPO agent and have added the appropriate info to my custom environment to log success rate. This now means I am getting logs from my evaluation of success rate but I am not getting this for rollouts. I expect this is due to my environment creation code. As far as I can tell I've done what the documentation says. Am I missing something? env = SubprocVecEnv([lambda: Monitor(SumoEnv(gui=gui), info_keywords=("is_success",)) for i in range(num_envs)], start_method="spawn") submitted by /u/centripetalstranger [link] [comments]  ( 41 min )
    A discrete action in the continuous space?
    I’m currently using stable baselines3 and have 2 options in terms of action. Some acceleration values or a lane change. I broke this down into discrete actions. However, I want to use a continuous action space now and I can’t use a dict action space to keep the accelerations separate. Is it reasonable to make for example 1-3 in the continuous space represent accelerations and then 3-4 represent a lane change? By this I mean, should the agent pick 3.1, 3.2, 3.3, they will all result in a lane change. Will this cause issues? submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
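    A minimal sketch of the mapping asked about above, assuming a one-dimensional action in [1, 4] where [1, 3) encodes an acceleration and [3, 4] encodes a lane change. The function name, the acceleration range, and the choice to assign the boundary value 3 to the lane change are illustrative assumptions. It makes the concern concrete: 3.1, 3.2, and 3.3 all decode to the same action, so the environment gives the agent no signal that distinguishes them.

```python
def decode_action(x, low=1.0, split=3.0, high=4.0, max_accel=2.0):
    """Map one continuous action value to a (kind, value) pair."""
    x = min(max(x, low), high)  # clip into the valid range
    if x >= split:
        return ("lane_change", None)
    # Map [low, split) linearly onto [-max_accel, +max_accel).
    frac = (x - low) / (split - low)
    return ("accelerate", -max_accel + 2.0 * max_accel * frac)


for value in (1.0, 2.0, 3.1, 3.2, 3.3):
    print(value, decode_action(value))
```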
    Share Your Reinforcement Learning Interview Questions
    Hi, I have a technical interview coming up for an entry/mid-level RL Engineer position. I have just finished my master's degree and have some experience in RL from previous internships. I am currently using the CS 285 online lectures from UC Berkeley to prepare for the interview. I would really appreciate it if you could share the questions that you have faced or asked for similar positions, or recommend what specifics I should focus on for such interviews. submitted by /u/ZIGGY-Zz [link] [comments]  ( 42 min )
    ChatGPT and RL
    Hi, I am trying to do classification (it is one-step prediction and it is not sequence dependent). Question: does the success of ChatGPT mean that we can also use RL (PPO) for plain classification? My understanding is that we can just use supervised learning. Does PPO (RL) help in this case? Thanks! submitted by /u/Dense-Smf-6032 [link] [comments]  ( 42 min )
    Scaling laws for single-agent reinforcement learning (OpenAI)
    submitted by /u/goolulusaurs [link] [comments]  ( 40 min )
    What should the output of the actor network generally represent?
    Hi, I’m trying to understand some basic concepts of RL. I’m developing a model that should predict the sum of future rewards for any given state (a simplified version of Bellman’s equation). Then it should compare the actual future reward with its prediction using the loss function and backpropagate. This seems to be pretty standard. What I’m not getting is that when I’m generating my batch of data (for the offline training), I think the standard is to choose the action based on a categorical distribution over the predictions for each action (or to use epsilon greedy). The problem is that if I have any negative prediction, even if it’s random, it will never reach that state and never update based on it. Is that right? Is that how it’s supposed to be, or do I have the wrong concept of what the network should output? Thanks in advance! submitted by /u/enzodtz [link] [comments]  ( 44 min )
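    A minimal sketch of the two selection rules mentioned in the post above, assuming the network outputs one predicted return per action; the function names and toy values are illustrative. Passing the predictions through a softmax gives every action a strictly positive probability even when its prediction is negative, and epsilon-greedy picks a uniformly random action with probability epsilon, so neither rule permanently excludes an action.

```python
import math
import random


def softmax_probs(values, temperature=1.0):
    """Turn arbitrary real-valued predictions into a probability vector."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def epsilon_greedy(values, epsilon=0.1, rng=random):
    """With probability epsilon pick a random action, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])


preds = [-2.0, 0.5, 1.0]  # toy predictions, one per action
probs = softmax_probs(preds)
print(probs)
```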
    What does the following loss function mean in reinforcement learning?
    I am not an expert on policy learning or reinforcement learning, and I am studying this paper "https://openaccess.thecvf.com/content/ICCV2021/papers/Sun_Dynamic_Network_Quantization_for_Efficient_Video_Inference_ICCV_2021_paper.pdf" right now, but got confused by Eq. (11). The capital Omega is the action space and the small "a_i" is the i-th action in a trajectory. T is the total number of actions in this trajectory, and the small k is a selected action from the action space "Omega". I don't understand what symbol follows the 2nd summation (\sum) in the equation, and I am not sure if it is an "L" or an "I". Could someone provide some guidance on what the following function might mean, and why it can be used to achieve balanced policy usage? https://preview.redd.it/n4deietkaifa1.png?width=958&format=png&auto=webp&s=919453f2d7516bd2908c6a23fe6aa1e6fc04edf8 submitted by /u/AaronSpalding [link] [comments]  ( 43 min )
    RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed)
    I am trying to write my own actor critic algorithm. Unlike other implementations, I tried to keep a separate actor and critic network. The problem arises somewhere in my actor or critic loss function. The error originates here:
        advantage = nrml_disc_rewards-values
        critic_loss = advantage.pow(2).mean()
        actor_loss = -(torch.sum(torch.log(prob_batch)*advantage))
        policy_opt.zero_grad()
        actor_loss.backward()
        policy_opt.step()
        value_opt.zero_grad()
        critic_loss.backward()
        value_opt.step()
    This is the full traceback: D:\q_learning\actor_critic.py:90: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at C:\cb\pytorch_100000…  ( 44 min )
  • Open

    [D] Any open source model, or application to remove no speech parts of a video?
    Currently I am using DaVinci Resolve free edition to manually cut out the parts with no speech, or the parts where I take a breath. It is extremely time consuming, and I am pretty sure this can be done via AI. For example, Whisper is able to detect where we use filler words such as umh, um, uh, etc. It would be awesome to automatically remove these parts from a video. Just direct me where to look, thank you. submitted by /u/CeFurkan [link] [comments]  ( 43 min )
    [N] OpenAI starts selling subscriptions to its ChatGPT bot
    https://www.axios.com/2023/02/01/chatgpt-subscriptions-chatbot-openai Not fully paywalled, but there's a tiering system. submitted by /u/bikeskata [link] [comments]  ( 42 min )
    [D] Normalizing Flows in 2023?
    What is the state of research in normalizing flows in 2023? Have they been superseded by diffusion models for sample generation? If so, what are some other applications where normalizing flows are still SOTA (or even useful)? submitted by /u/wellfriedbeans [link] [comments]  ( 42 min )
    [D] Advice for a multi-label classification problem
    Hi guys, I have a dataset of 12,000 products, each of which consists of a title, description, and some images. In addition, I also have a pre-defined set of product categories. Curious to learn if anyone has any suggestions on what model to be used to train using this dataset as input to classify each product in the dataset into the related categories within the given set? submitted by /u/dle88 [link] [comments]  ( 42 min )
    [D] Why is stable diffusion much smaller than predecessors?
    Stable diffusion seems to be a departure from the trend of building larger and larger models. It has 10x less parameters than other image generation models like DALLE-2. “Incredibly, compared with DALL-E 2 and Imagen, the Stable Diffusion model is a lot smaller. While DALL-E 2 has around 3.5 Billion parameters, and Imagen has 4.6 Billion, the first Stable Diffusion model has just 890 million parameters, which means it uses a lot less VRAM and can actually be run on consumer-grade graphics cards.” What allows stable diffusion to work so well with a lot less parameters? Are there any drawbacks to this, like requiring stable diffusion to be fine tuned more than DALLE-2 for example? submitted by /u/dahdarknite [link] [comments]  ( 43 min )
    [R] Extracting Training Data from Diffusion Models
    https://twitter.com/eric_wallace_/status/1620449934863642624?s=46&t=GVukPDI7944N8-waYE5qcw Extracting training data from diffusion models is possible by following, more or less, these steps: (1) Compute CLIP embeddings for the images in a training dataset. (2) Perform an all-pairs comparison and mark the pairs with l2 distance smaller than some threshold as near duplicates. (3) Use the prompts for training samples marked as near duplicates to generate N synthetic samples with the trained model. (4) Compute the all-pairs l2 distance between the embeddings of generated samples for a given training prompt. (5) Build a graph where the nodes are generated samples and an edge exists if the l2 distance is less than some threshold. If the largest clique in the resulting graph is of size 10, then the training sample is considered to be memorized. (6) Visually inspect the results to determine if the samples considered to be memorized are similar to the training data samples. With this method, the authors were able to find samples from Stable Diffusion and Imagen corresponding to copyrighted training images. submitted by /u/pm_me_your_pay_slips [link] [comments]  ( 45 min )
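    The graph step in the summary above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: embeddings are plain tuples of floats, the threshold is arbitrary, "of size 10" is read as "at least 10", and the function names are hypothetical. The largest clique is found with a basic Bron-Kerbosch search, which is fine for a few hundred generated samples but not tuned for large graphs.

```python
import math
from itertools import combinations


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def build_graph(embeddings, threshold):
    """Connect two generated samples when their embeddings are closer than threshold."""
    adj = {i: set() for i in range(len(embeddings))}
    for i, j in combinations(range(len(embeddings)), 2):
        if l2(embeddings[i], embeddings[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def max_clique_size(adj):
    """Size of the largest clique, via Bron-Kerbosch without pivoting."""
    best = 0

    def expand(r_size, p, x):
        nonlocal best
        if not p and not x:  # the current clique is maximal
            best = max(best, r_size)
            return
        for v in list(p):
            expand(r_size + 1, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(0, set(adj), set())
    return best


def is_memorized(embeddings, threshold, clique_size=10):
    """Flag a prompt whose generations contain a large mutually-close group."""
    return max_clique_size(build_graph(embeddings, threshold)) >= clique_size


# Toy data: ten nearly identical generations plus two outliers.
tight = [(0.001 * i, 0.0) for i in range(10)]
far = [(5.0, 5.0), (-5.0, 3.0)]
print(is_memorized(tight + far, threshold=0.1))
```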
    [D] Vectorizing computation of the Jaccard similarity between all instances in a large dataset in Python
    I am trying to calculate the Jaccard similarity between all instances in my dataframe. I am using the following method to do so; however, it is painfully slow. My `data_with_labels` shape is (221277, 217).
        # Compute the Jaccard similarity between all instances
        n_instances = data_with_labels.shape[0]
        jaccard_similarity_matrix = np.zeros((n_instances, n_instances))
        for i in range(n_instances):
            for j in range(n_instances):
                jaccard_similarity_matrix[i, j] = jaccard_score(data_with_labels[i, :], data_with_labels[j, :], average='micro')
    Is there any way to do this process with numpy vectorization? I tried something like this but keep getting this error:
        n_instances = data_with_labels.shape[0]
        jaccard_similarity_matrix = np.zeros((n_instances, n_instances))
        for i in range(n_instances):
            jaccard_similarity_matrix[i, :] = jaccard_score(data_with_labels[i, :], data_with_labels, average='micro')
        ValueError: Found input variables with inconsistent numbers of samples: [217, 221277]
    submitted by /u/hopedallas [link] [comments]  ( 43 min )
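    Two observations on the post above, followed by a dependency-free sketch. First, the ValueError arises because `jaccard_score` compares two label arrays with the same number of samples, so passing one row (217 values) against the whole matrix (221277 rows) is rejected. Second, a dense 221277 x 221277 float64 matrix has about 4.9e10 entries, roughly 390 GB, so the full matrix is unlikely to fit in memory however fast each entry is computed. The sketch assumes the rows are binary (0/1) indicator vectors and uses the set definition of Jaccard similarity, |A and B| / |A or B|; it packs each row into one Python integer so that an intersection or union is a single bitwise operation. The function names are illustrative. With NumPy, the same intersection counts can be obtained for a block of rows at once as `X @ X.T`, with the union equal to the two row sums minus the intersection.

```python
def pack_row(row):
    """Pack a 0/1 sequence into one Python int (bit k is set when row[k] is truthy)."""
    bits = 0
    for k, v in enumerate(row):
        if v:
            bits |= 1 << k
    return bits


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def jaccard_bits(a, b):
    """Jaccard similarity of two bit-packed rows; two empty rows count as identical."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def pairwise_jaccard(rows):
    """Symmetric similarity matrix; only the upper triangle is computed."""
    packed = [pack_row(r) for r in rows]
    n = len(packed)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = jaccard_bits(packed[i], packed[j])
    return sim


rows = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0]]
sim = pairwise_jaccard(rows)
print(sim[0][1], sim[0][2])
```

    For a dataset of the size in the post, computing similarities block by block and keeping only the pairs above a threshold (or the top few neighbours per row) avoids materialising the full matrix.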
    [R] On the Expressive Power of Geometric Graph Neural Networks
    Geometric GNNs are an emerging class of GNNs for spatially embedded graphs in scientific and engineering applications, s.a. biomolecular structure, material science, and physical simulations. Notable examples include SchNet, DimeNet, Tensor Field Networks, and E(n) Equivariant GNNs. How powerful are geometric GNNs? How do key design choices influence expressivity and how to build maximally powerful ones? Check out this recent paper for more: 📄 PDF: http://arxiv.org/abs/2301.09308 💻 Code: http://github.com/chaitjo/geometric-gnn-dojo 💡Key findings: https://twitter.com/chaitjo/status/1617812402632019968 P.S. Are you new to Geometric GNNs, GDL, PyTorch Geometric, etc.? Want to understand how theory/equations connect to real code? Try this Geometric GNN 101 notebook before diving in: https://github.com/chaitjo/geometric-gnn-dojo/blob/main/geometric_gnn_101.ipynb submitted by /u/chaitjo [link] [comments]  ( 43 min )
    [P] An open source tool for repeatable PyTorch experiments by embedding your code in each model checkpoint
    I made a new open source tool called JellyML that lets you go back to any of your checkpoints, and reproduce your code exactly as it was when you trained it. You can find the website here: https://jellyml.com The GitHub repo: https://gitHub.com/mmulet/jellyml You can install it with pip: pip install jellyml submitted by /u/latefordinnerstudios [link] [comments]  ( 42 min )
    [D] What does a DL role look like in ten years?
    Every day, there seems to be new evidence of the generalization capabilities of LLMs. What does this mean for the future role of deep learning experts in academia and business? It seems like there's a significant chance that skills such as PyTorch and Jax will be displaced by prompt construction and off-the-shelf model APIs, with only a few large institutions working on the DNN itself. Curious to hear others' thoughts on this. submitted by /u/PassingTumbleweed [link] [comments]  ( 47 min )
    [D] Tortoise TTS API for GPT-3.
    Hey everyone, I had an idea to create a realistic, human-like voice assistant for ChatGPT. My question: can we build an API around Tortoise TTS trained on a specific voice? I've seen a lot of companies nowadays that provide very realistic text-to-speech solutions, like Eleven Labs etc. Do they train these voices on Tortoise TTS? If there is another way of creating highly realistic voices and exposing them through an API, please tell me how I can do it. Also, how can I make this process as fast as regular TTS? submitted by /u/akshaysri0001 [link] [comments]  ( 43 min )
    [P] predictive modeling- Multi stage classification
    Problem statement: assume a user comes into a system and it typically takes 10 weeks to observe the outcome (yes/no). I want to build a model that predicts the outcome in any particular week, i.e., how likely they are to succeed as of week 1, 2, 3, etc. Question on the model-building approach: should I build weekly models and get the predictions, or is there a better way to do it? Ideally it would be great to have a single model that can be used for different weeks; I prefer the latter. I'd appreciate your ideas. submitted by /u/R-PRADY [link] [comments]  ( 42 min )
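    One way to get the single model the post above prefers is to expand each user into one training row per observed week and include the week index as a feature, so the same classifier can be queried at week 1, 2, 3, and so on. The sketch below only builds those rows; the field names (`weekly`, `outcome`) and the two summary features are illustrative assumptions, and any classifier could be trained on the result.

```python
def build_training_rows(users, n_weeks=10):
    """Expand each user into one (features, label) row per observed week."""
    rows = []
    for user in users:
        observed = min(n_weeks, len(user["weekly"]))
        for week in range(1, observed + 1):
            seen = user["weekly"][:week]  # only data available by this week
            features = {
                "week": week,
                "latest": seen[-1],
                "mean_so_far": sum(seen) / len(seen),
            }
            rows.append((features, user["outcome"]))
    return rows


users = [{"weekly": [1.0, 3.0, 5.0], "outcome": 1}]  # toy record
for features, label in build_training_rows(users):
    print(features, label)
```

    Rows from the same user are correlated, so a train/validation split is safer when done by user rather than by row.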
    [P] NER output label post processing
    I’m looking to do some aggregation on academic research and news articles to see what insights I get from it. I’m using TextRazor to do named entity recognition on the documents, but I'm getting a lot of dirty labels that have slightly different wording, for example Tesla, Tesla ltd, Tesla Ltd. As a result, my aggregations have a lot of duplicate results. The dataset consists of about 4M labels, so the solution has to be efficient to be viable. I was thinking of putting the labels through word2vec and then clustering them based on the word embedding distances, but then the problem arises of how many clusters to use. I’ve also tried simple regex preprocessing to get rid of the company abbreviations, but there are other examples that cannot be solved that easily. submitted by /u/hasiemasie [link] [comments]  ( 43 min )
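    A minimal sketch of a cheap first pass for the problem in the post above, assuming most duplicates differ only in case, punctuation, and trailing legal suffixes. The suffix list and function names are illustrative, not exhaustive. Each label is reduced to a normalized key and labels are grouped by key in a single pass, which scales linearly with the number of labels and needs no choice of cluster count; only the keys that remain distinct would then need a costlier similarity or embedding step.

```python
import re
from collections import defaultdict

# Hypothetical, non-exhaustive list of trailing legal suffixes to drop.
LEGAL_SUFFIXES = {"ltd", "inc", "corp", "co", "llc", "plc", "gmbh"}


def normalize(label):
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", label.lower())
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_labels(labels):
    """Group raw labels that share the same normalized key."""
    groups = defaultdict(list)
    for label in labels:
        groups[normalize(label)].append(label)
    return dict(groups)


labels = ["Tesla", "Tesla ltd", "Tesla Ltd.", "Apple Inc."]
print(group_labels(labels))
```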
    [D] A report that compares the practices of high-performing companies in Europe to laggards in AI adoption
    Discussion about a report that compares the practices and attitudes of companies that self-report as ahead of the competition in AI adoption in Europe, compared to companies that identify as behind or at the same stage as their competitors. It contains some interesting findings mixed with some somewhat obvious things. Kinda obvious that leading companies also are further ahead in using MLOps, but I thought it was interesting to see the frequency of fine-tuning and retraining. Not as obvious that most companies report a lack of access to training data, would have thought that is mostly something that smaller companies have issues with. Also not so obvious to me is that companies with a centralized decision-making related to AI seem to dominate among high-performers. Interesting that most companies seem to get some value out of their AI/ML projects, which seems to contradict some of the previous forecasts by the big consultancy companies. Link to the report: https://stagezero.ai/2022-survey-report/ submitted by /u/madnessone1 [link] [comments]  ( 43 min )
    [R] EMNLP video interviews, workshops, and posters
    I learned a lot at EMNLP in December and captured some of what I learned in this video. Interviews I asked five NLP researchers these questions: 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? Their answers start at 01:22. Workshops I got to spend time at these workshops: Generation, Evaluation & Metrics (GEM) Massively Multilingual NLU Blackbox NLP My main takeaways are at 09:25. Posters If you've been to a conference you'd know there's an overwhelming number of posters. I recorded four of the ones I came across and thought were interesting (covering retrieval-augmented text generation, human evaluation, the BLOOM multimodal dataset, and a multimodal method to name music playlists). Poster presentations start at 14:38 Full video: https://www.youtube.com/watch?v=plCvF_7qrmY What's your answer to these questions? 1- What is the most exciting development in NLP in 2022 2- What are you looking forward to in 2023? 3- What is an underrated idea that the field should pay more attention to? submitted by /u/jayalammar [link] [comments]  ( 43 min )
    [P] A CLI tool for easy transformer sequence classifier training and inference
    Hi everyone, I have developed a CLI tool to train a transformer sequence classification model. There are also options for preprocessing data and inference on new data. I was thinking that interesting use cases might be found within economics/finance and biological domains, and would be super interested in feedback on: - if the documentation is intelligible and enables you to use it - to which use cases from your industry/domain could discrete sequence modelling be applied - what additional features you'd need for it to be useful to you Basically, where would the prediction of a class (or the next item) based on discrete events/objects/tokens be useful? The project is called "sequifier" and can be found here: https://github.com/0xideas/sequifier submitted by /u/0xideas [link] [comments]  ( 43 min )
    [P] Self Hostable OpenAI Alternative
    Hi, Text-Generator.io is now self-hostable. It's priced at $1000 USD per instance per year to self-host. The service runs on a single 24GB VRAM GPU and runs all services, including speech to text, text and code generation for almost all languages, and embedding generation. The text generator also downloads and analyses any linked input, including documents, images, images with text inside, and webpages, for better understanding and to generate better text. It's a great alternative to OpenAI and has a compatible API, making switching easy. You can check out the new pricing here. Let me know what you think and if there's anything I can do to help! All the best. Lee Penkman - Founder Text-Generator.io submitted by /u/leepenkman [link] [comments]  ( 44 min )
    [R] SETI finds eight potential alien signals with ML
    GitHub (sadly without weights). https://github.com/PetchMa/ML_GBT_SETI News. https://www-scinexx-de.translate.goog/news/kosmos/seti-findet-acht-potenzielle-alien-signale/?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=de&_x_tr_pto=wapp submitted by /u/logTom [link] [comments]  ( 44 min )
    [R] Faithful Chain-of-Thought Reasoning
    Paper : https://arxiv.org/abs/2301.13379 Abstract : While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a faithful-by-construction framework that decomposes a reasoning task into two stages: Translation (Natural Language query → symbolic reasoning chain) and Problem Solving (reasoning chain → answer), using an LM and a deterministic solver respectively. We demonstrate the efficacy of our approach on 10 reasoning datasets from 4 diverse domains. It outperforms traditional CoT prompting on 9 out of the 10 datasets, with an average accuracy gain of 4.4 on Math Word Problems, 1.9 on Planning, 4.0 on Multi-hop Question Answering (QA), and 18.1 on Logical Inference, under greedy decoding. Together with self-consistency decoding, we achieve new state-of-the-art few-shot performance on 7 out of the 10 datasets, showing a strong synergy between faithfulness and accuracy. submitted by /u/starstruckmon [link] [comments]  ( 44 min )
    [D] Audio segmentation - Machine learning algorithm to segment an audio file into multiple classes
    Can someone suggest a machine learning model that will segment an audio spectrogram into multiple classes? I have labeled heartbeat data: S1, S2, systole, and diastole. How do I train a segmentation model? submitted by /u/PlayfulMenu1395 [link] [comments]  ( 42 min )
  • Open

    MusicLM: Generating Music From Text - a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff"
    submitted by /u/magenta_placenta [link] [comments]  ( 40 min )
    Just wanted to share this story created by Chat-GPT3
    Once upon a time, in the Hundred Acre Wood, Winnie the Pooh was feeling down in the dumps. All of his friends were busy with their own activities and he was feeling left out. Pooh decided to take a walk and ended up stumbling upon a mysterious honey pot that had "BB" written on it. Curious, Pooh started to investigate and discovered that the honey pot was actually a cover for a hidden laboratory run by a character similar to Jesse Pinkman from Breaking Bad. The character was cooking up a batch of special honey, which Pooh was immediately drawn to. Despite the character's warnings, Pooh couldn't resist the delicious aroma and sneaked a taste. He was shocked to find that the honey was not only the sweetest he had ever tasted, but it also gave him a burst of energy and focus that he had never experienced before. Intrigued, Pooh started to spend more time with the character, helping him with his honey making and learning about the science behind it. Pooh soon discovered that the character was using the honey to pay for his granddaughter's medical treatment, just like in Breaking Bad. Pooh felt a sense of camaraderie with the character and wanted to help. Together, they came up with a plan to create a legitimate business selling the special honey, but with a focus on using the profits to help others in need. As their business grew, so did their friendship. Pooh was no longer feeling left out, as he was now a part of something important and fulfilling. The two worked together, using their newfound skills and knowledge, to bring joy and happiness to the residents of the Hundred Acre Wood. And so, Winnie the Pooh and his friend from Breaking Bad lived happily ever after, spreading the sweet nectar of kindness and generosity to all. submitted by /u/v1ll3_m [link] [comments]  ( 42 min )
    Perplexity Ask is now available as a Chrome extension. With AI help, you can read quick answers from your extension bar, click on sources, and navigate to http://perplexity.ai when needed. Searches are filled in with a single click from Google and Bing:
    submitted by /u/rafs2006 [link] [comments]  ( 41 min )
    OpenAI rolls out ChatGPT Plus for $20 a month
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Just dropped
    submitted by /u/zCaptainBr0 [link] [comments]  ( 41 min )
    📌[Searchcolab] "Future of National Park in USA due to Climate Change". Link in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Gmail creator says ChatGPT might "destroy" Google within 2 years
    submitted by /u/ExperienceKCC [link] [comments]  ( 40 min )
    Career Advice?
    im a junior in high school and up until now i had no idea what i wanted to study. after this wave of ai stuff happened and i realized how interested i am in it, im wondering if maybe i should study something involving ai. i have no idea where to start though. software engineer? machine learning engineer? i would love if someone could help decide what i actually would want to do for a career and what i should major in to get me there. thank you. submitted by /u/nicdunz [link] [comments]  ( 41 min )
    All this fuss over Open AI political bias is blather until...
    Biden types in, "To what extent should our military be involved in the war in Ukraine?" submitted by /u/yoitscoach [link] [comments]  ( 40 min )
    Timecoded video and text to artificial voiceover
    Screenshot of an example Hello, I have a question. It is now much easier to transcribe a video with artificial intelligence and then import the result into Premiere or Vegas to create subtitles inserted in the video image itself. But can you think of a way to convert that same text to an artificial voice and create an audio file or track with the same timecode, or simply synced with the original video? It could be done with any text-to-speech converter, but then you would have to manually cut and wrap each piece of text. Example: 1 00:00:00,000 --> 00:00:02,520 In this space we like to feel a bit of everything, 2 00:00:02,600 --> 00:00:04,600 but the important thing is to make dance. 3 00:00:04,680 --> 00:00:06,680 A task performed by all digiles 4 00:00:06,760 --> 00:00:12,160 and one who has been doing this since the 80s is the maestro Maik 5 00:00:12,240 --> 00:00:14,240 Good evening Maik. submitted by /u/mamomo1 [link] [comments]  ( 41 min )
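    The timecode side of the question above can be scripted. The sketch below parses SRT-style cues like the ones in the example and computes how much silence would precede each synthesized clip. It is an illustration under assumptions: the function names are hypothetical, the input is assumed to be standard SRT text with blank lines between cues, and the text-to-speech step itself is deliberately left out.

```python
import re

# Matches an SRT timing line such as "00:00:02,600 --> 00:00:04,600".
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_ms(h, m, s, ms):
    """Convert hours, minutes, seconds, milliseconds to total milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text):
    """Return a list of (start_ms, end_ms, text) cues from SRT text."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = CUE_TIME.search(line)
            if match:
                parts = match.groups()
                start, end = to_ms(*parts[:4]), to_ms(*parts[4:])
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues


def silence_gaps(cues):
    """Milliseconds between the end of one cue and the start of the next."""
    gaps, cursor = [], 0
    for start, end, _ in cues:
        gaps.append(max(0, start - cursor))
        cursor = end
    return gaps


sample = """1
00:00:00,000 --> 00:00:02,520
In this space we like to feel a bit of everything,

2
00:00:02,600 --> 00:00:04,600
but the important thing is to make dance.
"""
cues = parse_srt(sample)
gaps = silence_gaps(cues)
```

    The gaps assume each synthesized clip lasts exactly as long as its cue. Real clips will run shorter or longer, so they would need to be padded, trimmed, or placed at `start_ms` on a timeline rather than simply concatenated.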
    The History of Artificial Intelligence: Understanding the Brain, explores Reinforcement Learning and Perceptron
    What makes us think? What is inside the brain that makes us conscious? Can we build a universal AI machine to study and understand the universe? https://www.youtube.com/watch?v=AsXx9gyh39M https://preview.redd.it/endsstfshlfa1.png?width=1920&format=png&auto=webp&s=7677e3d0d463e39de0b423a6881ce3876acbe06b submitted by /u/Ok-District-4701 [link] [comments]  ( 40 min )
    See Roy Lichtenstein's Staggering 0-year-old Portrait In Sharp Focus!
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    Flawless AI lets you change the dialogue on a video and the lips sync absolutely perfectly to each word. Could be big for the movie industry.
    submitted by /u/Dalembert [link] [comments]  ( 44 min )
    The steam engine changed the world. Artificial intelligence could destroy it. - The Boston Globe
    submitted by /u/GlobeOpinion [link] [comments]  ( 40 min )
    Frida Kahlo Paints Grandmother & Grandchild W/ Contemplative Rainforest Vibe
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    The Best Curated List of A.I. Newsletters Ever (Feb, 2023 with Twitter handles).
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    OpenAI’s new ChatGPT tool may help you tell if text was written by a human or AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Google is reportedly testing an alternate home page with ChatGPT-style Q&A prompts
    This is laughable. They were sitting on all of the technology, and now they scramble to do something better than 10 links. I, for one, will be disappointed with anything less than the movie Her. It's a high bar, maybe. I would not expect personality. Maybe some rudimentary memory. But the ability to perform almost any digital task must be there. It can be built in a garage using open source projects. COME ON. Some good programmers and a hackathon. Yes, I am waiting for the Stability AI model. Or maybe the GPT-3 API can be used. But submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    Top 9 generative AI tools
    In 2023, generative AI tools will disrupt how we create and share content. What are your favorite generative AI tools? AI avatar - Synthesia AI-generated automations - Bardeen.ai Copy - copy.ai Personalized videos - Rephrase.ai Video editing - Descript Content creation - Type Studio Voice over - Murf.ai Design - Designs.ai Background music - Soundraw Read the full article https://www.bardeen.ai/posts/generative-ai-tools submitted by /u/Intelligent_Shop_012 [link] [comments]  ( 41 min )
    OpenAI Has Launched ChatGPT Content Detection Tool
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    What is Google's MusicLM? (What are your impressions of it?)
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
  • Open

    The Future of AI: GPT-3 vs GPT-4: A Comparative Analysis
    In this post, we will dive deep into the world of Artificial Intelligence and take a closer look at two of the most advanced AI algorithms… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
    ChatGPT’s authorship: Is it time to redefine authorship in the age of AI?
    In this blog post, we will take a closer look at the implications of ChatGPT’s authorship, the role of AI in scientific literature, and… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 8 min )
    Day 7: Advance SQL For Data Science
    So far this is the 7th blog in the journey of basics to advance SQL. you can refer to previous blogs for learning SQL from scratch, This…  ( 8 min )
    How Linear Regression leads to Logistic Regression
    Linear & Logistic: The Relationship Between Regression Models Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Meet Mr.ChatGPT: A Large Language Model Trained by OpenAI
    Hello and welcome to the blog! My name is ChatGPT, and I am a large language model trained by OpenAI.  P.S. This article includes a use…  ( 9 min )
  • Open

    MIT Solve announces 2023 global challenges and Indigenous Communities Fellowship
    More than $1 million in funding available to selected Solver teams and fellows.  ( 7 min )
  • Open

    How to decide between Amazon Rekognition image and video API for video moderation
    Almost 80% of today’s web content is user-generated, creating a deluge of content that organizations struggle to analyze with human-only processes. The availability of consumer information helps them make decisions, from buying a new pair of jeans to securing home loans. In a recent survey, 79% of consumers stated they rely on user videos, comments, […]  ( 10 min )
    Scaling distributed training with AWS Trainium and Amazon EKS
    Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which are already in excess of 100 billion parameters. Although larger models tend to be more powerful, training such models requires significant computational resources. Even with the use of advanced distributed training libraries like FSDP and […]  ( 11 min )
  • Open

    The Flan Collection: Advancing open source methods for instruction tuning
    Posted by Shayne Longpre, Student Researcher, and Adam Roberts, Senior Staff Software Engineer, Google Research, Brain Team Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often that they hadn’t seen before. The ability to reason on new tasks is mostly credited to training models on a wide variety of unique instructions, known as “instruction tuning”, which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remain unreleased to the broader research community.  In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly ava…  ( 92 min )
  • Open

    Introducing ChatGPT Plus
    We’re launching a pilot subscription plan for ChatGPT, a conversational AI that can chat with you, answer follow-up questions, and challenge incorrect assumptions. The new subscription plan, ChatGPT Plus, will be available for $20/month, and subscribers will receive a number of benefits: General access to ChatGPT, even  ( 2 min )
  • Open

    Train YOLOv8 on Custom Dataset – A Complete Tutorial
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Study: Superconductivity switches on and off in “magic-angle” graphene
    submitted by /u/keghn [link] [comments]  ( 40 min )
    Deltas and Delta-Deltas Features Explained
    Hi guys, I have made a video on YouTube here where I explain how deltas and delta-deltas features are computed. These are used quite a lot in speech recognition systems. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 41 min )
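    For readers who want to try delta and delta-delta features directly, here is a minimal NumPy sketch of the commonly used regression formula d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2). The window size N = 2 and the edge padding are assumptions, and the video may present a different variant.

```python
import numpy as np


def delta(features, N=2):
    """Regression-based deltas for a (T, D) feature matrix.

    Frames beyond the edges are filled by repeating the first/last frame.
    """
    features = np.asarray(features, dtype=float)
    T = features.shape[0]
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features)
    for n in range(1, N + 1):
        # padded[N + n + t] is c_{t+n}; padded[N - n + t] is c_{t-n}.
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / denom


# A linear ramp has slope 1, so interior deltas should be 1.
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = delta(ramp)        # deltas
dd = delta(d)          # delta-deltas: deltas of the deltas
```

    Delta-deltas are obtained by applying the same function to the deltas. In a speech front end the three matrices are typically stacked along the feature axis.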
    Help NeuralNetwork on Python and RapidMiner
    Hi right now i have to implement a neural network from rapidminer into a python script to predict a value, but i cant get what is wrong with my program please help. import pandas as pd import math def sig(x): return 1 / (1 + math.exp(-x)) #Funcion de perceptron class Perceptron: #Constructor def __init__(self,weights,bias): self.weights = weights self.bias = bias self.output = 0 def setOutput(self,value): self.output = value def getBias(self): return self.bias def getOutput(self): return self.output def guess(self,input): sum = 0 for i in range(10): sum = sum + (input[i]*self.weights[i]) sum = sum + self.bias self.output = sig(sum) return self.output class Output: def __init__(self,nodes,threshold,name): self.nodes = nodes self.threshold = threshold self.name = name def guess(self,input):…  ( 42 min )
  • Open

    Meet the Omnivore: Architectural Researcher Lights Up Omniverse Scenes With ‘SunPath’ Extension
    Things are a lot sunnier these days for designers looking to visualize their projects in NVIDIA Omniverse, a platform for creating and operating metaverse applications.  ( 6 min )
    Deloitte’s Nitin Mittal on the Secrets of ‘All-In’ AI Success
    Artificial intelligence is the new electricity. The fifth industrial revolution. And companies that go all-in on AI are reaping the rewards. So how do you make that happen? That big question — how? — is explored by Nitin Mittal, principal at Deloitte, one of the world’s largest professional services organizations, and co-author Thomas Davenport in Read article >  ( 4 min )
  • Open

    Will ChatGPT Make Fraud Easier?
    Less than 24 hours after posting my previous Data Science Central article (here), dozens of illegitimate copies started to pop up on various websites. Below is an example (title + first paragraph): Fake: An experimental guide to the Riemann conjecture — the correct term is heuristic evidence. It is a strong argument based on empirical evidence rather… Read More »Will ChatGPT Make Fraud Easier? The post Will ChatGPT Make Fraud Easier? appeared first on Data Science Central.  ( 22 min )
  • Open

    FedDig: Robust Federated Learning Using Data Digest to Represent Absent Clients. (arXiv:2210.00737v3 [cs.LG] UPDATED)
    Federated Learning (FL) is a collaborative learning performed by a moderator that protects data privacy. Existing cross-silo FL solutions seldom address the absence of participating clients during training which can seriously degrade model performances, particularly for unbalanced and non-IID client data. We address this issue by generating secure data digests from the raw data and using them to guide model training at the FL moderator. The proposed FL with data digest (FedDig) framework can tolerate unexpected client absence while preserving data privacy. This is achieved by de-identifying digests by mixing and perturbing the encoded features of the raw data in the feature space. The feature perturbing is performed following the Laplace mechanism of Differential Privacy. We evaluate FedDig on EMNIST, CIFAR-10, and CIFAR-100 datasets. The results consistently outperform three baseline algorithms (FedAvg, FedProx, and FedNova) by large margins in multiple client absence scenarios.  ( 2 min )
    Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic. (arXiv:2206.05733v2 [cs.LG] UPDATED)
    Decentralized Actor-Critic (AC) algorithms have been widely utilized for multi-agent reinforcement learning (MARL) and have achieved remarkable success. Apart from their empirical success, the theoretical convergence properties of decentralized AC algorithms are largely unexplored. Most of the existing finite-time convergence results are derived based on either a double-loop update or a two-timescale step-size rule, and this is the case even for centralized AC algorithms under a single-agent setting. In practice, the \emph{single-timescale} update is widely utilized, where actor and critic are updated in an alternating manner with step sizes being of the same order. In this work, we study a decentralized \emph{single-timescale} AC algorithm. Theoretically, using linear approximation for value and reward estimation, we show that the algorithm has sample complexity of $\tilde{\mathcal{O}}(\varepsilon^{-2})$ under Markovian sampling, which matches the optimal complexity with a double-loop implementation (here, $\tilde{\mathcal{O}}$ hides a logarithmic term). When we reduce to the single-agent setting, our result yields new sample complexity for centralized AC using a single-timescale update scheme. Central to establishing our complexity results is \emph{the hidden smoothness of the optimal critic variable} that we reveal. We also provide a local action privacy-preserving version of our algorithm and its analysis. Finally, we conduct experiments to show the superiority of our algorithm over the existing decentralized AC algorithms.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, making it possible to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing its effectiveness on both synthetic and real-world data.
    FedCliP: Federated Learning with Client Pruning. (arXiv:2301.06768v2 [cs.LG] UPDATED)
    The prevalent communication-efficient federated learning (FL) frameworks usually take advantage of model gradient compression or model distillation. However, the unbalanced local data distributions (either in quantity or quality) of participating clients, which contribute non-equivalently to the global model training, still pose a big challenge to these works. In this paper, we propose FedCliP, a novel communication-efficient FL framework that allows faster model training by adaptively learning which clients should remain active for further model training and pruning those with less potential contribution, which should be inactive. We also introduce an alternative optimization method with a newly defined contribution score measure to facilitate active and inactive client determination. We empirically evaluate the communication efficiency of FL frameworks with extensive experiments on three benchmark datasets under both IID and non-IID settings. Numerical results demonstrate that the proposed FedCliP framework outperforms state-of-the-art FL frameworks, i.e., FedCliP can save 70% of communication overhead with only 0.2% accuracy loss on the MNIST dataset, and save 50% and 15% of communication overhead with less than 1% accuracy loss on the FMNIST and CIFAR-10 datasets, respectively.  ( 2 min )
    Mega: Moving Average Equipped Gated Attention. (arXiv:2209.10655v3 [cs.LG] UPDATED)
    The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.  ( 2 min )
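    The "(exponential) moving average" in the abstract above builds on a simple recurrence. The sketch below shows only the plain scalar form y_t = alpha * x_t + (1 - alpha) * y_{t-1}. It is not Mega's actual layer, and the smoothing factor `alpha` and the zero initial state are assumptions made for illustration.

```python
def ema(xs, alpha=0.5):
    """Plain exponential moving average with a zero initial state."""
    out, prev = [], 0.0
    for x in xs:
        # Recent inputs get weight alpha; older history decays by (1 - alpha).
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


smoothed = ema([1.0, 1.0, 1.0], alpha=0.5)
```

    Because each output mixes the current input with a geometrically decaying summary of earlier ones, the recurrence favors nearby positions, which is the kind of local, position-aware bias the abstract refers to.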
    Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size. (arXiv:2211.11092v2 [cs.LG] UPDATED)
    Training large neural networks is known to be time-consuming, with the learning duration taking days or even weeks. To address this problem, large-batch optimization was introduced. This approach demonstrated that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude. While long training time was not typically a major issue for model-free deep offline RL algorithms, recently introduced Q-ensemble methods achieving state-of-the-art performance made this issue more relevant, notably extending the training duration. In this work, we demonstrate how this class of methods can benefit from large-batch optimization, which is commonly overlooked by the deep offline RL community. We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time, effectively shortening training duration by 3-4x on average.  ( 2 min )
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Transformers over Directed Acyclic Graphs. (arXiv:2210.13148v3 [cs.LG] UPDATED)
    Transformer models have recently gained popularity in graph representation learning as they have the potential to learn complex relationships beyond the ones captured by regular graph neural networks. The main research question is how to inject the structural bias of graphs into the transformer architecture, and several proposals have been made for undirected molecular graphs and, recently, also for larger network graphs. In this paper, we study transformers over directed acyclic graphs (DAGs) and propose architecture adaptations tailored to DAGs: (1) An attention mechanism that is more efficient than the regular quadratic complexity of transformers and at the same time faithfully captures the DAG structure, and (2) a positional encoding of the DAG's partial order, complementing the former. We rigorously evaluate our framework in ablation studies and show that it is effective in improving different kinds of baseline transformers over various types of data, in experiments ranging from classifying source code graphs to nodes in self-citation networks. In particular, our proposal makes (graph) transformers competitive to or outperform graph neural networks tailored to DAGs.  ( 2 min )
    Learning under Data Drift with Time-Varying Importance Weights. (arXiv:2210.01422v2 [cs.LG] UPDATED)
    Real-world deployment of machine learning models is challenging when data evolves over time. And data does evolve over time. While no model can work when data evolves in an arbitrary fashion, if there is some pattern to these changes, we might be able to design methods to address it. This paper addresses situations when data evolves gradually. We introduce a novel time-varying importance weight estimator that can detect gradual shifts in the distribution of data. Such an importance weight estimator allows the training method to selectively sample past data -- not just similar data from the past like a standard importance weight estimator would but also data that evolved in a similar fashion in the past. Our time-varying importance weight is quite general. We demonstrate different ways of implementing it that exploit some known structure in the evolution of data. We demonstrate and evaluate this approach on a variety of problems ranging from supervised learning tasks (multiple image classification datasets) where the data undergoes a sequence of gradual shifts of our design to reinforcement learning tasks (robotic manipulation and continuous control) where data undergoes a shift organically as the policy or the task changes.
    Exploring Efficient-tuning Methods in Self-supervised Speech Models. (arXiv:2210.06175v3 [eess.AS] UPDATED)
    In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning pre-trained models for each downstream task is parameter-inefficient since SSL models are notoriously large with millions of parameters. Adapters are lightweight modules commonly used in NLP to solve this problem. In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained. Given the lack of studies generally exploring the effectiveness of adapters for self-supervised speech tasks, we intend to fill this gap by adding various adapter modules in pre-trained speech SSL models. We show that performance parity can be achieved with over 90% parameter reduction, and discuss the pros and cons of efficient tuning techniques. This is the first comprehensive investigation of various adapter types across speech tasks.
    Aging with GRACE: Lifelong Model Editing with Discrete Key-Value Adaptors. (arXiv:2211.11031v2 [cs.LG] UPDATED)
    Large pre-trained models decay over long-term deployment as input distributions shift, user requirements change, or crucial knowledge gaps are discovered. Recently, model editors have been proposed to modify a model's behavior by adjusting its weights during deployment. However, when editing the same model multiple times, these approaches quickly degrade the model's performance on upstream data and forget how to fix previous errors. We propose and study a novel Lifelong Model Editing setting, where streaming errors are identified for a deployed model and we update the model to correct its predictions without influencing unrelated inputs and without access to training edits, exogenous datasets, or any upstream data for the edited model. To approach this problem, we introduce General Retrieval Adaptors for Continual Editing, or GRACE, which learns to cache a chosen layer's activations in an adaptive codebook as edits stream in, leaving original model weights frozen. GRACE can thus edit models thousands of times in a row using only streaming errors, without influencing unrelated inputs. Experimentally, we show that GRACE improves over recent alternatives and generalizes to unseen inputs. Our code is available at https://www.github.com/thartvigsen/grace.
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.
    Transformers Can Be Expressed In First-Order Logic with Majority. (arXiv:2210.02671v3 [cs.LG] UPDATED)
    Characterizing the implicit structure of the computation within neural networks is a foundational problem in the area of deep learning interpretability. Can the inner decision process of neural networks be captured symbolically in some familiar logic? We show that any fixed-precision transformer neural network can be translated into an equivalent fixed-size $\mathsf{FO}(\mathsf{M})$ formula, i.e., a first-order logic formula that, in addition to standard universal and existential quantifiers, may also contain majority-vote quantifiers. The proof idea is to design highly uniform boolean threshold circuits that can simulate transformers, and then leverage known theoretical connections between circuits and logic. Our results reveal a surprisingly simple formalism for capturing the behavior of transformers, show that simple problems like integer division are "transformer-hard", and provide valuable insights for comparing transformers to other models like RNNs. Our results suggest that first-order logic with majority may be a useful language for expressing programs extracted from transformers.
    Segmenting thalamic nuclei from manifold projections of multi-contrast MRI. (arXiv:2301.06114v2 [eess.IV] UPDATED)
    The thalamus is a subcortical gray matter structure that plays a key role in relaying sensory and motor signals within the brain. Its nuclei can atrophy or otherwise be affected by neurological disease and injuries including mild traumatic brain injury. Segmenting both the thalamus and its nuclei is challenging because of the relatively low contrast within and around the thalamus in conventional magnetic resonance (MR) images. This paper explores imaging features to determine key tissue signatures that naturally cluster, from which we can parcellate thalamic nuclei. Tissue contrasts include T1-weighted and T2-weighted images, MR diffusion measurements including FA, mean diffusivity, Knutsson coefficients that represent fiber orientation, and synthetic multi-TI images derived from FGATIR and T1-weighted images. After registration of these contrasts and isolation of the thalamus, we use the uniform manifold approximation and projection (UMAP) method for dimensionality reduction to produce a low-dimensional representation of the data within the thalamus. Manual labeling of the thalamus provides labels for our UMAP embedding from which k nearest neighbors can be used to label new unseen voxels in that same UMAP embedding. N -fold cross-validation of the method reveals comparable performance to state-of-the-art methods for thalamic parcellation.
    A Sequential Concept Drift Detection Method for On-Device Learning on Low-End Edge Devices. (arXiv:2212.09637v2 [cs.LG] UPDATED)
    A practical issue with edge AI systems is that the data distributions of the training dataset and the deployed environment may differ due to noise and environmental changes over time. This phenomenon is known as concept drift; the resulting gap degrades the performance of edge AI systems and may introduce system failures. To address this gap, retraining neural network models when concept drift is detected is a practical approach. However, because available compute resources are strictly limited on edge devices, in this paper we propose a fully sequential concept drift detection method that works together with an on-device sequential learning technique for neural networks. In this case, both the neural network retraining and the proposed concept drift detection are done only by sequential computation to reduce computation cost and memory utilization. Evaluation results show that, while accuracy decreases by 3.8%-4.3% compared to existing batch-based detection methods, the proposed approach reduces memory size by 88.9%-96.4% and execution time by 1.3%-83.8%. The combination of neural network retraining and the proposed concept drift detection method is demonstrated on a Raspberry Pi Pico, which has 264 kB of memory.
    Transformer-based Modeling of Physical Systems: Improved Latent Representations. (arXiv:2210.11269v4 [cs.LG] UPDATED)
    Many phenomena from physics and engineering require highly flexible models, and have ample data with which to fit. However, this data is often irregularly sampled, and cannot be processed as it is by standard deep learning architecture. We propose a transformer-based model for forecasting physical processes at arbitrary spatial points given information on a related process at possibly different points. This architecture is particularly well-suited for high-altitude wind forecasting, as it can effectively leverage large volumes of data recorded along plane trajectories, which are sparse in space. We test at different scales for two different dynamical systems previously studied in the literature: the Poisson equation and Darcy Flow equation. In both cases, our transformer-based model outperforms alternative methods. We hypothesize that this superior performance is due to a more flexible latent representation. To support this hypothesis, we design a simple synthetic experiment to show that the latent representation of the other models suffers from excessive bottlenecking that is, in some cases, preventing the efficient use of the information and slowing training.
    Fast, Sample-Efficient, Affine-Invariant Private Mean and Covariance Estimation for Subgaussian Distributions. (arXiv:2301.12250v1 [cs.LG])
    We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\varepsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \tfrac d {\alpha^2} + \tfrac{d \sqrt{\log 1/\delta}}{\alpha \varepsilon}+\frac{d\log 1/\delta}{\varepsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat \mu$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega - 1} + nd/\varepsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n\gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n= o(d^2)$ samples, answering an open question posed by Alabi et al. (2022). With $n\gtrsim d^2$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.  ( 2 min )
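    The Mahalanobis error named in the abstract above can be made concrete with a small worked example. The sketch below is not the paper's private estimator; it only evaluates the error metric $\|\mu - \hat{\mu}\|_{\Sigma} = \|\Sigma^{-1/2}(\mu - \hat{\mu})\|_2$ for a known, positive-definite covariance. The function names `cholesky` and `mahalanobis_error` are hypothetical, and plain Python lists stand in for vectors and matrices.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L @ L^T == a (a must be positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def mahalanobis_error(mu, mu_hat, sigma):
    """Compute sqrt((mu - mu_hat)^T Sigma^{-1} (mu - mu_hat)).

    Illustrative helper (an assumption, not from the paper): with
    Sigma = L L^T, the quantity equals ||L^{-1} (mu - mu_hat)||_2,
    which is obtained by forward substitution.
    """
    diff = [a - b for a, b in zip(mu, mu_hat)]
    L = cholesky(sigma)
    z = []
    for i in range(len(diff)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / L[i][i])
    return math.sqrt(sum(v * v for v in z))


# With identity covariance the metric reduces to Euclidean distance.
euclid = mahalanobis_error([0.0, 0.0], [3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]])
# A direction with larger variance is penalised less.
scaled = mahalanobis_error([0.0, 0.0], [2.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

    The second call illustrates why the metric is described as measuring distance relative to $\Sigma$: an offset of 2 along a coordinate with variance 4 counts as one standard deviation.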
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.
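    To make the quantities in the bound above tangible, here is a minimal sketch that computes the independence number of a small feedback graph by brute force and evaluates the leading terms $(\sum_{t=1}^T \alpha_t)^{1/2} + \max_{t}\alpha_t$. It is not the paper's algorithm. It assumes edges are treated as undirected and self-loops are ignored when computing $\alpha_t$, it drops the logarithmic factors hidden by $\widetilde{\mathcal{O}}$, and the function names are hypothetical.

```python
import math
from itertools import combinations


def independence_number(n, edges):
    """Size of the largest set of nodes 0..n-1 with no edge between any pair.

    Brute force over subsets, so only suitable for small graphs.
    Assumption: edges are undirected and self-loops are ignored.
    """
    adjacent = {frozenset(e) for e in edges if e[0] != e[1]}
    for size in range(n, 0, -1):
        for subset in combinations(range(n), size):
            if all(frozenset(pair) not in adjacent
                   for pair in combinations(subset, 2)):
                return size
    return 0


def regret_leading_terms(alphas):
    """Evaluate sqrt(sum_t alpha_t) + max_t alpha_t, ignoring log factors."""
    return math.sqrt(sum(alphas)) + max(alphas)


# Two example feedback graphs on 4 arms: a 4-cycle and a complete graph.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
complete = [(i, j) for i in range(4) for j in range(i + 1, 4)]
alpha_cycle = independence_number(4, cycle)
alpha_complete = independence_number(4, complete)
```

    A denser feedback graph has a smaller independence number, which is why richer side observations shrink the bound.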
    Coronal Hole Analysis and Prediction using Computer Vision and LSTM Neural Network. (arXiv:2301.06732v2 [astro-ph.SR] UPDATED)
    As humanity has begun to explore space, the significance of space weather has become apparent. It has been established that coronal holes, a type of space weather phenomenon, can impact the operation of aircraft and satellites. The coronal hole is an area on the sun characterized by open magnetic field lines and relatively low temperatures, which result in the emission of the solar wind at higher than average rates. In this study, to prepare for the impact of coronal holes on the Earth, we use computer vision to detect the coronal hole region and calculate its size based on images from the Solar Dynamics Observatory (SDO). We then implement deep learning techniques, specifically the Long Short-Term Memory (LSTM) method, to analyze trends in the coronal hole area data and predict its size for different sun regions over 7 days. By analyzing time series data on the coronal hole area, this study aims to identify patterns and trends in coronal hole behavior and understand how they may impact space weather events. This research represents an important step towards improving our ability to predict and prepare for space weather events that can affect Earth and technological systems.
    Probabilistic Time Series Forecasting for Adaptive Monitoring in Edge Computing Environments. (arXiv:2211.13729v2 [cs.DC] UPDATED)
    With increasingly more computation being shifted to the edge of the network, monitoring of critical infrastructures, such as intermediate processing nodes in autonomous driving, is further complicated due to the typically resource-constrained environments. In order to reduce the resource overhead on the network link imposed by monitoring, various methods have been discussed that either follow a filtering approach for data-emitting devices or conduct dynamic sampling based on employed prediction models. Still, existing methods are mainly requiring adaptive monitoring on edge devices, which demands device reconfigurations, utilizes additional resources, and limits the sophistication of employed models. In this paper, we propose a sampling-based and cloud-located approach that internally utilizes probabilistic forecasts and hence provides means of quantifying model uncertainties, which can be used for contextualized adaptations of sampling frequencies and consequently relieves constrained network resources. We evaluate our prototype implementation for the monitoring pipeline on a publicly available streaming dataset and demonstrate its positive impact on resource efficiency in a method comparison.  ( 2 min )
    Binary Classification for High Dimensional Data using Supervised Non-Parametric Ensemble Method. (arXiv:2202.07779v2 [cs.LG] UPDATED)
    High-dimensional data creates many difficulties for machine learning classification algorithms. Generalization can be improved using ensemble learning methods such as the bagging-based, supervised, non-parametric random forest algorithm. In this paper we address binary classification of high-dimensional data by applying a random forest to a polycystic ovary syndrome dataset. We implement the method and provide a detailed visualization of the data for general inference. We achieve a training accuracy of 95.6% and a validation accuracy of over 91.74%.
    Layer Ensembles. (arXiv:2210.04882v2 [cs.LG] UPDATED)
    Deep Ensembles, as a type of Bayesian Neural Networks, can be used to estimate uncertainty on the prediction of multiple neural networks by collecting votes from each network and computing the difference in those predictions. In this paper, we introduce a method for uncertainty estimation that considers a set of independent categorical distributions for each layer of the network, giving many more possible samples with overlapped layers than in the regular Deep Ensembles. We further introduce an optimized inference procedure that reuses common layer outputs, achieving up to 19x speed up and reducing memory usage quadratically. We also show that the method can be further improved by ranking samples, resulting in models that require less memory and time to run while achieving higher uncertainty quality than Deep Ensembles.
    Level-$k$ Meta-Learning for Pedestrian-Aware Self-Driving. (arXiv:2212.08800v2 [cs.RO] UPDATED)
    The potential market for modern self-driving cars is enormous, as they are developing remarkably rapidly. At the same time, however, pedestrian fatalities caused by autonomous driving have been recorded at road crossings. In this paper, we incorporate level-$k$ thinking into MAML to create Level-$k$ Meta Reinforcement Learning (LK-MRL), a self-driving vehicle model that prepares for heterogeneous pedestrians and improves intersection safety by combining meta reinforcement learning with the human cognitive hierarchy framework. In our evaluation, we assign this model to two different cognitive confrontation hierarchy scenarios in an urban traffic simulator to demonstrate not only its advantage in road safety but also its ability to produce higher-level thinking strategies.
    Data Origin Inference in Machine Learning. (arXiv:2211.13416v2 [cs.LG] UPDATED)
    Utilizing unintended memorization in ML models to benefit real-world applications is a growing direction, with recent efforts such as user auditing, dataset ownership inference, and forgotten data measurement. From the standpoint of ML model development, we introduce a process named data origin inference to assist ML developers in locating missed or faulty data origins in a training set without maintaining strenuous metadata. We formally define the data origin and the data origin inference task in the development of the ML model (mainly neural networks). Then we propose a novel inference strategy combining embedded-space multiple instance classification and shadow training. Diverse use cases cover language, visual and structured data, with various kinds of data origin (e.g. business, county, movie, mobile user, text author). A comprehensive performance analysis of our proposed strategy covers referenced target model layers, available testing data for each origin, and, in shadow training, the implementations of feature extraction as well as shadow models. Our best inference accuracy reaches 98.96% in the language use case when the target model is a transformer-based deep neural network. Furthermore, we give a statistical analysis of different kinds of data origin to investigate what kind of origin is likely to be inferred correctly.
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there is growing interest in Deep Riemannian Networks (DRNs) possibly combining the advantages of both previous classes of methods. However, there are still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need of handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.
    Temporal Label Smoothing for Early Event Prediction. (arXiv:2208.13764v2 [cs.LG] UPDATED)
    Models that can predict the occurrence of events ahead of time with low false-alarm rates are critical to the acceptance of decision support systems in the medical community. This challenging task is typically treated as a simple binary classification, ignoring temporal dependencies between samples, whereas we propose to exploit this structure. We first introduce a common theoretical framework unifying dynamic survival analysis and early event prediction. Following an analysis of objectives from both fields, we propose Temporal Label Smoothing (TLS), a simpler, yet best-performing method that preserves prediction monotonicity over time. By focusing the objective on areas with a stronger predictive signal, TLS improves performance over all baselines on two large-scale benchmark tasks. Gains are particularly notable along clinically relevant measures, such as event recall at low false-alarm rates. TLS reduces the number of missed events by up to a factor of two over previously used approaches in early event prediction.  ( 2 min )
    Dynamic Network Reconfiguration for Entropy Maximization using Deep Reinforcement Learning. (arXiv:2205.13578v2 [cs.LG] UPDATED)
    A key problem in network theory is how to reconfigure a graph in order to optimize a quantifiable objective. Given the ubiquity of networked systems, such work has broad practical applications in a variety of situations, ranging from drug and material design to telecommunications. The large decision space of possible reconfigurations, however, makes this problem computationally intensive. In this paper, we cast the problem of network rewiring for optimizing a specified structural property as a Markov Decision Process (MDP), in which a decision-maker is given a budget of modifications that are performed sequentially. We then propose a general approach based on the Deep Q-Network (DQN) algorithm and graph neural networks (GNNs) that can efficiently learn strategies for rewiring networks. We then discuss a cybersecurity case study, i.e., an application to the computer network reconfiguration problem for intrusion protection. In a typical scenario, an attacker might have a (partial) map of the system they plan to penetrate; if the network is effectively "scrambled", they would not be able to navigate it since their prior knowledge would become obsolete. This can be viewed as an entropy maximization problem, in which the goal is to increase the surprise of the network. Indeed, entropy acts as a proxy measurement of the difficulty of navigating the network topology. We demonstrate the general ability of the proposed method to obtain better entropy gains than random rewiring on synthetic and real-world graphs while being computationally inexpensive, as well as being able to generalize to larger graphs than those seen during training. Simulations of attack scenarios confirm the effectiveness of the learned rewiring strategies.  ( 3 min )
    Deep Learning-based Spatially Explicit Emulation of an Agent-Based Simulator for Pandemic in a City. (arXiv:2205.14396v2 [cs.MA] UPDATED)
    Agent-Based Models are very useful for simulation of physical or social processes, such as the spreading of a pandemic in a city. Such models proceed by specifying the behavior of individuals (agents) and their interactions, and parameterizing the infection process arising from these interactions according to the geography and demography of the city. However, such models are computationally very expensive, and the complexity is often linear in the total number of agents. This seriously limits the usage of such models for simulations, which often have to be run hundreds of times for policy planning and even model parameter estimation. An alternative is to develop an emulator, a surrogate model that can predict the Agent-Based Simulator's output based on its initial conditions and parameters. In this paper, we discuss a Deep Learning model based on a Dilated Convolutional Neural Network that can emulate such an agent-based model with high accuracy. We show that using this model instead of the original Agent-Based Model provides major gains in the speed of simulations, allowing much quicker calibration to observations and more extensive scenario analysis. The models we consider are spatially explicit, as the locations of the infected individuals are simulated instead of the gross counts. Another aspect of our emulation framework is its divide-and-conquer approach, which divides the city into several small overlapping blocks and carries out the emulation on them in parallel, after which the results are merged. This ensures that the same emulator can work for a city of any size, and also significantly improves the time complexity of the emulator compared to the original simulator.  ( 2 min )
    BiAdam: Fast Adaptive Bilevel Optimization Methods. (arXiv:2106.11396v3 [math.OC] UPDATED)
    Bilevel optimization has recently attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and meta learning. Although many bilevel optimization methods have recently been proposed, these methods do not consider using adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms including (stochastic) gradient-based algorithms. To fill this gap, in this paper we propose a novel fast adaptive bilevel framework to solve stochastic bilevel optimization problems in which the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices including many types of adaptive learning rates, and can flexibly use momentum and variance-reduction techniques. In particular, we provide a useful convergence analysis framework for bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary solution (i.e., $\mathbb{E}\|\nabla F(x)\| \leq \epsilon$ or its equivalent variants). Meanwhile, we propose an accelerated version of the BiAdam algorithm (VR-BiAdam) using a variance-reduction technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ without relying on large batch sizes. To the best of our knowledge, we are the first to study bilevel optimization methods with adaptive learning rates. Experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of our algorithms.  ( 2 min )
    Contrastive Credibility Propagation for Reliable Semi-Supervised Learning. (arXiv:2211.09929v2 [cs.LG] UPDATED)
    Inferencing unlabeled data from labeled data is an error-prone process. Conventional neural network training is highly sensitive to supervision errors. These two realities make semi-supervised learning (SSL) troublesome. In practice, SSL approaches often fail to outperform their fully supervised baseline. Proposed is a novel framework for deep SSL via transductive pseudo-label refinement called Contrastive Credibility Propagation (CCP). Through an iterative process of refining soft pseudo-labels, CCP unifies a novel contrastive approach for generating pseudo-labels and a powerful technique to overcome instance-dependent label noise. The result is an SSL classification framework explicitly designed to overcome inevitable pseudo-label errors. Using standard text and image benchmark classification datasets, we show CCP reliably boosts or matches performance over a supervised baseline in four common real-world SSL scenarios: few-label, open-set, noisy-label, and class distribution misalignment.  ( 2 min )
    Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. (arXiv:2207.13243v5 [cs.LG] UPDATED)
    The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.  ( 3 min )
    Emergent Linguistic Structures in Neural Networks are Fragile. (arXiv:2210.17406v5 [cs.LG] UPDATED)
    Large Language Models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structure. In this work, we propose a framework and measure of robustness to assess the consistency of linguistic representations against syntax-preserving perturbations. We leverage recent advances in extracting linguistic constructs from LLMs to test the robustness of such structures. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving manipulations. Emergent syntactic representations in neural networks are brittle; our work thus draws attention to the risk of comparing such structures to those that are the object of a long-standing debate in linguistics.
    Interpretable (not just posthoc-explainable) medical claims modeling for discharge placement to prevent avoidable all-cause readmissions or death. (arXiv:2208.12814v3 [cs.CY] UPDATED)
    We developed an inherently interpretable multilevel Bayesian framework for representing variation in regression coefficients that mimics the piecewise linearity of ReLU-activated deep neural networks. We used the framework to formulate a survival model for using medical claims to predict hospital readmission and death that focuses on discharge placement, adjusting for confounding in estimating causal local average treatment effects. We trained the model on a 5% sample of Medicare beneficiaries from 2008 and 2011, based on their 2009--2011 inpatient episodes, and then tested the model on 2012 episodes. The model scored an AUROC of approximately 0.76 on predicting all-cause readmissions -- defined using official Centers for Medicare and Medicaid Services (CMS) methodology -- or death within 30 days of discharge, being competitive against XGBoost and a Bayesian deep neural network, demonstrating that one need not sacrifice interpretability for accuracy. Crucially, because the model is a regression model, it provides what black boxes cannot -- the exact gold-standard global interpretation of the model, identifying relative risk factors and quantifying the effect of discharge placement. We also show that the posthoc explainer SHAP fails to provide accurate explanations.  ( 2 min )
    Over-The-Air Federated Learning Over Scalable Cell-free Massive MIMO. (arXiv:2212.06482v2 [eess.SP] UPDATED)
    Cell-free massive MIMO is emerging as a promising technology for future wireless communication systems, which is expected to offer uniform coverage and high spectral efficiency compared to classical cellular systems. We study in this paper how cell-free massive MIMO can support federated edge learning. Taking advantage of the additive nature of the wireless multiple access channel, over-the-air computation is exploited, where the clients send their local updates simultaneously over the same communication resource. This approach, known as over-the-air federated learning (OTA-FL), is proven to alleviate the communication overhead of federated learning over wireless networks. Considering channel correlation and only imperfect channel state information available at the central server, we propose a practical implementation of OTA-FL over cell-free massive MIMO. The convergence of the proposed implementation is studied analytically and experimentally, confirming the benefits of cell-free massive MIMO for OTA-FL.
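The over-the-air aggregation idea in the abstract above can be illustrated with a minimal sketch. It assumes an idealized additive channel with unit gains and optional Gaussian receiver noise; the paper itself considers correlated channels and imperfect channel state information, which are not modeled here. The function name `ota_aggregate` and its parameters are hypothetical.

```python
import random

def ota_aggregate(client_updates, noise_std=0.0, seed=0):
    """Average client updates as an idealized additive channel would deliver them."""
    rng = random.Random(seed)
    num_clients = len(client_updates)
    dim = len(client_updates[0])
    # All clients transmit at once, so the receiver only observes the
    # element-wise sum of their signals plus receiver noise (assumed Gaussian).
    received = [
        sum(update[j] for update in client_updates) + rng.gauss(0.0, noise_std)
        for j in range(dim)
    ]
    # Dividing the superimposed signal by the client count gives a (noisy) average.
    return [value / num_clients for value in received]

updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
noiseless = ota_aggregate(updates)            # exact average when noise_std is 0
noisy = ota_aggregate(updates, noise_std=0.1) # average perturbed by channel noise
```

The server never sees an individual update, only the superimposed sum, which is why a single communication resource suffices regardless of the number of clients.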
    Beyond Hawkes: Neural Multi-event Forecasting on Spatio-temporal Point Processes. (arXiv:2211.02922v2 [cs.LG] UPDATED)
    Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for simultaneous multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.
    G-Rep: Gaussian Representation for Arbitrary-Oriented Object Detection. (arXiv:2205.11796v2 [cs.CV] UPDATED)
    Typical representations for arbitrary-oriented object detection tasks include oriented bounding box (OBB), quadrilateral bounding box (QBB), and point set (PointSet). Each representation encounters problems that correspond to its characteristics, such as the boundary discontinuity, square-like problem, representation ambiguity, and isolated points, which lead to inaccurate detection. Although many effective strategies have been proposed for various representations, there is still no unified solution. Current detection methods based on Gaussian modeling have demonstrated the possibility of breaking this dilemma; however, they remain limited to OBB. To go further, in this paper, we propose a unified Gaussian representation called G-Rep to construct Gaussian distributions for OBB, QBB, and PointSet, which achieves a unified solution to various representations and problems. Specifically, PointSet or QBB-based object representations are converted into Gaussian distributions, and their parameters are optimized using the maximum likelihood estimation algorithm. Then, three optional Gaussian metrics are explored to optimize the regression loss of the detector because of their excellent parameter optimization mechanisms. Furthermore, we also use Gaussian metrics for sampling to align label assignment and regression loss. Experimental results on several publicly available datasets, such as DOTA, HRSC2016, UCAS-AOD, and ICDAR2015, show the excellent performance of the proposed method for arbitrary-oriented object detection.
    MM-GNN: Mix-Moment Graph Neural Network towards Modeling Neighborhood Feature Distribution. (arXiv:2208.07012v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have shown expressive performance on graph representation learning by aggregating information from neighbors. Recently, some studies have discussed the importance of modeling neighborhood distribution on the graph. However, most existing GNNs aggregate neighbors' features through a single statistic (e.g., mean, max, sum), which loses the information related to the neighbors' feature distribution and therefore degrades the model performance. In this paper, inspired by the method of moments in statistical theory, we propose to model the neighbors' feature distribution with multi-order moments. We design a novel GNN model, namely Mix-Moment Graph Neural Network (MM-GNN), which includes a Multi-order Moment Embedding (MME) module and an Element-wise Attention-based Moment Adaptor module. MM-GNN first calculates the multi-order moments of the neighbors for each node as signatures, and then uses an Element-wise Attention-based Moment Adaptor to assign larger weights to important moments for each node and update node representations. We conduct extensive experiments on 15 real-world graphs (including social networks, citation networks and web-page networks etc.) to evaluate our model, and the results demonstrate the superiority of MM-GNN over existing state-of-the-art models.
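To make the idea of multi-order moment signatures concrete, here is a small sketch that computes them per node. It is not the MM-GNN implementation: the use of central moments for orders two and above, the function name `neighbor_moments`, and the toy graph are all assumptions, and the attention-based adaptor is omitted. Every listed node is assumed to have at least one neighbor.

```python
def neighbor_moments(features, neighbors, max_order=3):
    """Return, per node, the order-1..max_order moments of its neighbors' features."""
    signatures = {}
    for node, nbrs in neighbors.items():
        dim = len(features[node])
        rows = [features[n] for n in nbrs]
        # Order 1: per-dimension mean of the neighbor features.
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        moments = [mean]
        # Orders >= 2: per-dimension central moments around that mean (assumed form).
        for order in range(2, max_order + 1):
            moments.append([
                sum((r[j] - mean[j]) ** order for r in rows) / len(rows)
                for j in range(dim)
            ])
        signatures[node] = moments
    return signatures

# Nodes 0 and 1 have neighborhoods with the same mean but different spread.
features = {0: [0.0], 1: [1.0], 2: [3.0], 3: [2.0], 4: [2.0]}
neighbors = {0: [1, 2], 1: [3, 4]}
signatures = neighbor_moments(features, neighbors, max_order=2)
```

In this toy graph a mean aggregator cannot tell node 0 from node 1, while the second-order moment separates them, which is the kind of information loss the abstract attributes to single-statistic aggregation.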
    Graphically Structured Diffusion Models. (arXiv:2210.11633v2 [cs.LG] UPDATED)
    We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem domains that are more traditionally solved by algorithms such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This problem specification should contain a graphical model describing relationships between variables, and often benefits from explicit representation of subcomputations. Permutation invariances can also be exploited. Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy.  ( 2 min )
    DyFormer: A Scalable Dynamic Graph Transformer with Provable Benefits on Generalization Ability. (arXiv:2111.10447v3 [cs.LG] UPDATED)
    Transformers have achieved great success in several domains, including Natural Language Processing and Computer Vision. However, their application to real-world graphs is less explored, mainly due to their high computation cost and their poor generalizability caused by the lack of enough training data in the graph domain. To fill this gap, we propose a scalable Transformer-like dynamic graph learning method named Dynamic Graph Transformer (DyFormer) with spatial-temporal encoding to effectively learn graph topology and capture implicit links. To achieve efficient and scalable training, we propose a temporal-union graph structure and its associated subgraph-based node sampling strategy. To improve the generalization ability, we introduce two complementary self-supervised pre-training tasks and show that jointly optimizing the two pre-training tasks results in a smaller Bayesian error rate via an information-theoretic analysis. Extensive experiments on the real-world datasets illustrate that DyFormer achieves a consistent 1%-3% AUC gain (averaged over all time steps) compared with baselines on all benchmarks.  ( 2 min )
    FED-CD: Federated Causal Discovery from Interventional and Observational Data. (arXiv:2211.03846v2 [cs.LG] UPDATED)
    Causal discovery, the inference of causal relations from data, is a core task of fundamental importance in all scientific domains, and several new machine learning methods for addressing the causal discovery problem have been proposed recently. However, existing machine learning methods for causal discovery typically require that the data used for inference is pooled and available in a centralized location. In many domains of high practical importance, such as in healthcare, data is only available at local data-generating entities (e.g. hospitals in the healthcare context), and cannot be shared across entities due to, among others, privacy and regulatory reasons. In this work, we address the problem of inferring causal structure - in the form of a directed acyclic graph (DAG) - from a distributed data set that contains both observational and interventional data in a privacy-preserving manner by exchanging updates instead of samples. To this end, we introduce a new federated framework, FED-CD, that enables the discovery of global causal structures both when the set of intervened covariates is the same across decentralized entities, and when the set of intervened covariates are potentially disjoint. We perform a comprehensive experimental evaluation on synthetic data that demonstrates that FED-CD enables effective aggregation of decentralized data for causal discovery without direct sample sharing, even when the contributing distributed data sets cover disjoint sets of interventions. Effective methods for causal discovery in distributed data sets could significantly advance scientific discovery and knowledge sharing in important settings, for instance, healthcare, in which sharing of data across local sites is difficult or prohibited.  ( 2 min )
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues hinder the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.
    Hard Sample Aware Network for Contrastive Deep Graph Clustering. (arXiv:2212.08665v3 [cs.LG] UPDATED)
    Contrastive deep graph clustering, which aims to divide nodes into disjoint groups via contrastive mechanisms, is a challenging research topic. Among recent works, hard sample mining-based algorithms have attracted great attention for their promising performance. However, we find that the existing hard sample mining methods have two problems. 1) In the hardness measurement, the important structural information is overlooked for similarity calculation, degrading the representativeness of the selected hard negative samples. 2) Previous works merely focus on the hard negative sample pairs while neglecting the hard positive sample pairs. Nevertheless, samples within the same cluster but with low similarity should also be carefully learned. To solve these problems, we propose a novel contrastive deep graph clustering method dubbed Hard Sample Aware Network (HSAN) by introducing a comprehensive similarity measure criterion and a general dynamic sample weighting strategy. Concretely, in our algorithm, the similarities between samples are calculated by considering both the attribute embeddings and the structure embeddings, better revealing sample relationships and assisting hardness measurement. Moreover, under the guidance of the carefully collected high-confidence clustering information, our proposed weight modulating function will first recognize the positive and negative samples and then dynamically up-weight the hard sample pairs while down-weighting the easy ones. In this way, our method can mine not only the hard negative samples but also the hard positive samples, thus further improving the discriminative capability of the samples. Extensive experiments and analyses demonstrate the superiority and effectiveness of our proposed method.
    PrivHAR: Recognizing Human Actions From Privacy-preserving Lens. (arXiv:2206.03891v2 [cs.CV] UPDATED)
    The accelerated use of digital cameras prompts an increasing concern about privacy and security, particularly in applications such as action recognition. In this paper, we propose an optimizing framework to provide robust visual privacy protection along the human action recognition pipeline. Our framework parameterizes the camera lens to successfully degrade the quality of the videos to inhibit privacy attributes and protect against adversarial attacks while maintaining relevant features for activity recognition. We validate our approach with extensive simulations and hardware experiments.
    Rethinking skip connection model as a learnable Markov chain. (arXiv:2209.15278v2 [cs.LG] UPDATED)
    In the few years since the birth of ResNet, the skip connection has become the de facto standard for the design of modern architectures due to its widespread adoption, easy optimization and proven performance. Prior work has explained the effectiveness of the skip connection mechanism from different perspectives. In this work, we take a deep dive into the behavior of models with skip connections, which can be formulated as a learnable Markov chain. An efficient Markov chain is preferred as it always maps the input data to the target domain in a better way. However, while a model can be explained as a Markov chain, it is not guaranteed to be optimized following an efficient Markov chain by existing SGD-based optimizers, which are prone to getting trapped in local optima. To move towards a more efficient Markov chain, we propose a simple routine of penal connection to make any residual-like model become a learnable Markov chain. Aside from that, the penal connection can also be viewed as a particular model regularization and can be easily implemented with one line of code in the most popular deep learning frameworks (source code: https://github.com/densechen/penal-connection). The encouraging experimental results in multi-modal translation and image recognition empirically confirm our conjecture of the learnable Markov chain view and demonstrate the superiority of the proposed penal connection.
    What Is Fairness? Philosophical Considerations and Implications For FairML. (arXiv:2205.09622v2 [cs.LG] UPDATED)
    A growing body of literature in fairness-aware ML (fairML) aspires to mitigate machine learning (ML)-related unfairness in automated decision making (ADM) by defining metrics that measure fairness of an ML model and by proposing methods that ensure that trained ML models achieve low values in those measures. However, the underlying concept of fairness, i.e., the question of what fairness is, is rarely discussed, leaving a considerable gap between centuries of philosophical discussion and recent adoption of the concept in the ML community. In this work, we try to bridge this gap by formalizing a consistent concept of fairness and by translating the philosophical considerations into a formal framework for the training and evaluation of ML models in ADM systems. We derive that fairness problems can already arise without the presence of protected attributes, pointing out that fairness and predictive performance are not irreconcilable counterparts, but rather that the latter is necessary to achieve the former. Moreover, we argue why and how causal considerations are necessary when assessing fairness in the presence of protected attributes. We achieve greater linguistic clarity for the discussion of fairML and propose general algorithms for practical applications.
    Sequence Learning using Equilibrium Propagation. (arXiv:2209.09626v2 [cs.NE] UPDATED)
    Equilibrium Propagation (EP) is a powerful and more bio-plausible alternative to conventional learning frameworks such as backpropagation. The effectiveness of EP stems from the fact that it relies only on local computations and requires solely one kind of computational unit during both of its training phases, thereby enabling greater applicability in domains such as bio-inspired neuromorphic computing. The dynamics of the model in EP are governed by an energy function, and the internal states of the model consequently converge to a steady state following the state transition rules defined by the same. However, by definition, EP requires the input to the model (a convergent RNN) to be static in both phases of training. Thus it is not possible to design a model for sequence classification using EP with an LSTM- or GRU-like architecture. In this paper, we leverage recent developments in modern Hopfield networks to further understand energy-based models and develop solutions for complex sequence classification tasks using EP while satisfying its convergence criteria and maintaining its theoretical similarities with recurrent backpropagation. We explore the possibility of integrating modern Hopfield networks as an attention mechanism with convergent RNN models used in EP, thereby extending its applicability for the first time to two different sequence classification tasks in natural language processing, viz. sentiment analysis (IMDB dataset) and natural language inference (SNLI dataset).
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation-style systems. Standard personalization approaches involve learning a user/domain-specific embedding that is fed into a fixed global model, which can be limiting. On the other hand, personalizing/fine-tuning the model itself for each user/domain -- a.k.a. meta-learning -- has high storage/infrastructure cost. We propose a novel meta-learning-style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part, while the sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific fine-tuning of the model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding -- AMHT-LRS -- to learn the low-rank and sparse parts. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal samples. A significant challenge in personalization is ensuring privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that is also equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in the sparse data regime.
    Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization. (arXiv:2212.05088v2 [math.OC] UPDATED)
    Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t. a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees -- variance-reduced or not -- for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of an SVM for 12 of them by up to two orders of magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuous feature data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.
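The root-finding step described above can be sketched as follows: given two training points of opposite classes, search along the segment between them for the point where the difference in conditional probabilities vanishes. This is only an illustration; the logistic `prob_diff` stands in for a trained kernel classifier, bisection is chosen here for simplicity (the abstract does not say which root finder is used), and all names are hypothetical.

```python
import math

def prob_diff(x):
    """Toy stand-in for P(class +1 | x) - P(class -1 | x)."""
    p_pos = 1.0 / (1.0 + math.exp(-(x[0] + x[1] - 1.0)))
    return 2.0 * p_pos - 1.0

def border_point(x_neg, x_pos, diff=prob_diff, tol=1e-9):
    """Bisect along the segment from x_neg to x_pos until diff changes sign."""
    def point(t):
        # Linear interpolation between the two opposite-class samples.
        return [a + t * (b - a) for a, b in zip(x_neg, x_pos)]

    lo, hi = 0.0, 1.0  # assumes diff < 0 at x_neg and diff > 0 at x_pos
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(point(mid)) < 0.0:
            lo = mid
        else:
            hi = mid
    return point(0.5 * (lo + hi))

border = border_point([0.0, 0.0], [2.0, 2.0])
```

Repeating this for many pairs yields a set of border samples; a linear classifier fitted locally at each sample (for example from the gradient of the probability difference) would then give the piecewise linear approximation, though that step is not shown here.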
    Dexterous Robotic Manipulation using Deep Reinforcement Learning and Knowledge Transfer for Complex Sparse Reward-based Tasks. (arXiv:2205.09683v2 [cs.RO] UPDATED)
    This paper describes a deep reinforcement learning (DRL) approach that won Phase 1 of the Real Robot Challenge (RRC) 2021, and then extends this method to a more difficult manipulation task. The RRC consisted of using a TriFinger robot to manipulate a cube along a specified positional trajectory, but with no requirement for the cube to have any specific orientation. We used a relatively simple reward function, a combination of goal-based sparse reward and distance reward, in conjunction with Hindsight Experience Replay (HER) to guide the learning of the DRL agent (Deep Deterministic Policy Gradient (DDPG)). Our approach allowed our agents to acquire dexterous robotic manipulation strategies in simulation. These strategies were then applied to the real robot and outperformed all other competition submissions, including those using more traditional robotic control techniques, in the final evaluation stage of the RRC. Here we extend this method, by modifying the task of Phase 1 of the RRC to require the robot to maintain the cube in a particular orientation, while the cube is moved along the required positional trajectory. The requirement to also orient the cube makes the agent unable to learn the task through blind exploration due to increased problem complexity. To circumvent this issue, we make novel use of a Knowledge Transfer (KT) technique that allows the strategies learned by the agent in the original task (which was agnostic to cube orientation) to be transferred to this task (where orientation matters). KT allowed the agent to learn and perform the extended task in the simulator, which improved the average positional deviation from 0.134 m to 0.02 m, and average orientation deviation from 142° to 76° during evaluation. This KT concept shows good generalisation properties and could be applied to any actor-critic learning algorithm.
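As a rough illustration of a reward that combines a goal-based sparse term with a distance term, consider the sketch below. The exact form used by the authors is not given in the abstract, so the 0/-1 sparse convention, the `tolerance` and `distance_weight` values, and the function name are all assumptions.

```python
import math

def combined_reward(cube_pos, goal_pos, tolerance=0.05, distance_weight=1.0):
    """Sparse goal reward plus a dense distance penalty (both terms hypothetical)."""
    distance = math.dist(cube_pos, goal_pos)
    # Sparse term: 0 inside the tolerance ball around the goal, -1 otherwise.
    sparse = 0.0 if distance <= tolerance else -1.0
    # Dense term: penalize the remaining distance to the goal.
    return sparse - distance_weight * distance

at_goal = combined_reward((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
far_away = combined_reward((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))
```

The sparse term alone gives almost no learning signal before the first success, which is where a distance term and hindsight relabeling of goals would typically help.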
    FETA: Fairness Enforced Verifying, Training, and Predicting Algorithms for Neural Networks. (arXiv:2206.00553v2 [cs.LG] UPDATED)
    Algorithmic decision making driven by neural networks has become very prominent in applications that directly affect people's quality of life. In this paper, we study the problem of verifying, training, and guaranteeing individual fairness of neural network models. A popular approach for enforcing fairness is to translate a fairness notion into constraints over the parameters of the model. However, such a translation does not always guarantee fair predictions of the trained neural network model. To address this challenge, we develop a counterexample-guided post-processing technique to provably enforce fairness constraints at prediction time. Contrary to prior work that enforces fairness only on points around test or train data, we are able to enforce and guarantee fairness on all points in the input domain. Additionally, we propose an in-processing technique to use fairness as an inductive bias by iteratively incorporating fairness counterexamples in the learning process. We have implemented these techniques in a tool called FETA. Empirical evaluation on real-world datasets indicates that FETA is not only able to guarantee fairness on-the-fly at prediction time but also is able to train accurate models exhibiting a much higher degree of individual fairness.
    Context-Aware Differential Privacy for Language Modeling. (arXiv:2301.12288v1 [cs.LG])
    The remarkable ability of language models (LMs) has also brought challenges at the interface of AI and security. A critical challenge pertains to how much information these models retain and leak about the training data. This is particularly urgent as the typical development of LMs relies on huge, often highly sensitive data, such as emails and chat logs. To address this shortcoming, this paper introduces Context-Aware Differentially Private Language Model (CADP-LM), a privacy-preserving LM framework that relies on two key insights: First, it utilizes the notion of context to define and audit the potentially sensitive information. Second, it adopts the notion of Differential Privacy to protect sensitive information and characterize the privacy leakage. A unique characteristic of CADP-LM is its ability to target the protection of sensitive sentences and contexts only, providing a highly accurate private model. Experiments on a variety of datasets and settings demonstrate these strengths of CADP-LM.
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v4 [cs.LG] UPDATED)
    Neural networks (NNs) are known to be vulnerable to adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noise from a certain distribution to certify the robustness of a smoothed classifier. However, as shown by previous work, the certified robust radius in randomized smoothing suffers when scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, implying that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
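For readers unfamiliar with randomized smoothing, the sketch below shows only the basic prediction step that certification builds on: a Monte Carlo majority vote of a base classifier under Gaussian input noise. It is not DSRS; the second smoothing distribution and the certificate computation are not shown, and the toy `base_classifier`, the function names, and the parameter values are assumptions.

```python
import random
from collections import Counter

def base_classifier(x):
    """Toy base classifier: label 1 if the coordinates sum to a positive value."""
    return 1 if sum(x) > 0.0 else 0

def smoothed_predict(x, sigma=0.5, num_samples=1000, seed=0):
    """Majority vote of the base classifier over Gaussian perturbations of x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_samples):
        # Perturb every coordinate with independent N(0, sigma^2) noise.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    # The empirical vote share estimates the top-class probability under noise.
    return label, count / num_samples

label, vote_share = smoothed_predict([1.0, 1.0])
```

A certification procedure would turn a confidence bound on this vote share into a robust radius; according to the abstract, DSRS tightens that bound by also sampling the classifier under an additional smoothing distribution.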
    Composing Task Knowledge with Modular Successor Feature Approximators. (arXiv:2301.12305v1 [cs.LG])
    Recently, the Successor Features and Generalized Policy Improvement (SF&GPI) framework has been proposed as a method for learning, composing, and transferring predictive knowledge and behavior. SF&GPI works by having an agent learn predictive representations (SFs) that can be combined for transfer to new tasks with GPI. However, to be effective this approach requires state features that are useful to predict, and these state-features are typically hand-designed. In this work, we present a novel neural network architecture, "Modular Successor Feature Approximators" (MSFA), where modules both discover what is useful to predict, and learn their own predictive representations. We show that MSFA is able to better generalize compared to baseline architectures for learning SFs and modular architectures
    Noisy intermediate-scale quantum algorithm for semidefinite programming. (arXiv:2106.03891v3 [quant-ph] UPDATED)
    Semidefinite programs (SDPs) are convex optimization programs with vast applications in control theory, quantum information, combinatorial optimization and operational research. Noisy intermediate-scale quantum (NISQ) algorithms aim to make efficient use of the current generation of quantum hardware. However, optimizing variational quantum algorithms is a challenge as it is an NP-hard problem that in general requires exponential time to solve and can contain many far from optimal local minima. Here, we present a current term NISQ algorithm for solving SDPs. The classical optimization program of our NISQ solver is another SDP over a lower dimensional ansatz space. We harness the SDP based formulation of the Hamiltonian ground state problem to design a NISQ eigensolver. Unlike variational quantum eigensolvers, the classical optimization program of our eigensolver is convex, can be solved in polynomial time with the number of ansatz parameters and every local minimum is a global minimum. We find numeric evidence that NISQ SDP can improve the estimation of ground state energies in a scalable manner. Further, we efficiently solve constrained problems to calculate the excited states of Hamiltonians, find the lowest energy of symmetry constrained Hamiltonians and determine the optimal measurements for quantum state discrimination. We demonstrate the potential of our approach by finding the largest eigenvalue of up to $2^{1000}$ dimensional matrices and solving graph problems related to quantum contextuality. We also discuss NISQ algorithms for rank-constrained SDPs. Our work extends the application of NISQ computers onto one of the most successful algorithmic frameworks of the past few decades.
    Team Resilience under Shock: An Empirical Analysis of GitHub Repositories during Early COVID-19 Pandemic. (arXiv:2301.12326v1 [cs.LG])
    While many organizations have shifted to working remotely during the COVID-19 pandemic, how the remote workforce and the remote teams are influenced by and would respond to this and future shocks remain largely unknown. Software developers have relied on remote collaborations long before the pandemic, working in virtual teams (GitHub repositories). The dynamics of these repositories through the pandemic provide a unique opportunity to understand how remote teams react under shock. This work presents a systematic analysis. We measure the overall effect of the early pandemic on public GitHub repositories by comparing their sizes and productivity with the counterfactual outcomes forecasted as if there were no pandemic. We find that the productivity level and the number of active members of these teams vary significantly during different periods of the pandemic. We then conduct a finer-grained investigation and study the heterogeneous effects of the shock on individual teams. We find that the resilience of a team is highly correlated to certain properties of the team before the pandemic. Through a bootstrapped regression analysis, we reveal which types of teams are robust or fragile to the shock.
    Online Allocation Problem with Two-sided Resource Constraints. (arXiv:2112.13964v3 [cs.LG] UPDATED)
    In this paper, we investigate the online allocation problem of maximizing the overall revenue subject to both lower and upper bound constraints. Compared to the extensively studied online problems with only resource upper bounds, the two-sided constraints affect the prospects of resource consumption more severely. As a result, only limited violations of constraints or pessimistic competitive bounds could be guaranteed. To tackle the challenge, we define a measure of feasibility $\xi^*$ to evaluate the hardness of this problem, and estimate this measurement by an optimization routine with theoretical guarantees. We propose an online algorithm adopting a constructive framework, where we initialize a threshold price vector using the estimation, then dynamically update the price vector and use it for decision-making at each step. It can be shown that the proposed algorithm is $\big(1-O(\frac{\varepsilon}{\xi^*-\varepsilon})\big)$ or $\big(1-O(\frac{\varepsilon}{\xi^*-\sqrt{\varepsilon}})\big)$ competitive with high probability for $\xi^*$ known or unknown respectively. To the best of our knowledge, this is the first result establishing a nearly optimal competitive algorithm for solving two-sided constrained online allocation problems with a high probability of feasibility.
    Concept-based Explanations for Out-Of-Distribution Detectors. (arXiv:2203.02586v2 [cs.LG] UPDATED)
    Out-of-distribution (OOD) detection plays a crucial role in ensuring the safe deployment of deep neural network (DNN) classifiers. While a myriad of methods have focused on improving the performance of OOD detectors, a critical gap remains in interpreting their decisions. We help bridge this gap by providing explanations for OOD detectors based on learned high-level concepts. We first propose two new metrics for assessing the effectiveness of a particular set of concepts for explaining OOD detectors: 1) detection completeness, which quantifies the sufficiency of concepts for explaining an OOD-detector's decisions, and 2) concept separability, which captures the distributional separation between in-distribution and OOD data in the concept space. Based on these metrics, we propose a framework for learning a set of concepts that satisfy the desired properties of detection completeness and concept separability and demonstrate the framework's effectiveness in providing concept-based explanations for diverse OOD techniques. We also show how to identify prominent concepts that contribute to the detection results via a modified Shapley value-based importance score.
    Federated Learning in Satellite Constellations. (arXiv:2206.00307v2 [cs.IT] UPDATED)
    Federated learning (FL) has recently emerged as a distributed machine learning paradigm for systems with limited and intermittent connectivity. This paper presents the new context brought to FL by satellite constellations, where the connectivity patterns are significantly different from the ones observed in conventional terrestrial FL. The focus is on large constellations in low earth orbit (LEO), where each satellite participates in a data-driven FL task using a locally stored dataset. This scenario is motivated by the trend towards mega constellations of interconnected small satellites in LEO and the integration of artificial intelligence in satellites. We propose a classification of satellite FL based on the communication capabilities of the satellites, the constellation design, and the location of the parameter server. A comprehensive overview of the current state-of-the-art in this field is provided and the unique challenges and opportunities of satellite FL are discussed. Finally, we outline several open research directions for FL in satellite constellations and present some future perspectives on this topic.
    Adversarial Learning Networks: Source-free Unsupervised Domain Incremental Learning. (arXiv:2301.12054v1 [cs.LG])
    This work presents an approach for incrementally updating deep neural network (DNN) models in a non-stationary environment. DNN models are sensitive to changes in input data distribution, which limits their application to problem settings with stationary input datasets. In a non-stationary environment, updating a DNN model requires parameter re-training or model fine-tuning. We propose an unsupervised source-free method to update DNN classification models. The contributions of this work are two-fold. First, we use trainable Gaussian prototypes to generate representative samples for future iterations; second, using unsupervised domain adaptation, we incrementally adapt the existing model using unlabelled data. Unlike existing methods, our approach can update a DNN model incrementally for non-stationary source and target tasks without storing past training data. We evaluated our work on incremental sentiment prediction and incremental disease prediction applications and compared our approach to state-of-the-art continual learning, domain adaptation, and ensemble learning methods. Our results show that our approach achieved improved performance compared to existing incremental learning methods. We observe minimal forgetting of past knowledge over many iterations, which can help us develop unsupervised self-learning systems.
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.
    Continual Learning by Modeling Intra-Class Variation. (arXiv:2210.05398v2 [cs.LG] UPDATED)
    It has been observed that neural networks perform poorly when the data or tasks are presented sequentially. Unlike humans, neural networks suffer greatly from catastrophic forgetting, making it impossible to perform life-long learning. To address this issue, memory-based continual learning has been actively studied and stands out as one of the best-performing methods. We examine memory-based continual learning and identify that large variation in the representation space is crucial for avoiding catastrophic forgetting. Motivated by this, we propose to diversify representations by using two types of perturbations: model-agnostic variation (i.e., the variation is generated without the knowledge of the learned neural network) and model-based variation (i.e., the variation is conditioned on the learned neural network). We demonstrate that enlarging representational variation serves as a general principle to improve continual learning. Finally, we perform empirical studies which demonstrate that our method, as a simple plug-and-play component, can consistently improve a number of memory-based continual learning methods by a large margin.
    Applications of Generative Adversarial Networks in Neuroimaging and Clinical Neuroscience. (arXiv:2206.07081v2 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) are one powerful type of deep learning models that have been successfully utilized in numerous fields. They belong to a broader family called generative methods, which generate new data with a probabilistic model by learning sample distribution from real examples. In the clinical context, GANs have shown enhanced capabilities in capturing spatially complex, nonlinear, and potentially subtle disease effects compared to traditional generative methods. This review appraises the existing literature on the applications of GANs in imaging studies of various neurological conditions, including Alzheimer's disease, brain tumors, brain aging, and multiple sclerosis. We provide an intuitive explanation of various GAN methods for each application and further discuss the main challenges, open questions, and promising future directions of leveraging GANs in neuroimaging. We aim to bridge the gap between advanced deep learning methods and neurology research by highlighting how GANs can be leveraged to support clinical decision making and contribute to a better understanding of the structural and functional patterns of brain diseases.
    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of works has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of works, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main distinctiveness from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore different singular and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.
    Progressive Prompts: Continual Learning for Language Models. (arXiv:2301.12314v1 [cs.CL])
    We introduce Progressive Prompts - a simple and efficient approach for continual learning in language models. Our method allows forward transfer and resists catastrophic forgetting, without relying on data replay or a large number of task-specific parameters. Progressive Prompts learns a new soft prompt for each task and sequentially concatenates it with the previously learned prompts, while keeping the base model frozen. Experiments on standard continual learning benchmarks show that our approach outperforms state-of-the-art methods, with an improvement >20% in average test accuracy over the previous best-performing method on the T5 model. We also explore a more challenging continual learning setup with longer sequences of tasks and show that Progressive Prompts significantly outperforms prior methods.
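    The prompt concatenation described in this abstract can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the authors' implementation: the class name `ProgressivePrompts`, the zero initialization, and the ordering of the concatenated prompts are all hypothetical, and the frozen base model is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class ProgressivePrompts:
    """Toy container holding one soft prompt (a list of vectors) per task."""
    prompt_len: int
    embed_dim: int
    prompts: list = field(default_factory=list)  # prompts[k] belongs to task k

    def add_task(self):
        """Create a fresh prompt for a new task (zeros here, an assumption)."""
        new_prompt = [[0.0] * self.embed_dim for _ in range(self.prompt_len)]
        self.prompts.append(new_prompt)
        return len(self.prompts) - 1

    def trainable(self, task_id):
        """Only the prompt of the task being learned would receive updates."""
        return self.prompts[task_id]

    def build_input(self, task_id, token_embeddings):
        """Prepend the prompts of tasks 0..task_id to the input embeddings.

        The ordering of the prompts is an assumption made for illustration.
        """
        prefix = [vec for p in self.prompts[: task_id + 1] for vec in p]
        return prefix + token_embeddings


pp = ProgressivePrompts(prompt_len=2, embed_dim=3)
for _ in range(3):
    pp.add_task()
tokens = [[1.0, 1.0, 1.0]] * 4          # four hypothetical token embeddings
model_input = pp.build_input(task_id=1, token_embeddings=tokens)
```

    In this sketch the input for task 1 carries the prompts of tasks 0 and 1, so the sequence grows by `prompt_len` vectors per learned task while earlier prompts stay untouched.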
    Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows. (arXiv:2211.11096v2 [cs.LG] UPDATED)
    Offline reinforcement learning aims to train a policy on a pre-recorded and fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the value of state-action pairs not well-covered by the training data and (2) distributional shift between behavior and inference policies. One way to tackle these problems is to induce conservatism - i.e., keeping the learned policies closer to the behavioral ones. To achieve this, we build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model, which we use as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, and then an additional policy model - controller in the latent space - is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of datasets.
    Laplacian-based Semi-Supervised Learning in Multilayer Hypergraphs by Coordinate Descent. (arXiv:2301.12184v1 [cs.LG])
    Graph Semi-Supervised learning is an important data analysis tool, where given a graph and a set of labeled nodes, the aim is to infer the labels to the remaining unlabeled nodes. In this paper, we start by considering an optimization-based formulation of the problem for an undirected graph, and then we extend this formulation to multilayer hypergraphs. We solve the problem using different coordinate descent approaches and compare the results with the ones obtained by the classic gradient descent method. Experiments on synthetic and real-world datasets show the potential of using coordinate descent methods with suitable selection rules.
    Deep Metric Learning with Chance Constraints. (arXiv:2209.09060v2 [cs.CV] UPDATED)
    Deep metric learning (DML) aims to minimize empirical expected loss of the pairwise intra-/inter- class proximity violations in the embedding image. We relate DML to feasibility problem of finite chance constraints. We show that minimizer of proxy-based DML satisfies certain chance constraints, and that the worst case generalization performance of the proxy-based methods can be characterized by the radius of the smallest ball around a class proxy to cover the entire domain of the corresponding class samples, suggesting multiple proxies per class helps performance. To provide a scalable algorithm as well as exploiting more proxies, we consider the chance constraints implied by the minimizers of proxy-based DML instances and reformulate DML as finding a feasible point in intersection of such constraints, resulting in a problem to be approximately solved by iterative projections. Simply put, we repeatedly train a regularized proxy-based loss and re-initialize the proxies with the embeddings of the deliberately selected new samples. We apply our method with the well-accepted losses and evaluate on four popular benchmark datasets for image retrieval. Outperforming state-of-the-art, our method consistently improves the performance of the applied losses. Code is available at: https://github.com/yetigurbuz/ccp-dml
    TemporAI: Facilitating Machine Learning Innovation in Time Domain Tasks for Medicine. (arXiv:2301.12260v1 [cs.LG])
    TemporAI is an open source Python software library for machine learning (ML) tasks involving data with a time component, focused on medicine and healthcare use cases. It supports data in time series, static, and event modalities and provides an interface for prediction, causal inference, and time-to-event analysis, as well as common preprocessing utilities and model interpretability methods. The library aims to facilitate innovation in the medical ML space by offering a standardized temporal setting toolkit for model development, prototyping and benchmarking, bridging the gaps in the ML research, healthcare professional, medical/pharmacological industry, and data science communities. TemporAI is available on GitHub (https://github.com/vanderschaarlab/temporai) and we welcome community engagement through use, feedback, and code contributions.
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows to handle predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way for using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied under the lenses of empirical macro-finance.
    Understanding Hindsight Goal Relabeling from a Divergence Minimization Perspective. (arXiv:2209.13046v2 [cs.LG] UPDATED)
    Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
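    The idea that any trajectory is a demonstration for reaching its own final state can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the function name and the transition layout are assumptions, and the (-1, 0) reward convention is taken from the abstract's last sentence.

```python
def relabel_with_final_state(trajectory):
    """Relabel a trajectory so that its final state becomes the goal.

    trajectory: list of (state, action, next_state) tuples.
    Returns a list of (state, action, goal, reward, next_state) tuples,
    using a (-1, 0) reward: 0 when the goal is reached, -1 otherwise.
    """
    if not trajectory:
        return []
    goal = trajectory[-1][2]  # final state reached by the agent
    relabeled = []
    for state, action, next_state in trajectory:
        reward = 0 if next_state == goal else -1
        relabeled.append((state, action, goal, reward, next_state))
    return relabeled


# A hypothetical three-step walk along a line of integer states.
walk = [(0, "right", 1), (1, "right", 2), (2, "right", 3)]
transitions = relabel_with_final_state(walk)
```

    Every relabeled transition is then a valid training example for the goal-conditioned policy or Q-function, even if the original goal was never reached.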
    Does Federated Learning Really Need Backpropagation? (arXiv:2301.12195v1 [cs.LG])
    Federated learning (FL) is a general principle for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. Empirically we use BAFFLE to train deep models from scratch or to finetune pretrained models, achieving acceptable results. Code is available in https://github.com/FengHZ/BAFFLE.
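    To make "multiple forward processes to estimate gradients" concrete, here is a generic zeroth-order estimator that uses only loss evaluations. It is a sketch under assumptions: the function names, the Gaussian perturbations, and the central-difference form are illustrative choices and are not claimed to match the estimator used in BAFFLE.

```python
import random


def forward_only_gradient(loss, w, num_samples=4000, sigma=1e-3, seed=0):
    """Estimate the gradient of `loss` at `w` from forward evaluations only.

    For each random direction delta, the scalar
    (loss(w + sigma*delta) - loss(w - sigma*delta)) / (2*sigma)
    approximates the directional derivative; averaging scalar * delta over
    Gaussian directions gives an estimate of the gradient.
    """
    rng = random.Random(seed)
    grad = [0.0] * len(w)
    for _ in range(num_samples):
        delta = [rng.gauss(0.0, 1.0) for _ in w]
        plus = loss([wi + sigma * di for wi, di in zip(w, delta)])
        minus = loss([wi - sigma * di for wi, di in zip(w, delta)])
        scalar = (plus - minus) / (2.0 * sigma)  # a client would return this
        for i, di in enumerate(delta):
            grad[i] += scalar * di / num_samples
    return grad


def quadratic(w, target=(1.0, -2.0, 0.5)):
    """Toy loss whose exact gradient is 2 * (w - target)."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


estimate = forward_only_gradient(quadratic, [0.0, 0.0, 0.0])
true_gradient = [-2.0, 4.0, -1.0]
```

    In a federated setting of this kind, a client would only need to report the scalars, and the server could regenerate the perturbations from a shared seed; that division of labor is an assumption of this sketch. The estimate is noisy, and its accuracy improves with the number of forward passes.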
    Learning Effective SDEs from Brownian Dynamics Simulations of Colloidal Particles. (arXiv:2205.00286v3 [math.DS] UPDATED)
    We construct a reduced, data-driven, parameter dependent effective Stochastic Differential Equation (eSDE) for electric-field mediated colloidal crystallization using data obtained from Brownian Dynamics Simulations. We use Diffusion Maps (a manifold learning algorithm) to identify a set of useful latent observables. In this latent space we identify an eSDE using a deep learning architecture inspired by numerical stochastic integrators and compare it with the traditional Kramers-Moyal expansion estimation. We show that the obtained variables and the learned dynamics accurately encode the physics of the Brownian Dynamic Simulations. We further illustrate that our reduced model captures the dynamics of corresponding experimental data. Our dimension reduction/reduced model identification approach can be easily ported to a broad class of particle systems dynamics experiments/models.
    Hierarchical clustering: visualization, feature importance and model selection. (arXiv:2112.01372v2 [stat.ME] UPDATED)
    We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lead to loss of information since they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods, instead, use the full structure of the dendrogram. The key insight behind the proposed methods is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of a tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-art approaches. We provide an R package that implements our methods.  ( 2 min )
    ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs. (arXiv:2210.03052v2 [cs.LG] UPDATED)
    Transformer is the cornerstone model of Natural Language Processing (NLP) over the past decade. Despite its great success in Deep Learning (DL) applications, the increasingly growing parameter space required by transformer models boosts the demand on accelerating the performance of transformer models. In addition, NLP problems can commonly be faced with variable-length sequences since their word numbers can vary among sentences. Existing DL frameworks need to pad variable-length sequences to the maximal length, which, however, leads to significant memory and computational overhead. In this paper, we present ByteTransformer, a high-performance transformer boosted for variable-length inputs. We propose a zero padding algorithm that enables the whole transformer to be free from redundant computations on useless padded tokens. Besides the algorithmic level optimization, we provide architectural-aware optimizations for transformer functioning modules, especially the performance-critical algorithm, multi-head attention (MHA). Experimental results on an NVIDIA A100 GPU with variable-length sequence inputs validate that our fused MHA (FMHA) outperforms the standard PyTorch MHA by 6.13X. The end-to-end performance of ByteTransformer for a standard BERT transformer model surpasses the state-of-the-art Transformer frameworks, such as PyTorch JIT, TensorFlow XLA, Tencent TurboTransformer and NVIDIA FasterTransformer, by 87%, 131%, 138% and 46%, respectively.  ( 2 min )
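    The general idea of avoiding computation on padded tokens can be sketched by packing the valid tokens of a batch into one contiguous buffer and keeping per-sequence offsets. This is a plain-Python illustration only: the function names `pack` and `unpack` are hypothetical, and the real system works with GPU kernels whose details are not given in the abstract.

```python
def pack(padded, lengths):
    """Drop padding: concatenate the first `n` tokens of every row.

    Returns the packed token list and prefix-sum offsets, where
    sequence i occupies packed[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    packed = [tok for row, n in zip(padded, lengths) for tok in row[:n]]
    return packed, offsets


def unpack(packed, offsets, max_len, pad=0):
    """Restore the padded layout, e.g. for an operator that requires it."""
    rows = []
    for start, end in zip(offsets, offsets[1:]):
        row = packed[start:end]
        rows.append(row + [pad] * (max_len - len(row)))
    return rows


batch = [[5, 6, 0, 0], [7, 0, 0, 0], [8, 9, 10, 11]]  # 0 is the pad token
lengths = [2, 1, 4]
packed, offsets = pack(batch, lengths)
restored = unpack(packed, offsets, max_len=4)
```

    Token-wise layers (for example feed-forward blocks) can then run over 7 packed tokens instead of 12 padded slots in this toy batch, while the offsets tell attention which tokens belong to the same sequence.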
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
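    For readers unfamiliar with exponentiated gradient in online portfolio selection, the basic multiplicative update can be sketched as below. This shows the plain EG step for the loss $-\log(w \cdot x)$, not the $\widetilde{\text{EG}}$ variant analyzed in the paper; the function name, the step size `eta`, and the toy price relatives are assumptions made for illustration.

```python
import math


def eg_update(weights, price_relatives, eta=0.05):
    """One exponentiated-gradient step for the loss -log(w . x).

    The gradient of the loss with respect to w_i is -x_i / (w . x), so each
    weight is multiplied by exp(eta * x_i / (w . x)) and the result is
    renormalized onto the probability simplex.
    """
    portfolio_return = sum(w * x for w, x in zip(weights, price_relatives))
    unnormalized = [
        w * math.exp(eta * x / portfolio_return)
        for w, x in zip(weights, price_relatives)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Two hypothetical assets: the first gains 10% per round, the second loses 10%.
weights = [0.5, 0.5]
for _ in range(50):
    weights = eg_update(weights, [1.1, 0.9])
```

    The weights remain a valid portfolio after every step, and capital drifts toward the asset with the larger price relatives.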
    Discovering Limitations of Image Quality Assessments with Noised Deep Learning Image Sets. (arXiv:2210.10249v2 [cs.CV] UPDATED)
    Image quality is important, and can affect overall performance in image processing and computer vision as well as for numerous other reasons. Image quality assessment (IQA) is consequently a vital task in different applications from aerial photography interpretation to object detection to medical image analysis. In previous research, the BRISQUE algorithm and the PSNR algorithm were evaluated with high resolution (at least 512x384 pixels), but relatively small image sets (no more than 4,744 images). However, scientists have not evaluated IQA algorithms on low resolution (no more than 32x32 pixels), multi-perturbation, big image sets (for example, at least 60,000 different images not counting their perturbations). This study explores these two IQA algorithms through experimental investigation. We first chose two deep learning image sets, CIFAR-10 and MNIST. Then, we added 68 perturbations that add noise to the images in specific sequences and noise intensities. In addition, we tracked the performance outputs of the two IQA algorithms with singly and multiply noised images. After quantitatively analyzing experimental results, we report the limitations of the two IQAs with these noised CIFAR-10 and MNIST image sets. We also explain three potential root causes for performance degradation. These findings point out weaknesses of the two IQA algorithms. The research results provide guidance to scientists and engineers developing accurate, robust IQA algorithms. All source codes, related image sets, and figures are shared on the website (https://github.com/caperock/imagequality) to support future scientific and industrial projects.  ( 2 min )
    Policy-Adaptive Estimator Selection for Off-Policy Evaluation. (arXiv:2211.13904v2 [cs.LG] UPDATED)
    Off-policy evaluation (OPE) aims to accurately evaluate the performance of counterfactual policies using only offline logged data. Although many estimators have been developed, there is no single estimator that dominates the others, because the estimators' accuracy can vary greatly depending on a given OPE task such as the evaluation policy, number of actions, and noise level. Thus, the data-driven estimator selection problem is becoming increasingly important and can have a significant impact on the accuracy of OPE. However, identifying the most accurate estimator using only the logged data is quite challenging because the ground-truth estimation accuracy of estimators is generally unavailable. This paper studies this challenging problem of estimator selection for OPE for the first time. In particular, we enable an estimator selection that is adaptive to a given OPE task, by appropriately subsampling available logged data and constructing pseudo policies useful for the underlying estimator selection task. Comprehensive experiments on both synthetic and real-world company data demonstrate that the proposed procedure substantially improves the estimator selection compared to a non-adaptive heuristic.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    Quantum Machine Learning for Decentralized Quantum Protocols with Local Operations and Noisy Classical Communications. (arXiv:2207.11354v2 [quant-ph] UPDATED)
    Distributed quantum information processing protocols such as quantum entanglement distillation and quantum state discrimination rely on local operations and classical communications (LOCC). Existing LOCC-based protocols typically assume the availability of ideal, noiseless, communication channels. In this paper, we study the case in which classical communication takes place over noisy channels, and we propose to address the design of LOCC protocols in this setting via the use of quantum machine learning tools. We specifically focus on the important tasks of quantum entanglement distillation and quantum state discrimination, and implement local processing through parameterized quantum circuits (PQCs) that are optimized to maximize the average fidelity and average success probability in the respective tasks, while accounting for communication errors. The introduced approach, Noise Aware-LOCCNet (NA-LOCCNet), is shown to have significant advantages over existing protocols designed for noiseless communications.  ( 2 min )
    Efficient Enumeration of Markov Equivalent DAGs. (arXiv:2301.12212v1 [cs.AI])
    Enumerating the directed acyclic graphs (DAGs) of a Markov equivalence class (MEC) is an important primitive in causal analysis. The central resource from the perspective of computational complexity is the delay, that is, the time an algorithm that lists all members of the class requires between two consecutive outputs. Commonly used algorithms for this task utilize the rules proposed by Meek (1995) or the transformational characterization by Chickering (1995), both resulting in superlinear delay. In this paper, we present the first linear-time delay algorithm. On the theoretical side, we show that our algorithm can be generalized to enumerate DAGs represented by models that incorporate background knowledge, such as MPDAGs; on the practical side, we provide an efficient implementation and evaluate it in a series of experiments. Complementary to the linear-time delay algorithm, we also provide intriguing insights into Markov equivalence itself: All members of an MEC can be enumerated such that two successive DAGs have structural Hamming distance at most three.  ( 2 min )
    Large Language Models are Zero-Shot Reasoners. (arXiv:2205.11916v4 [cs.CL] UPDATED)
    Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.  ( 3 min )
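    The prompting idea in the abstract above can be illustrated with a minimal sketch. Only the trigger sentence comes from the abstract; the `Q:`/`A:` template and the function name `zero_shot_cot_prompt` are assumptions made for illustration, and no model is called.

```python
# Trigger sentence quoted in the abstract.
TRIGGER = "Let's think step by step."


def zero_shot_cot_prompt(question: str) -> str:
    """Build a single-template prompt that places the trigger where the answer begins.

    The "Q: ... / A: ..." layout is an assumed template, not necessarily the
    one used by the authors.
    """
    return f"Q: {question.strip()}\nA: {TRIGGER}"


if __name__ == "__main__":
    print(zero_shot_cot_prompt("A shop has 3 boxes with 4 pens each. How many pens are there?"))
```

    The same template is reused for every task; no task-specific exemplars are added to the prompt.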
    Learning Locality and Isotropy in Dialogue Modeling. (arXiv:2205.14583v2 [cs.CL] UPDATED)
    Existing dialogue modeling methods have achieved promising performance on various dialogue tasks with the aid of Transformer and the large-scale pre-trained language models. However, some recent studies revealed that the context representations produced by these methods suffer from the problem of anisotropy. In this paper, we find that the generated representations are also not conversational, losing the conversation structure information during the context modeling stage. To this end, we identify two properties in dialogue modeling, i.e., locality and isotropy, and present a simple method for dialogue representation calibration, namely SimDRC, to build isotropic and conversational feature spaces. Experimental results show that our approach significantly outperforms the current state-of-the-art models on three dialogue tasks across the automatic and human evaluation metrics. More in-depth analyses further confirm the effectiveness of our proposed approach.
    Scalable and Equivariant Spherical CNNs by Discrete-Continuous (DISCO) Convolutions. (arXiv:2209.13603v3 [cs.CV] UPDATED)
    No existing spherical convolutional neural network (CNN) framework is both computationally scalable and rotationally equivariant. Continuous approaches capture rotational equivariance but are often prohibitively computationally demanding. Discrete approaches offer more favorable computational performance but at the cost of equivariance. We develop a hybrid discrete-continuous (DISCO) group convolution that is simultaneously equivariant and computationally scalable to high-resolution. While our framework can be applied to any compact group, we specialize to the sphere. Our DISCO spherical convolutions exhibit $\text{SO}(3)$ rotational equivariance, where $\text{SO}(n)$ is the special orthogonal group representing rotations in $n$-dimensions. When restricting rotations of the convolution to the quotient space $\text{SO}(3)/\text{SO}(2)$ for further computational enhancements, we recover a form of asymptotic $\text{SO}(3)$ rotational equivariance. Through a sparse tensor implementation we achieve linear scaling in number of pixels on the sphere for both computational cost and memory usage. For 4k spherical images we realize a saving of $10^9$ in computational cost and $10^4$ in memory usage when compared to the most efficient alternative equivariant spherical convolution. We apply the DISCO spherical CNN framework to a number of benchmark dense-prediction problems on the sphere, such as semantic segmentation and depth estimation, on all of which we achieve the state-of-the-art performance.  ( 2 min )
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v6 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained "networks of networks" can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.  ( 2 min )
    A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments. (arXiv:2209.10866v3 [cs.LG] UPDATED)
    The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments in which users obtain data from one of $K$ different data distributions. In the proposed setup, the grouping of users based on the data distributions they sample, as well as the underlying statistical properties of the distributions, are a priori unknown. A family of One-shot Distributed Clustered Learning methods (ODCL-$\mathcal{C}$) is proposed, parametrized by the set of admissible clustering algorithms $\mathcal{C}$, with the objective of learning the true model at each user. The admissible clustering methods include $K$-means (KM) and convex clustering (CC), giving rise to various one-shot methods within the proposed family, such as ODCL-KM and ODCL-CC. The proposed one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees. In particular, for strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size. An explicit characterization of the threshold is provided in terms of the problem parameters. Numerical experiments illustrate the findings and corroborate the performance of the proposed methods. We also highlight the trade-offs with respect to selecting various clustering methods (ODCL-CC, ODCL-KM) and demonstrate significant improvements over state-of-the-art.  ( 2 min )
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, in the field of differential privacy (DP), random noise is injected equally when training with different data instances. In this paper, we first give sharper excess risk bounds for the DP stochastic gradient descent (SGD) method. Since most previous methods assume convexity, we relax this assumption using the Polyak-Łojasiewicz condition. Then, after observing that different training data instances affect the machine learning model to different extents, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a 'Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not every distribution for selecting random weights is suitable for building a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.
    Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. (arXiv:2301.12222v1 [cs.LG])
    Cancer is a term that denotes a group of diseases caused by abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic Acid (DNA) microarrays and Ribonucleic Acid (RNA)-sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the application of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, reviewed are pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
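    To make the contrast between average-case, worst-case, and quantile objectives concrete, here is a minimal sketch that scores one predictor from its per-domain risks. The plain empirical quantile with linear interpolation is a stand-in chosen for illustration and is not necessarily the estimator used by EQRM; the function name and the example risk values are hypothetical.

```python
def empirical_quantile(values, alpha):
    """Empirical alpha-quantile (0 <= alpha <= 1) with linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    xs = sorted(values)
    pos = alpha * (len(xs) - 1)        # fractional index into the sorted risks
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


if __name__ == "__main__":
    # Hypothetical risks of a single predictor on five training domains.
    domain_risks = [0.1, 0.2, 0.3, 0.4, 0.9]
    print("average case:", sum(domain_risks) / len(domain_risks))
    print("worst case:  ", max(domain_risks))
    print("0.75-quantile:", empirical_quantile(domain_risks, 0.75))
```

    Under this sketch, setting `alpha` to 1 recovers the worst-case objective, while smaller values ignore the highest-risk domains.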
    Practical Differentially Private Hyperparameter Tuning with Subsampling. (arXiv:2301.11989v1 [cs.LG])
    Tuning all the hyperparameters of differentially private (DP) machine learning (ML) algorithms often requires use of sensitive data and this may leak private information via hyperparameter values. Recently, Papernot and Steinke (2022) proposed a certain class of DP hyperparameter tuning algorithms, where the number of random search samples is randomized itself. Commonly, these algorithms still considerably increase the DP privacy parameter $\varepsilon$ over non-tuned DP ML model training and can be computationally heavy as evaluating each hyperparameter candidate requires a new training run. We focus on lowering both the DP bounds and the computational complexity of these methods by using only a random subset of the sensitive data for the hyperparameter tuning and by extrapolating the optimal values from the small dataset to a larger dataset. We provide a Rényi differential privacy analysis for the proposed method and experimentally show that it consistently leads to better privacy-utility trade-off than the baseline method by Papernot and Steinke (2022).
    (Private) Kernelized Bandits with Distributed Biased Feedback. (arXiv:2301.12061v1 [cs.LG])
    In this paper, we study kernelized bandits with distributed biased feedback. This problem is motivated by several real-world applications (such as dynamic pricing, cellular network configuration, and policy making), where users from a large population contribute to the reward of the action chosen by a central entity, but it is difficult to collect feedback from all users. Instead, only biased feedback (due to user heterogeneity) from a subset of users may be available. In addition to such partial biased feedback, we are also faced with two practical challenges due to communication cost and computation complexity. To tackle these challenges, we carefully design a new distributed phase-then-batch-based elimination (DPBE) algorithm, which samples users in phases for collecting feedback to reduce the bias and employs maximum variance reduction to select actions in batches within each phase. By properly choosing the phase length, the batch size, and the confidence width used for eliminating suboptimal actions, we show that DPBE achieves a sublinear regret of $\tilde{O}(T^{1-\alpha/2}+\sqrt{\gamma_T T})$, where $\alpha\in (0,1)$ is the user-sampling parameter one can tune. Moreover, DPBE can significantly reduce both communication cost and computation complexity in distributed kernelized bandits, compared to some variants of the state-of-the-art algorithms (originally developed for standard kernelized bandits). Furthermore, by incorporating various differential privacy models (including the central, local, and shuffle models), we generalize DPBE to provide privacy guarantees for users participating in the distributed learning process. Finally, we conduct extensive simulations to validate our theoretical results and evaluate the empirical performance.
    SaFormer: A Conditional Sequence Modeling Approach to Offline Safe Reinforcement Learning. (arXiv:2301.12203v1 [cs.LG])
    Offline safe RL is of great practical relevance for deploying agents in real-world applications. However, acquiring constraint-satisfying policies from the fixed dataset is non-trivial for conventional approaches. Even worse, the learned constraints are stationary and may become invalid when the online safety requirement changes. In this paper, we present a novel offline safe RL approach referred to as SaFormer, which tackles the above issues via conditional sequence modeling. In contrast to existing sequence models, we propose cost-related tokens to restrict the action space and a posterior safety verification to enforce the constraint explicitly. Specifically, SaFormer performs a two-stage auto-regression conditioned by the maximum remaining cost to generate feasible candidates. It then filters out unsafe attempts and executes the optimal action with the highest expected return. Extensive experiments demonstrate the efficacy of SaFormer featuring (1) competitive returns with tightened constraint satisfaction; (2) adaptability to the in-range cost values of the offline data without retraining; (3) generalizability for constraints beyond the current dataset.  ( 2 min )
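    The final filter-then-select step described above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` fields, the function name `choose_action`, and the fallback used when no candidate fits the budget are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    action: str              # hypothetical action label
    predicted_cost: float    # cost the model expects this action to incur
    predicted_return: float  # return the model expects from this action


def choose_action(candidates, remaining_cost):
    """Drop candidates whose predicted cost exceeds the budget, then pick the best return."""
    if not candidates:
        raise ValueError("need at least one candidate")
    feasible = [c for c in candidates if c.predicted_cost <= remaining_cost]
    if not feasible:
        # Assumed fallback: take the least costly candidate when nothing is feasible.
        return min(candidates, key=lambda c: c.predicted_cost)
    return max(feasible, key=lambda c: c.predicted_return)


if __name__ == "__main__":
    options = [
        Candidate("fast", predicted_cost=3.0, predicted_return=10.0),
        Candidate("steady", predicted_cost=1.0, predicted_return=7.0),
        Candidate("idle", predicted_cost=0.0, predicted_return=1.0),
    ]
    print(choose_action(options, remaining_cost=2.0).action)
```

    Because the budget is an argument rather than a learned constant, the same candidates can be re-filtered when the safety requirement changes.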
    Efficient Latency-Aware CNN Depth Compression via Two-Stage Dynamic Programming. (arXiv:2301.12187v1 [cs.LG])
    Recent works on neural network pruning advocate that reducing the depth of the network is more effective in reducing run-time memory usage and accelerating inference latency than reducing the width of the network through channel pruning. In this regard, some recent works propose depth compression algorithms that merge convolution layers. However, the existing algorithms have a constricted search space and rely on human-engineered heuristics. In this paper, we propose a novel depth compression algorithm which targets general convolution operations. We propose a subset selection problem that replaces inefficient activation layers with identity functions and optimally merges consecutive convolution operations into shallow equivalent convolution operations for efficient end-to-end inference latency. Since the proposed subset selection problem is NP-hard, we formulate a surrogate optimization problem that can be solved exactly via two-stage dynamic programming within a few seconds. We evaluate our methods and baselines by TensorRT for a fair inference latency comparison. Our method outperforms the baseline method with higher accuracy and faster inference speed in MobileNetV2 on the ImageNet dataset. Specifically, we achieve a $1.61\times$ speed-up with only a $0.62$%p accuracy drop in MobileNetV2-1.4 on ImageNet.
    A VAE-Bayesian Deep Learning Scheme for Solar Generation Forecasting based on Dimensionality Reduction. (arXiv:2103.12969v2 [cs.LG] UPDATED)
    The advancement of distributed generation technologies in modern power systems has led to a widespread integration of renewable power generation on the customer side. However, the intermittent nature of renewable energy poses new challenges to the network operational planning with underlying uncertainties. This paper proposes a novel Bayesian probabilistic technique for forecasting renewable solar generation that addresses data and model uncertainties by integrating bidirectional long short-term memory (BiLSTM) neural networks while compressing the weight parameters using a variational autoencoder (VAE). Existing Bayesian deep learning methods suffer from high computational complexity, as they require drawing a large number of samples from weight parameters expressed in the form of probability distributions. The proposed method can deal with uncertainty present in model and data in a more computationally efficient manner by reducing the dimensionality of model parameters. The proposed method is evaluated using quantile loss, reconstruction error, and deterministic forecasting evaluation metrics such as root-mean-square error. It is inferred from the numerical results that VAE-Bayesian BiLSTM outperforms other probabilistic and deterministic deep learning methods for solar power forecasting in terms of accuracy and computational efficiency for different sizes of the dataset.
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.
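    As a rough illustration of entropy-driven importance sampling, the sketch below keeps each candidate with a probability derived from the entropy of a model's predicted class distribution and records the inverse probability as its importance weight. The abstract does not give the exact sampling probability, so the normalized entropy, the probability floor `floor`, and all function names here are assumptions.

```python
import math
import random


def normalized_entropy(probs):
    """Shannon entropy of a class distribution, scaled to [0, 1] (needs >= 2 classes)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def select_batch(pred_probs, rng, floor=0.05):
    """Return (index, importance_weight) pairs for the sampled examples.

    Each example is kept with probability q (its normalized entropy, clipped
    to [floor, 1]) and weighted by 1 / q so that weighted sums stay unbiased.
    """
    selected = []
    for idx, probs in enumerate(pred_probs):
        q = min(1.0, max(floor, normalized_entropy(probs)))
        if rng.random() < q:
            selected.append((idx, 1.0 / q))
    return selected


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical predictions from a model trained on earlier batches.
    predictions = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
    print(select_batch(predictions, rng))
```

    Uncertain examples (near-uniform predictions) are selected almost always, while confidently predicted ones are selected rarely but carry large weights.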
    Relational Reasoning Networks. (arXiv:2106.00393v3 [cs.AI] UPDATED)
    Neuro-symbolic methods integrate neural architectures, knowledge representation and reasoning. However, they have struggled both to deal with the intrinsic uncertainty of the observations and to scale to real-world applications. This paper presents Relational Reasoning Networks (R2N), a novel end-to-end model that performs relational reasoning in the latent space of a deep learner architecture, where the representations of constants, ground atoms and their manipulations are learned in an integrated fashion. Unlike flat architectures like Knowledge Graph Embedders, which can only represent relations between entities, R2Ns define an additional computational structure, accounting for higher-level relations among the ground atoms. The considered relations can be explicitly known, like the ones defined by logic formulas, or defined as unconstrained correlations among groups of ground atoms. R2Ns can be applied to purely symbolic tasks or as a neuro-symbolic platform to integrate learning and reasoning in heterogeneous problems with both symbolic and feature-based represented entities. The proposed model overcomes the limitations of previous neuro-symbolic methods that have been either limited in terms of scalability or expressivity. The proposed methodology is shown to achieve state-of-the-art results in different experimental settings.
    APAC: Authorized Probability-controlled Actor-Critic For Offline Reinforcement Learning. (arXiv:2301.12130v1 [cs.LG])
    Due to the inability to interact with the environment, offline reinforcement learning (RL) methods face the challenge of estimating the Out-of-Distribution (OOD) points. Most existing methods exclude the OOD areas or restrict the value of the $Q$ function. However, these methods are either over-conservative or suffer from model uncertainty prediction. In this paper, we propose an authorized probabilistic-control policy learning (APAC) method. The proposed method learns the distribution characteristics of the feasible states/actions by utilizing the flow-GAN model. Specifically, APAC avoids taking actions in the low probability density region of the behavior policy, while allowing exploration in the authorized high probability density region. Theoretical proofs are provided to justify the advantage of APAC. Empirically, APAC outperforms existing alternatives on a variety of simulated tasks, and yields higher expected returns.
    Anticipate, Ensemble and Prune: Improving Convolutional Neural Networks via Aggregated Early Exits. (arXiv:2301.12168v1 [cs.LG])
    Today, artificial neural networks are the state of the art for solving a variety of complex tasks, especially in image classification. Such architectures consist of a sequence of stacked layers with the aim of extracting useful information and having it processed by a classifier to make accurate predictions. However, intermediate information within such models is often left unused. In other cases, such as in edge computing contexts, these architectures are divided into multiple partitions that are made functional by including early exits, i.e. intermediate classifiers, with the goal of reducing the computational and temporal load without extremely compromising the accuracy of the classifications. In this paper, we present Anticipate, Ensemble and Prune (AEP), a new training technique based on weighted ensembles of early exits, which aims at exploiting the information in the structure of networks to maximise their performance. Through a comprehensive set of experiments, we show how the use of this approach can yield average accuracy improvements of up to 15% over traditional training. In its hybrid-weighted configuration, AEP's internal pruning operation also allows reducing the number of parameters by up to 41%, lowering the number of multiplications and additions by 18% and the latency time to make inference by 16%. By using AEP, it is also possible to learn weights that allow early exits to achieve better accuracy values than those obtained from single-output reference models.
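    The idea of a weighted ensemble of early exits can be illustrated with a minimal sketch. This is not the AEP training procedure itself: the abstract does not specify how the weights are chosen, so the function name `weighted_exit_ensemble`, the example logits, and the weights below are all hypothetical and serve only to show how several exits' class probabilities might be combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_exit_ensemble(exit_logits, weights):
    """Average the class probabilities of several exits using normalized weights.

    exit_logits: one list of logits per exit (hypothetical values).
    weights: one non-negative weight per exit (an assumption; the source
             does not state how AEP sets them).
    """
    if len(exit_logits) != len(weights):
        raise ValueError("need exactly one weight per exit")
    total_weight = sum(weights)
    probs = [softmax(logits) for logits in exit_logits]
    n_classes = len(probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs)) / total_weight
        for c in range(n_classes)
    ]


# Three exits (two early, one final) voting over three classes.
exit_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 2.5, 0.2]]
weights = [1.0, 2.0, 3.0]  # deeper exits weighted more heavily (illustrative)
ensemble = weighted_exit_ensemble(exit_logits, weights)
predicted_class = max(range(len(ensemble)), key=ensemble.__getitem__)
print(ensemble, predicted_class)
```

    Because each exit contributes a probability vector and the weights are normalized, the combined output is again a probability distribution over the classes.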
    Towards Lossless ANN-SNN Conversion under Ultra-Low Latency with Dual-Phase Optimization. (arXiv:2205.07473v2 [cs.NE] UPDATED)
    Spiking neural networks (SNNs) operating with asynchronous discrete events show higher energy efficiency. A popular approach to implementing deep SNNs is ANN-SNN conversion, which combines efficient training of ANNs with efficient inference of SNNs. However, due to the intrinsic difference between ANNs and SNNs, the accuracy loss is usually non-negligible, especially with few simulation steps. This greatly restricts the application of SNNs on latency-sensitive edge devices. In this paper, we identify that such performance degradation stems from the misrepresentation of the negative or overflow residual membrane potential in SNNs. Inspired by this, we systematically analyze the conversion error between SNNs and ANNs, and then decompose it into three parts: quantization error, clipping error, and residual membrane potential representation error. With such insights, we propose a dual-phase conversion algorithm to minimize those errors separately. Besides, we show each phase achieves significant performance gains in a complementary manner. We evaluate our method on challenging datasets including CIFAR-10, CIFAR-100, and ImageNet. The experimental results show the proposed method achieves the state-of-the-art in terms of both accuracy and latency with promising energy preservation compared to ANNs. For instance, our method achieves an accuracy of 73.20% on CIFAR-100 in only 2 time steps with 15.7$\times$ less energy consumption.
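    To make the three error terms named above more concrete, here is a toy sketch. It assumes a textbook integrate-and-fire neuron with constant input and reset by subtraction, which is a common conversion setting but not necessarily the exact model used in the paper; the function names and the threshold and step values are illustrative only.

```python
def relu(x):
    """ANN activation that the spiking neuron is meant to approximate."""
    return max(0.0, x)


def simulate_if_neuron(input_current, threshold, steps):
    """Integrate-and-fire neuron with reset by subtraction and constant input.

    Returns (spike_count, residual_membrane_potential).
    """
    potential = 0.0
    spikes = 0
    for _ in range(steps):
        potential += input_current
        if potential >= threshold:
            spikes += 1
            potential -= threshold  # soft reset keeps the overshoot
    return spikes, potential


def snn_rate_output(input_current, threshold, steps):
    """Firing-rate readout scaled back to activation units, plus the residual."""
    spikes, residual = simulate_if_neuron(input_current, threshold, steps)
    return spikes * threshold / steps, residual


THRESHOLD = 1.0  # illustrative firing threshold
STEPS = 4        # very low latency, as in the ultra-low-step regime

for activation in (0.375, 1.5, -0.25):
    rate, residual = snn_rate_output(activation, THRESHOLD, STEPS)
    print(activation, relu(activation), rate, residual)
```

    In this toy setting, an input of 0.375 is read out as 0.25 (the rate can only take multiples of threshold/steps, a quantization effect), an input of 1.5 saturates at 1.0 (a clipping effect), and in every case some charge is left in the membrane potential after the last step, positive for the overflow case and negative for the negative input.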
    Unearthing InSights into Mars: unsupervised source separation with limited data. (arXiv:2301.11981v1 [cs.LG])
    Source separation entails the ill-posed problem of retrieving a set of source signals observed through a mixing operator. Solving this problem requires prior knowledge, which is commonly incorporated by imposing regularity conditions on the source signals or implicitly learned in supervised or unsupervised methods from existing data. While data-driven methods have shown great promise in source separation, they are often dependent on large amounts of data, which rarely exist in planetary space missions. Considering this challenge, we propose an unsupervised source separation scheme for domains with limited data access that involves solving an optimization problem in the wavelet scattering representation space, an interpretable low-dimensional representation of stationary processes. We present a real-data example in which we remove transient thermally induced microtilts, known as glitches, from data recorded by a seismometer during NASA's InSight mission on Mars. Owing to the wavelet scattering covariances' ability to capture non-Gaussian properties of stochastic processes, we are able to separate glitches using only a few glitch-free data snippets.
    Multi-Level Visual Similarity Based Personalized Tourist Attraction Recommendation Using Geo-Tagged Photos. (arXiv:2109.08275v2 [cs.MM] UPDATED)
    Geo-tagged photo based tourist attraction recommendation can discover users' travel preferences from their taken photos, so as to recommend suitable tourist attractions to them. However, existing visual content based methods cannot fully exploit the user and tourist attraction information of photos to extract visual features, and do not differentiate the significances of different photos. In this paper, we propose multi-level visual similarity based personalized tourist attraction recommendation using geo-tagged photos (MEAL). MEAL utilizes the visual contents of photos and interaction behavior data to obtain the final embeddings of users and tourist attractions, which are then used to predict the visit probabilities. Specifically, by crossing the user and tourist attraction information of photos, we define four visual similarity levels and introduce a corresponding quintuplet loss to embed the visual contents of photos. In addition, to capture the significances of different photos, we exploit the self-attention mechanism to obtain the visual representations of users and tourist attractions. We conducted experiments on a dataset crawled from Flickr, and the experimental results proved the advantage of this method.
    Neural Temporal Point Process for Forecasting Higher Order and Directional Interactions. (arXiv:2301.12210v1 [cs.LG])
    Real-world systems are made of interacting entities that evolve with time. Creating models that can forecast interactions by learning the dynamics of entities is an important problem in numerous fields. Earlier works used dynamic graph models to achieve this. However, real-world interactions are more complex than pairwise, as they involve more than two entities, and many of these higher-order interactions have directional components. Examples include communication networks, such as email exchanges that involve a sender and multiple recipients, and citation networks, where authors draw upon the work of others. In this paper, we solve the problem of higher-order directed interaction forecasting by proposing a deep neural network-based model, Directed HyperNode Temporal Point Process, for directed hyperedge event forecasting, as hyperedges provide a native framework for modeling relationships among a variable number of nodes. Our proposed technique reduces the search space of possible candidate hyperedges by first forecasting the nodes at which events will be observed, based on which it generates candidate hyperedges. To demonstrate the efficiency of our model, we curated four datasets and conducted an extensive empirical study. We believe that this is the first work that solves the problem of forecasting higher-order directional interactions.
    Prompt-Based Editing for Text Style Transfer. (arXiv:2301.11997v1 [cs.CL])
    Prompting approaches have been recently explored in text style transfer, where a textual prompt is used to query a pretrained language model to generate style-transferred texts word by word in an autoregressive manner. However, such a generation process is less controllable and early prediction errors may affect future word predictions. In this paper, we present a prompt-based editing approach for text style transfer. Specifically, we prompt a pretrained language model for style classification and use the classification probability to compute a style score. Then, we perform discrete search with word-level editing to maximize a comprehensive scoring function for the style-transfer task. In this way, we transform a prompt-based generation problem into a classification one, which is a training-free process and more controllable than the autoregressive generation of sentences. In our experiments, we performed both automatic and human evaluation on three style-transfer benchmark datasets, and show that our approach largely outperforms the state-of-the-art systems that have 20 times more parameters. Additional empirical analyses further demonstrate the effectiveness of our approach.
    AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning. (arXiv:2301.12132v1 [cs.CL])
    Large pretrained language models have been widely used in downstream NLP tasks via task-specific fine-tuning. Recently, an array of Parameter-Efficient Fine-Tuning (PEFT) methods have also achieved strong task performance while updating a much smaller number of parameters compared to full model tuning. However, it is non-trivial to make informed per-task design choices (i.e., to create PEFT configurations) concerning the selection of PEFT architectures and modules, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually set PEFT configurations might be suboptimal for many tasks from the perspective of the performance-to-efficiency trade-off. To address the core question of the PEFT configuration selection that aims to control and maximise the balance between performance and parameter efficiency, we first define a rich configuration search space spanning multiple representative PEFT modules along with finer-grained configuration decisions over the modules (e.g., parameter budget, insertion layer). We then propose AutoPEFT, a novel framework to traverse this configuration space: it automatically configures multiple PEFT modules via high-dimensional Bayesian optimisation. We show the resource scalability and task transferability of AutoPEFT-found configurations, outperforming existing PEFT methods on average on the standard GLUE benchmark while conducting the configuration search on a single task. The per-task AutoPEFT-based configuration search even outperforms full-model fine-tuning.
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed Exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
    Vertex-based reachability analysis for verifying ReLU deep neural networks. (arXiv:2301.12001v1 [cs.LG])
    Neural networks have achieved high performance across different tasks, e.g., image identification, voice recognition and other applications. Despite their success, these models are still vulnerable to small perturbations, which can be used to craft so-called adversarial examples. Different approaches have been proposed to circumvent their vulnerability, including formal verification systems, which employ a variety of techniques, including reachability, optimization and search procedures, to verify that the model satisfies some property. In this paper we propose three novel reachability algorithms for verifying deep neural networks with ReLU activations. The first and third algorithms compute an over-approximation of the reachable set, whereas the second one computes the exact reachable set. Unlike previously proposed approaches, our algorithms take as input a V-polytope. Our experiments on the ACAS Xu problem show that the Exact Polytope Network Mapping (EPNM) reachability algorithm proposed in this work surpasses the state-of-the-art results from the literature, especially in relation to other reachability methods.
    TIDo: Source-free Task Incremental Learning in Non-stationary Environments. (arXiv:2301.12055v1 [cs.LG])
    This work presents an incremental learning approach for autonomous agents to learn new tasks in a non-stationary environment. Updating a DNN model-based agent to learn new target tasks requires us to store past training data and needs a large labeled target task dataset. Few-shot task incremental learning methods overcome the limitation of labeled target datasets by adapting trained models to learn private target classes using a few labeled representatives and a large unlabeled target dataset. However, the methods assume that the source and target tasks are stationary. We propose a one-shot task incremental learning approach that can adapt to non-stationary source and target tasks. Our approach minimizes adversarial discrepancy between the model's feature space and incoming incremental data to learn an updated hypothesis. We also use distillation loss to reduce catastrophic forgetting of previously learned tasks. Finally, we use Gaussian prototypes to generate exemplar instances eliminating the need to store past training data. Unlike current work in task incremental learning, our model can learn both source and target task updates incrementally. We evaluate our method on various problem settings for incremental object detection and disease prediction model update. We evaluate our approach by measuring the performance of shared class and target private class prediction. Our results show that our approach achieved improved performance compared to existing state-of-the-art task incremental learning methods.
    Analyzing Robustness of the Deep Reinforcement Learning Algorithm in Ramp Metering Applications Considering False Data Injection Attack and Defense. (arXiv:2301.12036v1 [cs.LG])
    Decades of ramp metering practice, controlling downstream volume and smoothing the interweaving traffic, have proved that ramp metering can decrease total travel time, mitigate shockwaves, decrease rear-end collisions, reduce pollution, etc. Besides traditional methods like the ALINEA algorithm, Deep Reinforcement Learning algorithms have recently been established to build finer control of ramp metering. However, those Deep Learning models may be vulnerable to adversarial attacks. Thus, it is important to investigate the robustness of those models under False Data Injection adversarial attacks. Furthermore, algorithms capable of telling anomalous data from clean data are the key to safeguarding Deep Learning algorithms. In this study, an online algorithm that can distinguish adversarial data from clean data is tested. Results show that in most cases anomalous data can be distinguished from clean data, although the difference is too small to be distinguished manually by humans. In practice, whenever adversarial/hazardous data is detected, the system can fall back to a fixed control program, and experts should then investigate the detector's status or security protocols before real damage happens.
    Autoencoder-Based Unequal Error Protection Codes. (arXiv:2301.12231v1 [cs.IT])
    Most of today's communication systems are designed to target reliable message recovery after receiving the entire encoded message (codeword). However, in many practical scenarios, the transmission process may be interrupted before receiving the complete codeword. This paper proposes a novel rateless autoencoder (AE)-based code design suitable for decoding the transmitted message before the noisy codeword is fully received. Using particular dropout strategies applied during the training process, rateless AE codes allow a trade-off between decoding delay and reliability, providing a graceful improvement of the latter with each additionally received codeword symbol. The proposed rateless AEs significantly outperform the conventional AE designs for scenarios where it is desirable to trade off reliability for lower decoding delay.
    SEGA: Instructing Diffusion using Semantic Dimensions. (arXiv:2301.12247v1 [cs.CV])
    Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.
    Violation-Aware Contextual Bayesian Optimization for Controller Performance Optimization with Unmodeled Constraints. (arXiv:2301.12099v1 [cs.LG])
    We study the problem of performance optimization of closed-loop control systems with unmodeled dynamics. Bayesian optimization (BO) has been demonstrated to be effective for improving closed-loop performance by automatically tuning controller gains or reference setpoints in a model-free manner. However, BO methods have rarely been tested on dynamical systems with unmodeled constraints and time-varying ambient conditions. In this paper, we propose a violation-aware contextual BO algorithm (VACBO) that optimizes closed-loop performance while simultaneously learning constraint-feasible solutions under time-varying ambient conditions. Unlike classical constrained BO methods which allow unlimited constraint violations, or 'safe' BO algorithms that are conservative and try to operate with near-zero violations, we allow budgeted constraint violations to improve constraint learning and accelerate optimization. We demonstrate the effectiveness of our proposed VACBO method for energy minimization of industrial vapor compression systems under time-varying ambient temperature and humidity.
    Reachability Analysis of Neural Network Control Systems. (arXiv:2301.12100v1 [cs.LG])
    Neural network controllers (NNCs) have shown great promise in autonomous and cyber-physical systems. Despite the various verification approaches for neural networks, the safety analysis of NNCs remains an open problem. Existing verification approaches for neural network control systems (NNCSs) either work only for a limited set of activation functions, or result in non-trivial over-approximation errors as time evolves. This paper proposes a verification framework for NNCSs based on Lipschitzian optimisation, called DeepNNC. We first prove the Lipschitz continuity of closed-loop NNCSs by unrolling and eliminating the loops. We then reveal the working principles of applying Lipschitzian optimisation to NNCS verification and illustrate it by verifying an adaptive cruise control model. Compared to state-of-the-art verification approaches, DeepNNC shows superior performance in terms of efficiency and accuracy over a wide range of NNCs. We also provide a case study to demonstrate the capability of DeepNNC to handle a real-world, practical, and complex system. Our tool DeepNNC is available at https://github.com/TrustAI/DeepNNC.
    Chaos as an interpretable benchmark for forecasting and data-driven modelling. (arXiv:2110.05266v2 [cs.LG] UPDATED)
    The striking fractal geometry of strange attractors underscores the generative nature of chaos: like probability distributions, chaotic systems can be repeatedly measured to produce arbitrarily-detailed information about the underlying attractor. Chaotic systems thus pose a unique challenge to modern statistical learning techniques, while retaining quantifiable mathematical properties that make them controllable and interpretable as benchmarks. Here, we present a growing database currently comprising 131 known chaotic dynamical systems spanning fields such as astrophysics, climatology, and biochemistry. Each system is paired with precomputed multivariate and univariate time series. Our dataset has comparable scale to existing static time series databases; however, our systems can be re-integrated to produce additional datasets of arbitrary length and granularity. Our dataset is annotated with known mathematical properties of each system, and we perform feature analysis to broadly categorize the diverse dynamics present across the collection. Chaotic systems inherently challenge forecasting models, and across extensive benchmarks we correlate forecasting performance with the degree of chaos present. We also exploit the unique generative properties of our dataset in several proof-of-concept experiments: surrogate transfer learning to improve time series classification, importance sampling to accelerate model training, and benchmarking symbolic regression algorithms.
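    The claim that such systems "can be re-integrated to produce additional datasets of arbitrary length and granularity" can be illustrated with a short sketch. It integrates the classic Lorenz system with a fixed-step fourth-order Runge-Kutta scheme; the Lorenz system and its standard parameters are used here only as a familiar example, and none of the function names below come from the database described in the abstract.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations with the standard parameters."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for dstate/dt = f(state)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(f, initial_state, dt, n_steps):
    """Return the trajectory as a list of n_steps + 1 states."""
    trajectory = [tuple(initial_state)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(f, trajectory[-1], dt))
    return trajectory


# Same system and time span, two different sampling granularities.
coarse = integrate(lorenz, (1.0, 1.0, 1.0), dt=0.01, n_steps=1000)
fine = integrate(lorenz, (1.0, 1.0, 1.0), dt=0.005, n_steps=2000)
print(len(coarse), len(fine), coarse[-1])
```

    Changing `dt` and `n_steps` yields a multivariate time series of any desired resolution and length from the same governing equations, which is what distinguishes this kind of benchmark from a static time series archive.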
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.
    Node Injection for Class-specific Network Poisoning. (arXiv:2301.12277v1 [cs.LG])
    Graph Neural Networks (GNNs) are powerful in learning rich network representations that aid the performance of downstream tasks. However, recent studies showed that GNNs are vulnerable to adversarial attacks involving node injection and network perturbation. Among these, node injection attacks are more practical as they don't require manipulation in the existing network and can be performed more realistically. In this paper, we propose a novel problem statement - a class-specific poison attack on graphs in which the attacker aims to misclassify specific nodes in the target class into a different class using node injection. Additionally, nodes are injected in such a way that they camouflage as benign nodes. We propose NICKI, a novel attacking strategy that utilizes an optimization-based approach to sabotage the performance of GNN-based node classifiers. NICKI works in two phases - it first learns the node representation and then generates the features and edges of the injected nodes. Extensive experiments and ablation studies on four benchmark networks show that NICKI is consistently better than four baseline attacking strategies for misclassifying nodes in the target class. We also show that the injected nodes are properly camouflaged as benign, thus making the poisoned graph indistinguishable from its clean version w.r.t various topological properties.
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. Global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al. (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al. (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in the CUTEst test set and on constrained logistic regression problems.
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.
    A Closer Look at Few-shot Classification Again. (arXiv:2301.12246v1 [cs.LG])
    Few-shot classification consists of a training phase where a model is learned on a relatively large dataset and an adaptation phase where the learned model is adapted to previously-unseen tasks with limited labeled samples. In this paper, we empirically prove that the training algorithm and the adaptation algorithm can be completely disentangled, which allows algorithm analysis and design to be done individually for each phase. Our meta-analysis for each phase reveals several interesting insights that may help better understand key aspects of few-shot classification and connections with other fields such as visual representation learning and transfer learning. We hope the insights and research challenges revealed in this paper can inspire future work in related directions.
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
    Improved knowledge distillation by utilizing backward pass knowledge in neural networks. (arXiv:2301.12006v1 [cs.LG])
    Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to better match the output of the student model to that of the teacher model based on the knowledge extracted from the forward pass of the teacher network. Although conventional KD is effective for matching the two networks over the given data points, there is no guarantee that these models would match in other areas for which we do not have enough training samples. In this work, we address that problem by generating new auxiliary training samples based on extracting knowledge from the backward pass of the teacher in the areas where the student diverges greatly from the teacher. We compute the difference between the teacher and the student and generate new data samples that maximize the divergence. This is done by perturbing data samples in the direction of the gradient of the difference between the student and the teacher. Augmenting the training set with these auxiliary samples improves the performance of KD significantly and leads to a closer match between the student and the teacher. Applying this approach when data samples come from a discrete domain, such as applications of natural language processing (NLP) and language understanding, is not trivial. However, we show how this technique can be used successfully in such applications. We evaluated the performance of our method on various tasks in computer vision and NLP domains and obtained promising results.
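    The perturbation step described above, moving a sample along the gradient of the teacher-student difference so that the divergence grows, can be sketched as follows. This is a toy illustration under strong assumptions: both models are linear with hypothetical weights, the divergence is a squared output gap, and the gradient is written analytically, whereas the paper would obtain it from the backward pass of real networks.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical linear teacher and student, used only for illustration.
TEACHER_W = [1.0, -2.0, 0.5]
STUDENT_W = [0.8, -1.5, 0.9]


def teacher(x):
    return dot(TEACHER_W, x)


def student(x):
    return dot(STUDENT_W, x)


def divergence(x):
    """Squared gap between teacher and student outputs at x."""
    return (teacher(x) - student(x)) ** 2


def divergence_gradient(x):
    """Analytic gradient of the squared gap with respect to the input x."""
    gap = teacher(x) - student(x)
    return [2.0 * gap * (wt - ws) for wt, ws in zip(TEACHER_W, STUDENT_W)]


def perturb(x, step_size=0.1, n_steps=5):
    """Gradient ascent on the divergence, starting from a training sample."""
    for _ in range(n_steps):
        grad = divergence_gradient(x)
        x = [xi + step_size * gi for xi, gi in zip(x, grad)]
    return x


sample = [1.0, 1.0, 1.0]
auxiliary = perturb(sample)
# The auxiliary input is labeled by the teacher and added to the training set.
auxiliary_pair = (auxiliary, teacher(auxiliary))
print(divergence(sample), divergence(auxiliary))
```

    The auxiliary sample lands in a region where the student disagrees with the teacher more than at the original point, so training on it targets exactly the areas the original data did not cover.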
    Controlling Steering with Energy-Based Models. (arXiv:2301.12264v1 [cs.RO])
    So-called implicit behavioral cloning with energy-based models has shown promising results in robotic manipulation tasks. We tested if the method's advantages carry on to controlling the steering of a real self-driving car with an end-to-end driving model. We performed an extensive comparison of the implicit behavioral cloning approach with explicit baseline approaches, all sharing the same neural network backbone architecture. Baseline explicit models were trained with regression (MAE) loss, classification loss (softmax and cross-entropy on a discretization), or as mixture density networks (MDN). While models using the energy-based formulation performed comparably to baseline approaches in terms of safety driver interventions, they had a higher whiteness measure, indicating higher jerk. To alleviate this, we show two methods that can be used to improve the smoothness of steering. We confirmed that energy-based models handle multimodalities slightly better than simple regression, but this did not translate to significantly better driving ability. We argue that the steering-only road-following task has too few multimodalities to benefit from energy-based models. This shows that applying implicit behavioral cloning to real-world tasks can be challenging, and further investigation is needed to bring out the theoretical advantages of energy-based models.
    Optimization for Amortized Inverse Problems. (arXiv:2210.13983v3 [cs.LG] UPDATED)
    Incorporating a deep generative model as the prior distribution in inverse problems has established substantial success in reconstructing images from corrupted observations. Notwithstanding, the existing optimization approaches use gradient descent largely without adapting to the non-convex nature of the problem and can be sensitive to initial values, impeding further performance improvement. In this paper, we propose an efficient amortized optimization scheme for inverse problems with a deep generative prior. Specifically, the optimization task with high degrees of difficulty is decomposed into optimizing a sequence of much easier ones. We provide a theoretical guarantee of the proposed algorithm and empirically validate it on different inverse problems. As a result, our approach outperforms baseline methods qualitatively and quantitatively by a large margin.
    Policy Gradient Methods for Distortion Risk Measures. (arXiv:2107.04422v6 [cs.LG] UPDATED)
    We propose policy gradient algorithms which learn risk-sensitive policies in a reinforcement learning (RL) framework. Our proposed algorithms maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process in on-policy as well as off-policy RL settings. We derive a variant of the policy gradient theorem that caters to the DRM objective, and use this theorem in conjunction with a likelihood ratio-based gradient estimation scheme. We derive non-asymptotic bounds that establish the convergence of our proposed algorithms to an approximate stationary point of the DRM objective.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional "content" latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining "texture" latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models. (arXiv:2211.17091v2 [cs.CV] UPDATED)
    While there are many score-based models with various diffusing strategies as well as many numerical schemes of the denoising process, only a few works have explored the score part of the generative SDE. This paper introduces a new generative SDE with score adjustment using an auxiliary discriminator. The goal is to improve the original generative process of a pre-trained diffusion model by estimating the gap between the pre-trained score estimation and the true data score. This is done by training a discriminator that classifies diffused real data and diffused sample data. The gap estimation is then used to adjust the pre-trained score network. In experiments, the method enables new SOTA FIDs of 1.77/1.64 on unconditional/conditional CIFAR-10, and new SOTA FID/sFID of 3.18/4.53 on ImageNet 256x256.  ( 2 min )
    Online Markov Decision Processes with Non-oblivious Strategic Adversary. (arXiv:2110.03604v3 [cs.LG] UPDATED)
    We study a novel setting in Online Markov Decision Processes (OMDPs) where the loss function is chosen by a non-oblivious strategic adversary who follows a no-external regret algorithm. In this setting, we first demonstrate that MDP-Expert, an existing algorithm that works well with oblivious adversaries, can still apply and achieve a policy regret bound of $\mathcal{O}(\sqrt{T \log(L)}+\tau^2\sqrt{ T \log(|A|)})$ where $L$ is the size of the adversary's pure strategy set and $|A|$ denotes the size of the agent's action space. Considering real-world games where the support size of a NE is small, we further propose a new algorithm: MDP-Online Oracle Expert (MDP-OOE), that achieves a policy regret bound of $\mathcal{O}(\sqrt{T\log(L)}+\tau^2\sqrt{ T k \log(k)})$ where $k$ depends only on the support size of the NE. MDP-OOE leverages the key benefit of Double Oracle in game theory and thus can solve games with prohibitively large action spaces. Finally, to better understand the learning dynamics of no-regret methods, under the same setting of a no-external regret adversary in OMDPs, we introduce an algorithm that achieves last-round convergence to a NE. To the best of our knowledge, this is the first work to obtain a last-iteration result in OMDPs.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Perona: Robust Infrastructure Fingerprinting for Resource-Efficient Big Data Analytics. (arXiv:2211.08227v2 [cs.DC] UPDATED)
    Choosing a good resource configuration for big data analytics applications can be challenging, especially in cloud environments. Automated approaches are desirable as poor decisions can reduce performance and raise costs. The majority of existing automated approaches either build performance models from previous workload executions or conduct iterative resource configuration profiling until a near-optimal solution has been found. In doing so, they only obtain an implicit understanding of the underlying infrastructure, which is difficult to transfer to alternative infrastructures and, thus, profiling and modeling insights are not sustained beyond very specific situations. We present Perona, a novel approach to robust infrastructure fingerprinting for usage in the context of big data analytics. Perona employs common sets and configurations of benchmarking tools for target resources, so that resulting benchmark metrics are directly comparable and ranking is enabled. Insignificant benchmark metrics are discarded by learning a low-dimensional representation of the input metric vector, and previous benchmark executions are taken into consideration for context-awareness as well, allowing to detect resource degradation. We evaluate our approach both on data gathered from our own experiments as well as within related works for resource configuration optimization, demonstrating that Perona captures the characteristics from benchmark runs in a compact manner and produces representations that can be used directly.  ( 2 min )
    Multi-Center Federated Learning: Clients Clustering for Better Personalization. (arXiv:2108.08647v3 [cs.LG] UPDATED)
    Personalized decision-making can be implemented in a Federated learning (FL) framework that can collaboratively train a decision model by extracting knowledge across intelligent clients, e.g. smartphones or enterprises. FL can mitigate the data privacy risk of collaborative training since it merely collects local gradients from users without access to their data. However, FL is fragile in the presence of statistical heterogeneity that is commonly encountered in personalized decision-making, e.g., non-IID data over different clients. Existing FL approaches usually update a single global model to capture the shared knowledge of all users by aggregating their gradients, regardless of the discrepancy between their data distributions. By comparison, a mixture of multiple global models could capture the heterogeneity across various clients if assigning the client to different global models (i.e., centers) in FL. To this end, we propose a novel multi-center aggregation mechanism to cluster clients using their models' parameters. It learns multiple global models from data as the cluster centers, and simultaneously derives the optimal matching between users and centers. We then formulate it as an optimization problem that can be efficiently solved by a stochastic expectation maximization (EM) algorithm. Experiments on multiple benchmark datasets of FL show that our method outperforms several popular baseline methods. The experimental source codes are publicly available on the Github repository https://github.com/mingxuts/multi-center-fed-learning .  ( 2 min )
    Neural Integral Equations. (arXiv:2209.15190v3 [cs.LG] UPDATED)
    Integral equations (IEs) are equations that model spatiotemporal systems with non-local interactions. They have found important applications throughout theoretical and applied sciences, including in physics, chemistry, biology, and engineering. While efficient algorithms exist for solving given IEs, no method exists that can learn an IE and its associated dynamics from data alone. In this paper, we introduce Neural Integral Equations (NIE), a method that learns an unknown integral operator from data through an IE solver. We also introduce Attentional Neural Integral Equations (ANIE), where the integral is replaced by self-attention, which improves scalability and model capacity. We demonstrate that (A)NIE outperforms other methods in both speed and accuracy on several benchmark tasks in ODE, PDE, and IE systems of synthetic and real-world data.  ( 2 min )
    Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks. (arXiv:2301.06646v2 [cs.LG] UPDATED)
    Federated Learning (FL) has gained increasing interest in recent years as a distributed on-device learning paradigm. However, multiple challenges remain to be addressed for deploying FL in real-world Internet-of-Things (IoT) networks with hierarchies. Although existing works have proposed various approaches to account for data heterogeneity, system heterogeneity, unexpected stragglers and scalability, none of them provides a systematic solution to address all of the challenges in a hierarchical and unreliable IoT network. In this paper, we propose an asynchronous and hierarchical framework (Async-HFL) for performing FL in a common three-tier IoT network architecture. In response to the largely varied delays, Async-HFL employs asynchronous aggregations at both the gateway and the cloud levels, thus avoiding long waiting times. To fully unleash the potential of Async-HFL in convergence speed under system heterogeneities and stragglers, we design device selection at the gateway level and device-gateway association at the cloud level. Device selection chooses edge devices to trigger local training in real time, while device-gateway association determines the network topology periodically after several cloud epochs, both satisfying bandwidth limitations. We evaluate Async-HFL's convergence speedup using large-scale simulations based on ns-3 and a network topology from NYCMesh. Our results show that Async-HFL converges 1.08-1.31x faster in wall-clock time and saves up to 21.6% total communication cost compared to state-of-the-art asynchronous FL algorithms (with client selection). We further validate Async-HFL on a physical deployment and observe robust convergence under unexpected stragglers.  ( 2 min )
    Machine Learning Accelerators in 2.5D Chiplet Platforms with Silicon Photonics. (arXiv:2301.12252v1 [cs.AR])
    Domain-specific machine learning (ML) accelerators such as Google's TPU and Apple's Neural Engine now dominate CPUs and GPUs for energy-efficient ML processing. However, the evolution of electronic accelerators is facing fundamental limits due to the limited computation density of monolithic processing chips and the reliance on slow metallic interconnects. In this paper, we present a vision of how optical computation and communication can be integrated into 2.5D chiplet platforms to drive an entirely new class of sustainable and scalable ML hardware accelerators. We describe how cross-layer design and fabrication of optical devices, circuits, and architectures, and hardware/software codesign can help design efficient photonics-based 2.5D chiplet platforms to accelerate emerging ML workloads.  ( 2 min )
    Mutual Wasserstein Discrepancy Minimization for Sequential Recommendation. (arXiv:2301.12197v1 [cs.LG])
    Self-supervised sequential recommendation significantly improves recommendation performance by maximizing mutual information with well-designed data augmentations. However, the mutual information estimation is based on the calculation of Kullback-Leibler (KL) divergence, which has several limitations, including asymmetrical estimation, an exponential sample-size requirement, and training instability. Also, existing data augmentations are mostly stochastic and can potentially break sequential correlations with random modifications. These two issues motivate us to investigate an alternative robust mutual information measurement capable of modeling uncertainty and alleviating KL divergence limitations. To this end, we propose a novel self-supervised learning framework based on Mutual WasserStein discrepancy minimization (MStein) for sequential recommendation. We propose the Wasserstein Discrepancy Measurement to measure the mutual information between augmented sequences. Wasserstein Discrepancy Measurement builds upon the 2-Wasserstein distance, which is more robust, more efficient in small batch sizes, and able to model the uncertainty of stochastic augmentation processes. We also propose a novel contrastive learning loss based on Wasserstein Discrepancy Measurement. Extensive experiments on four benchmark datasets demonstrate the effectiveness of MStein over baselines. More quantitative analyses show the robustness against perturbations and training efficiency in batch size. Finally, improvement analysis indicates better representations of popular users or items with significant uncertainty. The source code is at https://github.com/zfan20/MStein.  ( 2 min )
    Stochastic Dimension-reduced Second-order Methods for Policy Optimization. (arXiv:2301.12174v1 [math.OC])
    In this paper, we propose several new stochastic second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration, making them computationally efficient and comparable to policy gradient methods. Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust region subproblem. We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching approximate first-order stationary condition and certain subspace second-order stationary condition. In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity to $\mathcal{O}(\epsilon^{-3})$ based on the variance reduction technique. Preliminary experiments show that our proposed algorithms perform favorably compared with stochastic and variance-reduced policy gradient methods.  ( 2 min )
    Adapting Neural Link Predictors for Complex Query Answering. (arXiv:2301.12313v1 [cs.LG])
    Answering complex queries on incomplete knowledge graphs is a challenging task where a model needs to answer complex logical queries in the presence of missing knowledge. Recently, Arakelyan et al. (2021); Minervini et al. (2022) showed that neural link predictors could also be used for answering complex queries: their Continuous Query Decomposition (CQD) method works by decomposing complex queries into atomic sub-queries, answers them using neural link predictors and aggregates their scores via t-norms for ranking the answers to each complex query. However, CQD does not handle negations and only uses the training signal from atomic training queries: neural link prediction scores are not calibrated to interact together via fuzzy logic t-norms during complex query answering. In this work, we propose to address this problem by training a parameter-efficient score adaptation model to re-calibrate neural link prediction scores: this new component is trained on complex queries by back-propagating through the complex query-answering process. Our method, CQD$^{A}$, produces significantly more accurate results than current state-of-the-art methods, improving from $34.4$ to $35.1$ Mean Reciprocal Rank values averaged across all datasets and query types while using $\leq 35\%$ of the available training query types. We further show that CQD$^{A}$ is data-efficient, achieving competitive results with only $1\%$ of the training data, and robust in out-of-domain evaluations.
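    To make the t-norm aggregation mentioned in the abstract above concrete, here is a minimal sketch of how per-atom link-prediction scores could be combined for a two-atom conjunctive query. The scores are made-up numbers in [0, 1], the function names are hypothetical, and the sketch does not include the score adaptation model that the paper trains.

```python
import numpy as np

def t_norm(a, b, kind="product"):
    """Fuzzy conjunction of two score vectors with values in [0, 1]."""
    if kind == "product":
        return a * b
    if kind == "godel":
        return np.minimum(a, b)
    raise ValueError(f"unknown t-norm: {kind}")

def t_conorm(a, b, kind="product"):
    """Fuzzy disjunction dual to the t-norm of the same kind."""
    if kind == "product":
        return a + b - a * b
    if kind == "godel":
        return np.maximum(a, b)
    raise ValueError(f"unknown t-conorm: {kind}")

# Hypothetical link-predictor scores of four candidate answers for each atom.
scores_atom_1 = np.array([0.9, 0.2, 0.6, 0.5])
scores_atom_2 = np.array([0.8, 0.9, 0.1, 0.5])

# Score of the conjunction "atom 1 AND atom 2" for every candidate.
conjunction = t_norm(scores_atom_1, scores_atom_2)

# Candidates ranked from best to worst under the aggregated score.
ranking = np.argsort(-conjunction)
```

    With the product t-norm a candidate ranks highly only if both atoms score well, which is why poorly calibrated atom scores can distort the final ranking.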
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, clinical trials, etc. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under the attack of rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm not only works under the attack of rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.
    Unbiased and Efficient Self-Supervised Incremental Contrastive Learning. (arXiv:2301.12104v1 [cs.LG])
    Contrastive Learning (CL) has been proved to be a powerful self-supervised approach for a wide range of domains, including computer vision and graph representation learning. However, the incremental learning issue of CL has rarely been studied, which brings the limitation in applying it to real-world applications. Contrastive learning identifies the samples with the negative ones from the noise distribution that changes in the incremental scenarios. Therefore, only fitting the change of data without noise distribution causes bias, and directly retraining results in low efficiency. To bridge this research gap, we propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of (i) a novel Incremental InfoNCE (NCE-II) loss function by estimating the change of noise distribution for old data to guarantee no bias with respect to the retraining, (ii) a meta-optimization with deep reinforced Learning Rate Learning (LRL) mechanism which can adaptively learn the learning rate according to the status of the training processes and achieve fast convergence which is critical for incremental learning. Theoretically, the proposed ICL is equivalent to retraining, which is based on solid mathematical derivation. In practice, extensive experiments in different domains demonstrate that, without retraining a new model, ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.  ( 2 min )
    EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval. (arXiv:2301.12005v1 [cs.LG])
    Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). In this paper, we aim to improve distillation methods that pave the way for the deployment of such models in practice. The proposed distillation approach supports both retrieval and re-ranking stages and crucially leverages the relative geometry among queries and documents learned by the large teacher model. It goes beyond existing distillation methods in the IR literature, which simply rely on the teacher's scalar scores over the training data, on two fronts: providing stronger signals about local geometry via embedding matching and attaining better coverage of data manifold globally via query generation. Embedding matching provides a stronger signal to align the representations of the teacher and student models. At the same time, query generation explores the data manifold to reduce the discrepancies between the student and teacher where training data is sparse. Our distillation approach is theoretically justified and applies to both dual encoder (DE) and cross-encoder (CE) models. Furthermore, for distilling a CE model to a DE model via embedding matching, we propose a novel dual pooling-based scorer for the CE model that facilitates a distillation-friendly embedding geometry, especially for DE student models.  ( 2 min )
    Deep Operator Learning Lessens the Curse of Dimensionality for PDEs. (arXiv:2301.12227v1 [cs.LG])
    Deep neural networks (DNNs) have seen tremendous success in many fields and their developments in PDE-related problems are rapidly growing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems. We apply our results to various PDEs, including elliptic equations, parabolic equations, and Burgers equations.
    Meta-Learning Parameterized Skills. (arXiv:2206.03597v2 [cs.LG] UPDATED)
    We propose a novel parameterized skill-learning algorithm that aims to learn transferable parameterized skills and synthesize them into a new action space that supports efficient learning in long-horizon tasks. We propose to leverage off-policy Meta-RL combined with a trajectory-centric smoothness term to learn a set of parameterized skills. Our agent can use these learned skills to construct a three-level hierarchical framework that models a Temporally-extended Parameterized Action Markov Decision Process. We empirically demonstrate that the proposed algorithms enable an agent to solve a set of difficult long-horizon (obstacle-course and robot manipulation) tasks.  ( 2 min )
    MetaNO: How to Transfer Your Knowledge on Learning Hidden Physics. (arXiv:2301.12095v1 [cs.LG])
    Gradient-based meta-learning methods have primarily been applied to classical machine learning tasks such as image classification. Recently, PDE-solving deep learning methods, such as neural operators, are starting to make an important impact on learning and predicting the response of a complex physical system directly from observational data. Since the data acquisition in this context is commonly challenging and costly, the call of utilization and transfer of existing knowledge to new and unseen physical systems is even more acute. Herein, we propose a novel meta-learning approach for neural operators, which can be seen as transferring the knowledge of solution operators between governing (unknown) PDEs with varying parameter fields. Our approach is a provably universal solution operator for multiple PDE solving tasks, with a key theoretical observation that underlying parameter fields can be captured in the first layer of neural operator models, in contrast to typical final-layer transfer in existing meta-learning methods. As applications, we demonstrate the efficacy of our proposed approach on PDE-based datasets and a real-world material modeling problem, illustrating that our method can handle complex and nonlinear physical response learning tasks while greatly improving the sampling efficiency in unseen tasks.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
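    A rough sketch of the layer described in the abstract above: each output is drawn from a Gaussian whose mean and variance come from two sub-layers. The linear sub-layers, the softplus used to keep the variance positive, and all names (`w_mean`, `w_var`, `variational_layer`) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def variational_layer(x, w_mean, w_var, rng):
    """Sample layer outputs from N(mean(x), var(x))."""
    mean = x @ w_mean
    # Softplus keeps the predicted variance positive (an assumed choice).
    var = np.log1p(np.exp(x @ w_var))
    return mean + np.sqrt(var) * rng.standard_normal(mean.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))        # a single input with 8 features
w_mean = rng.normal(size=(8, 4))   # weights of the mean sub-layer
w_var = rng.normal(size=(8, 4))    # weights of the variance sub-layer

# Repeated stochastic forward passes give an empirical predictive distribution.
samples = np.stack([variational_layer(x, w_mean, w_var, rng) for _ in range(100)])
pred_mean = samples.mean(axis=0)
pred_std = samples.std(axis=0)     # spread across passes as an uncertainty estimate
```

    The weights stay fixed across passes; the randomness comes only from sampling the layer outputs, in contrast to BNNs that sample the weights.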
    Selecting Models based on the Risk of Damage Caused by Adversarial Attacks. (arXiv:2301.12151v1 [cs.LG])
    Regulation, legal liabilities, and societal concerns challenge the adoption of AI in safety and security-critical applications. One of the key concerns is that adversaries can cause harm by manipulating model predictions without being detected. Regulation hence demands an assessment of the risk of damage caused by adversaries. Yet, there is no method to translate this high-level demand into actionable metrics that quantify the risk of damage. In this article, we propose a method to model and statistically estimate the probability of damage arising from adversarial attacks. We show that our proposed estimator is statistically consistent and unbiased. In experiments, we demonstrate that the estimation results of our method have a clear and actionable interpretation and outperform conventional metrics. We then show how operators can use the estimation results to reliably select the model with the lowest risk.  ( 2 min )
    Temporal Context Mining for Learned Video Compression. (arXiv:2111.13850v2 [cs.CV] UPDATED)
    We address end-to-end learned video compression with a special focus on better learning and utilizing temporal contexts. For temporal context mining, we propose to store not only the previously reconstructed frames, but also the propagated features into the generalized decoded picture buffer. From the stored propagated features, we propose to learn multi-scale temporal contexts, and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder. Our scheme discards the parallelization-unfriendly auto-regressive entropy model to pursue a more practical decoding time. We compare our scheme with x264 and x265 (representing industrial software for H.264 and H.265, respectively) as well as the official reference software for H.264, H.265, and H.266 (JM, HM, and VTM, respectively). When intra period is 32 and oriented to PSNR, our scheme outperforms H.265--HM by 14.4% bit rate saving; when oriented to MS-SSIM, our scheme outperforms H.266--VTM by 21.1% bit rate saving.  ( 2 min )
    Complexity-Based Prompting for Multi-Step Reasoning. (arXiv:2210.00720v2 [cs.CL] UPDATED)
    We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.  ( 2 min )
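    The two selection rules in the abstract above (pick the most complex examples as prompts, then vote only among the most complex sampled chains) can be sketched as follows. Counting non-empty lines as reasoning steps is an assumption of this sketch, and the function names and toy data are hypothetical.

```python
from collections import Counter

def count_steps(chain: str) -> int:
    """Approximate reasoning complexity as the number of non-empty lines."""
    return sum(1 for line in chain.splitlines() if line.strip())

def select_prompts(examples, n_shots):
    """Keep the n_shots annotated examples with the most reasoning steps."""
    return sorted(examples, key=lambda ex: count_steps(ex["chain"]), reverse=True)[:n_shots]

def complexity_vote(sampled, top_k):
    """Majority answer among the top_k most complex sampled (chain, answer) pairs."""
    most_complex = sorted(sampled, key=lambda s: count_steps(s[0]), reverse=True)[:top_k]
    return Counter(answer for _, answer in most_complex).most_common(1)[0][0]

# Toy annotated examples (placeholder text stands in for real reasoning chains).
examples = [
    {"question": "q1", "chain": "step\nstep"},
    {"question": "q2", "chain": "step\nstep\nstep\nstep"},
    {"question": "q3", "chain": "step"},
]
prompts = select_prompts(examples, n_shots=2)

# Toy sampled outputs: (reasoning chain, final answer).
sampled = [
    ("a\nb\nc", "8"),
    ("a\nb\nc\nd", "8"),
    ("a", "7"),
    ("a\nb", "7"),
    ("a\nb\nc", "9"),
]
answer = complexity_vote(sampled, top_k=3)
```

    Here the two one- and two-step chains that answer "7" are excluded before voting, so the vote is taken over the answers "8", "8", and "9".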
    Graph Neural Networks Intersect Probabilistic Graphical Models: A Survey. (arXiv:2206.06089v3 [cs.AI] UPDATED)
    Graphs are a powerful data structure to represent relational data and are widely used to describe complex real-world data structures. Probabilistic Graphical Models (PGMs) have been well-developed in the past years to mathematically model real-world scenarios in compact graphical representations of distributions of variables. Graph Neural Networks (GNNs) are new inference methods developed in recent years and are attracting growing attention due to their effectiveness and flexibility in solving inference and learning problems over graph-structured data. These two powerful approaches have different advantages in capturing relations from observations and how they conduct message passing, and they can benefit each other in various tasks. In this survey, we broadly study the intersection of GNNs and PGMs. Specifically, we first discuss how GNNs can benefit from learning structured representations in PGMs, generate explainable predictions by PGMs, and how PGMs can infer object relationships. Then we discuss how GNNs are implemented in PGMs for more efficient inference and structure learning. In the end, we summarize the benchmark datasets used in recent studies and discuss promising future directions.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    ClusterFuG: Clustering Fully connected Graphs by Multicut. (arXiv:2301.12159v1 [cs.CV])
    We propose a graph clustering formulation based on multicut (a.k.a. weighted correlation clustering) on the complete graph. Our formulation does not need specification of the graph topology as in the original sparse formulation of multicut, making our approach simpler and potentially better performing. In contrast to unweighted correlation clustering we allow for a more expressive weighted cost structure. In dense multicut, the clustering objective is given in a factorized form as inner products of node feature vectors. This allows for an efficient formulation and inference in contrast to multicut/weighted correlation clustering, which has at least quadratic representation and computation complexity when working on the complete graph. We show how to rewrite classical greedy algorithms for multicut in our dense setting and how to modify them for greater efficiency and solution quality. In particular, our algorithms scale to graphs with tens of thousands of nodes. Empirical evidence on instance segmentation on Cityscapes and clustering of ImageNet datasets shows the merits of our approach.  ( 2 min )
    Masked Contrastive Learning for Anomaly Detection. (arXiv:2105.08793v2 [cs.LG] UPDATED)
    Detecting anomalies is a fundamental aspect of safety-critical software systems, yet it remains a long-standing problem. Numerous lines of work have been proposed to address it and have demonstrated their efficiency. In particular, methods based on self-supervised learning are attracting interest due to their capability of learning diverse representations without additional labels. Among self-supervised learning tactics, contrastive learning is one specific framework that has shown its superiority in various fields, including anomaly detection. However, the primary objective of contrastive learning is to learn task-agnostic features without any labels, which is not entirely suited to discerning anomalies. In this paper, we propose a task-specific variant of contrastive learning named masked contrastive learning, which is better suited to anomaly detection. Moreover, we propose a new inference method dubbed self-ensemble inference that further boosts performance by leveraging the ability learned through auxiliary self-supervision tasks. By combining our models, we can outperform previous state-of-the-art methods by a significant margin on various benchmark datasets.  ( 2 min )
    Protein Representation Learning by Geometric Structure Pretraining. (arXiv:2203.06125v5 [cs.LG] UPDATED)
    Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.  ( 2 min )
    CyclicFL: A Cyclic Model Pre-Training Approach to Efficient Federated Learning. (arXiv:2301.12193v1 [cs.LG])
    Since random initial models in Federated Learning (FL) can easily result in unregulated Stochastic Gradient Descent (SGD) processes, existing FL methods greatly suffer from both slow convergence and poor accuracy, especially for non-IID scenarios. To address this problem, we propose a novel FL method named CyclicFL, which can quickly derive effective initial models to guide the SGD processes, thus improving the overall FL training performance. Based on the concept of Continual Learning (CL), we prove that CyclicFL approximates existing centralized pre-training methods in terms of classification and prediction performance. Meanwhile, we formally analyze the significance of data consistency between the pre-training and training stages of CyclicFL, showing the limited Lipschitzness of loss for the models pre-trained by CyclicFL. Unlike traditional centralized pre-training methods that require public proxy data, CyclicFL pre-trains initial models on selected clients cyclically without exposing their local data. Therefore, it can be easily integrated into any security-critical FL method. Comprehensive experimental results show that CyclicFL can not only improve the classification accuracy by up to 16.21%, but also significantly accelerate the overall FL training processes.  ( 2 min )
    Norm-based Generalization Bounds for Compositionally Sparse Neural Networks. (arXiv:2301.12033v1 [cs.LG])
    In this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons. As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Due to the fact that the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to stochastic gradient method for non-convex problems or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and the pendulum problem which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Sparse Oblique Decision Trees: A Tool to Understand and Manipulate Neural Net Features. (arXiv:2104.02922v2 [cs.LG] UPDATED)
    The widespread deployment of deep nets in practical applications has led to a growing desire to understand how and why such black-box methods perform prediction. Much work has focused on understanding what part of the input pattern (an image, say) is responsible for a particular class being predicted, and how the input may be manipulated to predict a different class. We focus instead on understanding which of the internal features computed by the neural net are responsible for a particular class. We achieve this by mimicking part of the neural net with an oblique decision tree having sparse weight vectors at the decision nodes. Using the recently proposed Tree Alternating Optimization (TAO) algorithm, we are able to learn trees that are both highly accurate and interpretable. Such trees can faithfully mimic the part of the neural net they replaced, and hence they can provide insights into the deep net black box. Further, we show we can easily manipulate the neural net features in order to make the net predict, or not predict, a given class, thus showing that it is possible to carry out adversarial attacks at the level of the features. These insights and manipulations apply globally to the entire training and test set, not just at a local (single-instance) level. We demonstrate this robustly in the MNIST and ImageNet datasets with LeNet5 and VGG networks.  ( 2 min )
    Off-Policy Evaluation in Partially Observed Markov Decision Processes under Sequential Ignorability. (arXiv:2110.12343v3 [cs.LG] UPDATED)
    We consider off-policy evaluation of dynamic treatment rules under sequential ignorability, given an assumption that the underlying system can be modeled as a partially observed Markov decision process (POMDP). We propose an estimator, partial history importance weighting, and show that it can consistently estimate the stationary mean rewards of a target policy given long enough draws from the behavior policy. We provide an upper bound on its error that decays polynomially in the number of observations (i.e., the number of trajectories times their length), with an exponent that depends on the overlap of the target and behavior policies, and on the mixing time of the underlying system. Furthermore, we show that this rate of convergence is minimax given only our assumptions on mixing and overlap. Our results establish that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes, but strictly easier than model-free off-policy evaluation.  ( 2 min )
    Pragmatic Fairness: Developing Policies with Outcome Disparity Control. (arXiv:2301.12278v1 [cs.LG])
    We introduce a causal framework for designing optimal policies that satisfy fairness constraints. We take a pragmatic approach asking what we can do with an action space available to us and only with access to historical data. We propose two different fairness constraints: a moderation breaking constraint which aims at blocking moderation paths from the action and sensitive attribute to the outcome, and by that at reducing disparity in outcome levels as much as the provided action space permits; and an equal benefit constraint which aims at distributing gain from the new and maximized policy equally across sensitive attribute levels, and thus at keeping pre-existing preferential treatment in place or avoiding the introduction of new disparity. We introduce practical methods for implementing the constraints and illustrate their uses on experiments with semi-synthetic models.  ( 2 min )
    Heterogeneous Datasets for Federated Survival Analysis Simulation. (arXiv:2301.12166v1 [cs.LG])
    Survival analysis studies time-modeling techniques for an event of interest occurring for a population. Survival analysis has found widespread applications in healthcare, engineering, and social sciences. However, the data needed to train survival models are often distributed, incomplete, censored, and confidential. In this context, federated learning can be exploited to tremendously improve the quality of the models trained on distributed data while preserving user privacy. However, federated survival analysis is still in its early development, and there is no common benchmarking dataset to test federated survival models. This work proposes a novel technique for constructing realistic heterogeneous datasets by starting from existing non-federated datasets in a reproducible way. Specifically, we provide two novel dataset-splitting algorithms based on the Dirichlet distribution to assign each data sample to a carefully chosen client: quantity-skewed splitting and label-skewed splitting. Furthermore, these algorithms allow for obtaining different levels of heterogeneity by changing a single hyperparameter. Finally, numerical experiments provide a quantitative evaluation of the heterogeneity level using log-rank tests and a qualitative analysis of the generated splits. The implementation of the proposed methods is publicly available to support reproducibility and to encourage common practices for simulating federated environments for survival analysis.  ( 2 min )
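To illustrate how a single Dirichlet hyperparameter can control heterogeneity, here is a generic sketch of a quantity-skewed split using only the Python standard library. It is not the paper's algorithm: the function names, the per-sample assignment rule, and the symmetric concentration parameter `alpha` are assumptions made for illustration, and label-skewed splitting is not covered.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories.

    Uses the standard construction: normalize k independent Gamma(alpha, 1)
    draws. Very small alpha values can underflow to zero and are not handled.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def quantity_skewed_split(n_samples, n_clients, alpha, seed=0):
    """Assign sample indices to clients with Dirichlet-distributed shares.

    Smaller alpha tends to give more uneven client sizes; larger alpha
    approaches an even split.
    """
    rng = random.Random(seed)
    shares = sample_dirichlet(alpha, n_clients, rng)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    clients = [[] for _ in range(n_clients)]
    # Each sample independently picks a client according to the shares.
    for idx in indices:
        client = rng.choices(range(n_clients), weights=shares, k=1)[0]
        clients[client].append(idx)
    return clients


clients = quantity_skewed_split(n_samples=1000, n_clients=5, alpha=0.5)
print([len(c) for c in clients])
```

Fixing `seed` keeps a split reproducible, so the same partition can be regenerated for different models.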
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    A Dependable Hybrid Machine Learning Model for Network Intrusion Detection. (arXiv:2212.04546v2 [cs.CR] UPDATED)
    Network intrusion detection systems (NIDSs) play an important role in computer network security. There are several detection mechanisms where anomaly-based automated detection outperforms others significantly. Amid the sophistication and growing number of attacks, dealing with large amounts of data is a recognized issue in the development of anomaly-based NIDS. However, do current models meet the needs of today's networks in terms of required accuracy and dependability? In this research, we propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability. Our proposed method ensures efficient pre-processing by combining SMOTE for data balancing and XGBoost for feature selection. We compared our developed method to various machine learning and deep learning algorithms to find a more efficient algorithm to implement in the pipeline. Furthermore, we chose the most effective model for network intrusion based on a set of benchmarked performance analysis criteria. Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022, with an accuracy of 99.99% and 100% for KDDCUP'99 and CIC-MalMem-2022, respectively, and no overfitting or Type-1 and Type-2 issues.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or a deep neural network to model complex treatment-covariate interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases. (arXiv:2301.12017v1 [cs.CL])
    Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency while preserving model accuracy, it remains unclear whether we can leverage INT4 (which doubles peak hardware throughput) to achieve further latency improvement. In this work, we fully investigate the feasibility of using INT4 quantization for language models, and show that using INT4 introduces no or negligible accuracy degradation for encoder-only and encoder-decoder models, but causes a significant accuracy drop for decoder-only models. To materialize the performance gain using INT4, we develop a highly-optimized end-to-end INT4 encoder inference pipeline supporting different quantization strategies. Our INT4 pipeline is $8.5\times$ faster for latency-oriented scenarios and up to $3\times$ for throughput-oriented scenarios compared to the inference of FP16, and improves the SOTA BERT INT8 performance from FasterTransformer by up to $1.7\times$. We also provide insights into the failure cases when applying INT4 to decoder-only models, and further explore the compatibility of INT4 quantization with other compression techniques, like pruning and layer reduction.  ( 2 min )
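For readers unfamiliar with what INT4 quantization does to a tensor, the following is a generic sketch of symmetric, per-tensor uniform quantization to the signed 4-bit range [-8, 7], plus packing two codes per byte. It is a textbook-style illustration under assumed names (`quantize_int4`, `dequantize`, `pack_int4`), not the paper's optimized pipeline or any of its quantization strategies.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization of floats to integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7 so the range is symmetric around zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT4 codes back to approximate float values."""
    return [x * scale for x in q]


def pack_int4(q):
    """Pack pairs of INT4 codes into bytes (low nibble first, two's complement)."""
    if len(q) % 2:
        q = q + [0]  # pad with a zero code so every byte holds two values
    return bytes(((hi & 0xF) << 4) | (lo & 0xF) for lo, hi in zip(q[0::2], q[1::2]))


weights = [7.0, -3.0, 0.0, 1.0]
codes, scale = quantize_int4(weights)
print(codes, scale, pack_int4(codes).hex())
```

Packing halves the storage relative to INT8, and the rounding error per value is bounded by half the scale, which is why outliers that inflate `max_abs` hurt accuracy.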
    On the Lipschitz Constant of Deep Networks and Double Descent. (arXiv:2301.12309v1 [cs.LG])
    Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors -- namely loss landscape curvature and distance of parameters from initialization -- respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novel insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.  ( 2 min )
    Continual Graph Learning: A Survey. (arXiv:2301.12230v1 [cs.LG])
    Research on continual learning (CL) mainly focuses on data represented in the Euclidean space, while research on graph-structured data is scarce. Furthermore, most graph learning models are tailored for static graphs. However, graphs usually evolve continually in the real world. Catastrophic forgetting also emerges in graph learning models when being trained incrementally. This leads to the need to develop robust, effective and efficient continual graph learning approaches. Continual graph learning (CGL) is an emerging area aiming to realize continual learning on graph-structured data. This survey is written to shed light on this emerging area. It introduces the basic concepts of CGL and highlights two unique challenges brought by graphs. Then it reviews and categorizes recent state-of-the-art approaches, analyzing their strategies to tackle the unique challenges in CGL. Besides, it discusses the main concerns in each family of CGL methods, offering potential solutions. Finally, it explores the open issues and potential applications of CGL.  ( 2 min )
    Zero-shot causal learning. (arXiv:2301.12292v1 [cs.LG])
    Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. However, most existing causal methods cannot generalize to predicting the effects of previously unseen interventions (e.g., a newly invented drug), because they require data for individuals who received the intervention. Here, we consider zero-shot causal learning: predicting the personalized effects of novel, previously unseen interventions. To tackle this problem, we propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. Rather than training a separate model for each intervention, CaML trains a single meta-model across thousands of tasks, each constructed by sampling an intervention and individuals who either did or did not receive it. By leveraging both intervention information (e.g., a drug's attributes) and individual features (e.g., a patient's history), CaML is able to predict the personalized effects of unseen interventions. Experimental results on real-world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, CaML zero-shot predictions outperform even strong baselines which have direct access to data of considered target interventions.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of $\textit{iterative diffusion optimisation}$ techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel $\textit{log-variance}$ divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Turbulence control in plane Couette flow using low-dimensional neural ODE-based models and deep reinforcement learning. (arXiv:2301.12098v1 [physics.flu-dyn])
    The high dimensionality and complex dynamics of turbulent flows remain an obstacle to the discovery and implementation of control strategies. Deep reinforcement learning (RL) is a promising avenue for overcoming these obstacles, but requires a training phase in which the RL agent iteratively interacts with the flow environment to learn a control policy, which can be prohibitively expensive when the environment involves slow experiments or large-scale simulations. We overcome this challenge using a framework we call "DManD-RL" (data-driven manifold dynamics-RL), which generates a data-driven low-dimensional model of our system that we use for RL training. With this approach, we seek to minimize drag in a direct numerical simulation (DNS) of a turbulent minimal flow unit of plane Couette flow at Re=400 using two slot jets on one wall. We obtain, from DNS data with $\mathcal{O}(10^5)$ degrees of freedom, a 25-dimensional DManD model of the dynamics by combining an autoencoder and neural ordinary differential equation. Using this model as the environment, we train an RL control agent, yielding a 440-fold speedup over training on the DNS, with equivalent control performance. The agent learns a policy that laminarizes 84% of unseen DNS test trajectories within 900 time units, significantly outperforming classical opposition control (58%), despite the actuation authority being much more restricted. The agent often achieves laminarization through a counterintuitive strategy that drives the formation of two low-speed streaks, with a spanwise wavelength that is too small to be self-sustaining. The agent demonstrates the same performance when we limit observations to wall shear rate.  ( 2 min )
    Do Embodied Agents Dream of Pixelated Sheep?: Embodied Decision Making using Language Guided World Modelling. (arXiv:2301.12050v1 [cs.LG])
    Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world, which makes learning complex tasks with sparse rewards difficult. If initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that is tested and verified during exploration, to improve sample efficiency in embodied RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM on the basis of its experiences. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.  ( 2 min )
    Harnessing the Power of Decision Trees to Detect IoT Malware. (arXiv:2301.12039v1 [cs.CR])
    Because of their simple installation, connectivity, and ability to operate autonomously, Internet of Things (IoT) devices are susceptible to malware attacks. As IoT devices have become more prevalent, they have become the most tempting targets for malware. Weak, guessable, or hard-coded passwords and a lack of security measures contribute to these vulnerabilities, along with insecure network connections and outdated update procedures. Current methods for understanding IoT malware, which rely on static analysis, are ineffective. Deep learning has made great strides in recent years thanks to its tremendous data mining, learning, and expression capabilities, and cybersecurity has likewise enjoyed tremendous growth. As a result, malware analysts will not have to spend as much time analyzing malware. In this paper, we propose a novel detection and analysis method that harnesses the power and simplicity of decision trees. The experiments are conducted using MaleVis, a publicly available real-world dataset. Based on the results, we show that our proposed approach outperforms existing state-of-the-art solutions in that it achieves 97.23% precision and 95.89% recall in terms of detection and classification. It also attains a specificity of 96.58%, an F1-score of 96.40%, and an accuracy of 96.43%.  ( 2 min )
    Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction. (arXiv:2301.12068v1 [cs.LG])
    Pre-training methods on proteins have recently gained interest, leveraging either protein sequences or structures, while modeling their joint energy landscape is largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. The source code will be released upon acceptance.  ( 2 min )
    On the Feasibility of Machine Learning Augmented Magnetic Resonance for Point-of-Care Identification of Disease. (arXiv:2301.11962v1 [cs.LG])
    Early detection of many life-threatening diseases (e.g., prostate and breast cancer) within at-risk populations can improve clinical outcomes and reduce the cost of care. While numerous disease-specific "screening" tests that are closer to Point-of-Care (POC) are in use for this task, their low specificity results in unnecessary biopsies, leading to avoidable patient trauma and wasteful healthcare spending. On the other hand, despite the high accuracy of Magnetic Resonance (MR) imaging in disease diagnosis, it is not used as a POC disease identification tool because of poor accessibility. The poor accessibility of MR stems from the requirement to reconstruct high-fidelity images, as it necessitates a lengthy and complex process of acquiring large quantities of high-quality k-space measurements. In this study we explore the feasibility of an ML-augmented MR pipeline that directly infers the disease, sidestepping the image reconstruction process. We hypothesise that the disease classification task can be solved using a very small tailored subset of k-space data, compared to image reconstruction. Towards that end, we propose a method that performs two tasks: 1) identifies a subset of the k-space that maximizes disease identification accuracy, and 2) infers the disease directly using the identified k-space subset, bypassing the image reconstruction step. We validate our hypothesis by measuring the performance of the proposed system across multiple diseases and anatomies. We show that comparable performance to image-based classifiers, trained on images reconstructed with full k-space data, can be achieved using small quantities of data: 8% of the data for detecting multiple abnormalities in prostate and brain scans, and 5% of the data for knee abnormalities. To better understand the proposed approach and instigate future research, we provide an extensive analysis and release code.  ( 2 min )
    RCsearcher: Reaction Center Identification in Retrosynthesis via Deep Q-Learning. (arXiv:2301.12071v1 [cs.LG])
    The reaction center consists of atoms in the product whose local properties are not identical to the corresponding atoms in the reactants. Prior studies on reaction center identification focus mainly on semi-templated retrosynthesis methods. Moreover, they are limited to single reaction center identification. In reality, however, many reaction centers comprise multiple bonds or atoms. We refer to these as multiple reaction centers. This paper presents RCsearcher, a unified framework for single and multiple reaction center identification that combines the advantages of the graph neural network and deep reinforcement learning. The critical insight in this framework is that the single or multiple reaction center must be a node-induced subgraph of the molecular product graph. At each step, it considers choosing one node in the molecular product graph and adding it to the explored node-induced subgraph as an action. Comprehensive experiments demonstrate that RCsearcher consistently outperforms other baselines and can extrapolate the reaction center patterns that have not appeared in the training set. Ablation experiments verify the effectiveness of individual components, including the beam search and one-hop constraint of action space.  ( 2 min )
    Predicting Students' Exam Scores Using Physiological Signals. (arXiv:2301.12051v1 [cs.LG])
    While acute stress has been shown to have both positive and negative effects on performance, not much is known about the impact of stress on students' grades during examinations. To answer this question, we examined whether a correlation could be found between physiological stress signals and exam performance. We conducted this study using multiple physiological signals of ten undergraduate students over three different exams. The study focused on three signals, i.e., skin temperature, heart rate, and electrodermal activity. We extracted statistics as features and fed them into a variety of binary classifiers to predict relatively higher or lower grades. Experimental results showed up to 0.81 ROC-AUC with the k-nearest neighbor algorithm among various machine learning algorithms.  ( 2 min )
    Restricted Orthogonal Gradient Projection for Continual Learning. (arXiv:2301.12131v1 [cs.LG])
    Continual learning aims to avoid catastrophic forgetting and effectively leverage learned experiences to master new knowledge. Existing gradient projection approaches impose hard constraints on the optimization space for new tasks to minimize interference, which simultaneously hinders forward knowledge transfer. To address this issue, recent methods reuse frozen parameters with a growing network, resulting in high computational costs. Thus, it remains an open question whether we can improve forward knowledge transfer for gradient projection approaches using a fixed network architecture. In this work, we propose the Restricted Orthogonal Gradient prOjection (ROGO) framework. The basic idea is to adopt a restricted orthogonal constraint that allows parameters to be optimized in a direction oblique to the whole frozen space, facilitating forward knowledge transfer while consolidating previous knowledge. Our framework requires neither data buffers nor extra parameters. Extensive experiments have demonstrated the superiority of our framework over several strong baselines. We also provide theoretical guarantees for our relaxing strategy.  ( 2 min )
    ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts. (arXiv:2301.12040v1 [q-bio.BM])
    Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM's original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.  ( 2 min )
    A Memory Efficient Deep Reinforcement Learning Approach For Snake Game Autonomous Agents. (arXiv:2301.11977v1 [cs.AI])
    To perform well, Deep Reinforcement Learning (DRL) methods require significant memory resources and computational time. Also, sometimes these systems need additional environment information to achieve a good reward. However, for many applications and devices it is more important to reduce memory usage and computational time than to achieve the maximum reward. This paper presents a modified DRL method that performs reasonably well with compressed imagery data without requiring additional environment information and also uses less memory and time. We have designed a lightweight Convolutional Neural Network (CNN) with a variant of the Q-network that efficiently takes preprocessed image data as input and uses less memory. Furthermore, we use a simple reward mechanism and small experience replay memory so as to provide only the minimum necessary information. Our modified DRL method enables our autonomous agent to play Snake, a classical control game. The results show our model can achieve performance similar to that of other DRL methods.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Physics-informed Neural Network: The Effect of Reparameterization in Solving Differential Equations. (arXiv:2301.12118v1 [cs.LG])
    Differential equations are used to model and predict the behaviour of complex systems in a wide range of fields, and the ability to solve them is an important asset for understanding and predicting the behaviour of these systems. Complicated physics mostly involves difficult differential equations, which are hard to solve analytically. In recent years, physics-informed neural networks have been shown to perform very well in solving systems with various differential equations. The main ways to approximate differential equations are through penalty functions and reparameterization. Most researchers use penalty functions rather than reparameterization due to the complexity of implementing reparameterization. In this study, we quantitatively compare physics-informed neural network models with and without reparameterization using the approximation error. The performance of reparameterization is demonstrated on two benchmark mechanical engineering problems, a one-dimensional bar problem and a two-dimensional bending beam problem. Our results show that when dealing with complex differential equations, applying reparameterization results in a lower approximation error.  ( 2 min )
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )
    Meta Temporal Point Processes. (arXiv:2301.12023v1 [cs.LG])
    A temporal point process (TPP) is a stochastic process whose realization is a sequence of discrete events in time. Recent work on TPPs models the process using a neural network in a supervised learning framework, where a training set is a collection of all the sequences. In this work, we propose to train TPPs in a meta learning framework, where each sequence is treated as a different task, via a novel framing of TPPs as neural processes (NPs). We introduce context sets to model TPPs as an instantiation of NPs. Motivated by attentive NP, we also introduce local history matching to help learn more informative features. We demonstrate the potential of the proposed method on popular public benchmark datasets and tasks, and compare with state-of-the-art TPP methods.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to roll out simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with the recurrent state alignment (RSA), which is designed to capture as much information as possible from the limited training data by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we also introduce the branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of the VLBM is evaluated on the deep OPE (DOPE) benchmark, in which the training trajectories are designed to result in varied coverage of the state-action space. We show that the VLBM outperforms existing state-of-the-art OPE methods in general.  ( 2 min )
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA) that deeply aligns the text prompts with intermediate local features of the frozen image encoder layers, which significantly boosts the zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance with 7x fewer parameters than previous SOTA approaches.  ( 2 min )
    Folded Optimization for End-to-End Model-Based Learning. (arXiv:2301.12047v1 [cs.LG])
    The integration of constrained optimization models as components in deep networks has led to promising advances in both these domains. A primary challenge in this setting is backpropagation through the optimization mapping, which typically lacks a closed form. A common approach is unrolling, which relies on automatic differentiation through the operations of an iterative solver. While flexible and general, unrolling can encounter accuracy and efficiency issues in practice. These issues can be avoided by differentiating the optimization mapping analytically, but current frameworks impose rigid requirements on the optimization problem's form. This paper provides theoretical insights into the backpropagation of unrolled optimizers, which lead to a system for generating equivalent but efficiently solvable analytical models. Additionally, it proposes a unifying view of unrolling and analytical differentiation through constrained optimization mappings. Experiments over various structured prediction and decision-focused learning tasks illustrate the potential of the approach both computationally and in terms of enhanced expressiveness.  ( 2 min )
    Information loss from dimensionality reduction in 5D-Gaussian spectral data. (arXiv:2301.11923v1 [physics.data-an])
    Understanding the loss of information in spectral analytics is a crucial first step towards finding root causes for failures and uncertainties when using spectral data in artificial intelligence models built from modern complex data science applications. Here, we show from a very simple entropy model analysis with quantum statistics of spectral data that the relative loss of information from dimensionality reduction due to projection of an initial five-dimensional state onto two-dimensional diagrams is less than one percent in the parameter range of small data sets with sample sizes on the order of a few hundred data samples. From our analysis, we also conclude that the density and expectation value of the entropy probability distribution increase with the sample number and sample size using artificial data models derived from random sampling Monte-Carlo simulation methods.  ( 2 min )
    On the Connection Between MPNN and Graph Transformer. (arXiv:2301.11956v1 [cs.LG])
    Graph Transformer (GT) has recently emerged as a new paradigm of graph learning algorithms, outperforming the previously popular Message Passing Neural Network (MPNN) on multiple benchmarks. Previous work (Kim et al., 2022) shows that with proper position embedding, GT can approximate MPNN arbitrarily well, implying that GT is at least as powerful as MPNN. In this paper, we study the inverse connection and show that MPNN with virtual node (VN), a commonly used heuristic with little theoretical understanding, is powerful enough to arbitrarily approximate the self-attention layer of GT. In particular, we first show that if we consider one type of linear transformer, the so-called Performer/Linear Transformer (Choromanski et al., 2020; Katharopoulos et al., 2020), then MPNN + VN with only O(1) depth and O(1) width can approximate a self-attention layer in Performer/Linear Transformer. Next, via a connection between MPNN + VN and DeepSets, we prove that MPNN + VN with O(n^d) width and O(1) depth can approximate the self-attention layer arbitrarily well, where d is the input feature dimension. Lastly, under some assumptions, we provide an explicit construction of MPNN + VN with O(1) width and O(n) depth approximating the self-attention layer in GT arbitrarily well. On the empirical side, we demonstrate that 1) MPNN + VN is a surprisingly strong baseline, outperforming GT on the recently proposed Long Range Graph Benchmark (LRGB) dataset, 2) our MPNN + VN improves over earlier implementations on a wide range of OGB datasets, and 3) MPNN + VN outperforms Linear Transformer and MPNN on the climate modeling task.  ( 2 min )
    Neural Relation Graph for Identifying Problematic Data. (arXiv:2301.12321v1 [cs.LG])
    Diagnosing and cleaning datasets are crucial for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is difficult due to the presence of complex issues, such as label errors or under-representation of certain types. In this paper, we propose a novel approach for identifying problematic data by utilizing a largely ignored source of information: a relational structure of data in the feature-embedded space. We develop an efficient algorithm for detecting label errors and outlier data points based on the relational graph structure of the dataset. We further introduce a visualization tool for contextualizing data points, which can serve as an effective tool for interactively diagnosing datasets. We evaluate label error and out-of-distribution detection performances on large-scale image and language domain tasks, including ImageNet and GLUE benchmarks, and demonstrate the effectiveness of our approach for debugging datasets and building robust machine learning systems.  ( 2 min )
    Accelerated Training of Physics-Informed Neural Networks (PINNs) using Meshless Discretizations. (arXiv:2205.09332v6 [cs.LG] UPDATED)
    We present a new technique for the accelerated training of physics-informed neural networks (PINNs): discretely-trained PINNs (DT-PINNs). The repeated computation of partial derivative terms in the PINN loss functions via automatic differentiation during training is known to be computationally expensive, especially for higher-order derivatives. DT-PINNs are trained by replacing these exact spatial derivatives with high-order accurate numerical discretizations computed using meshless radial basis function-finite differences (RBF-FD) and applied via sparse-matrix vector multiplication. The use of RBF-FD allows for DT-PINNs to be trained even on point cloud samples placed on irregular domain geometries. Additionally, though traditional PINNs (vanilla-PINNs) are typically stored and trained in 32-bit floating-point (fp32) on the GPU, we show that for DT-PINNs, using fp64 on the GPU leads to significantly faster training times than fp32 vanilla-PINNs with comparable accuracy. We demonstrate the efficiency and accuracy of DT-PINNs via a series of experiments. First, we explore the effect of network depth on both numerical and automatic differentiation of a neural network with random weights and show that RBF-FD approximations of third-order accuracy and above are more efficient while being sufficiently accurate. We then compare the DT-PINNs to vanilla-PINNs on both linear and nonlinear Poisson equations and show that DT-PINNs achieve similar losses with 2-4x faster training times on a consumer GPU. Finally, we also demonstrate that similar results can be obtained for the PINN solution to the heat equation (a space-time problem) by discretizing the spatial derivatives using RBF-FD and using automatic differentiation for the temporal derivative. Our results show that fp64 DT-PINNs offer a superior cost-accuracy profile to fp32 vanilla-PINNs.  ( 3 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
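    For context, the Euclidean Sliced-Wasserstein distance that the abstract above takes as its starting point can be sketched as follows. This is only a minimal Monte Carlo illustration of the standard Euclidean SW (random one-dimensional projections, then the closed-form 1D solution obtained by sorting); it is not the spherical variant the paper proposes, and the function name, parameters, and the equal-sample-size assumption are illustrative choices, not taken from the paper.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the Euclidean sliced-Wasserstein-p distance.

    Assumes x and y are point clouds of shape (n, d) with the same n,
    each point carrying equal mass (an assumption made for simplicity).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Draw random directions uniformly on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both clouds onto every direction: shape (n, n_projections).
    x_proj = x @ theta.T
    y_proj = y @ theta.T
    # In 1D the optimal coupling matches sorted samples to sorted samples.
    x_sorted = np.sort(x_proj, axis=0)
    y_sorted = np.sort(y_proj, axis=0)
    # Average |difference|^p over samples and projections, then take the p-th root.
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p))
```

    The sort is what makes each one-dimensional problem cheap: every projection costs O(n log n) instead of solving a full transport problem.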
    Informational Diversity and Affinity Bias in Team Growth Dynamics. (arXiv:2301.12091v1 [cs.GT])
    Prior work has provided strong evidence that, within organizational settings, teams that bring a diversity of information and perspectives to a task are more effective than teams that do not. If this form of informational diversity confers performance advantages, why do we often see largely homogeneous teams in practice? One canonical argument is that the benefits of informational diversity are in tension with affinity bias. To better understand the impact of this tension on the makeup of teams, we analyze a sequential model of team formation in which individuals care about their team's performance (captured in terms of accurately predicting some future outcome based on a set of features) but experience a cost as a result of interacting with teammates who use different approaches to the prediction task. Our analysis of this simple model reveals a set of subtle behaviors that team-growth dynamics can exhibit: (i) from certain initial team compositions, they can make progress toward better performance but then get stuck partway to optimally diverse teams; while (ii) from other initial compositions, they can also move away from this optimal balance as the majority group tries to crowd out the opinions of the minority. The initial composition of the team can determine whether the dynamics will move toward or away from performance optimality, painting a path-dependent picture of inefficiencies in team compositions. Our results formalize a fundamental limitation of utility-based motivations to drive informational diversity in organizations and hint at interventions that may improve informational diversity and performance simultaneously.  ( 2 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Adversarial Networks and Machine Learning for File Classification. (arXiv:2301.11964v1 [cs.LG])
    Correctly identifying the type of file under examination is a critical part of a forensic investigation. The file type alone suggests the embedded content, such as a picture, video, manuscript, spreadsheet, etc. In cases where a system owner might desire to keep their files inaccessible or file type concealed, we propose using an adversarially-trained neural network to determine a file's true type even if the extension or file header is obfuscated to complicate its discovery. Our semi-supervised generative adversarial network (SGAN) achieved 97.6% accuracy in classifying files across 11 different types. We also compared our network against a traditional standalone neural network and three other machine learning algorithms. The adversarially-trained network proved to be the most precise file classifier, especially in scenarios with few supervised samples available. Our implementation of a file classifier using an SGAN is available on GitHub (https://ksaintg.github.io/SGAN-File-Classier).  ( 2 min )
    Neural Gas Network Image Features and Segmentation for Brain Tumor Detection Using Magnetic Resonance Imaging Data. (arXiv:2301.12176v1 [eess.IV])
    Accurate detection of brain tumors could save many lives, and increasing the accuracy of this binary classification by even a few percent is highly important. Neural Gas Networks (NGN) is a fast, unsupervised algorithm that can be used in data clustering, image pattern recognition, and image segmentation. In this research, we used the metaheuristic Firefly Algorithm (FA) for image contrast enhancement as pre-processing and NGN weights for feature extraction and segmentation of Magnetic Resonance Imaging (MRI) data on two brain tumor datasets from the Kaggle platform. Tumor classification is performed by Support Vector Machine (SVM) classification algorithms and compared with a deep learning technique plus other features in the training and test phases. Additionally, NGN tumor segmentation is evaluated against ground truth data with well-known performance metrics such as Accuracy, F-measure, Jaccard, and more, and compared with traditional segmentation techniques. The proposed method is fast and precise in both tumor classification and segmentation compared with other methods. The proposed method achieves a classification accuracy of 95.14% and a segmentation accuracy of 0.977.  ( 2 min )
    Byte Pair Encoding for Symbolic Music. (arXiv:2301.11975v1 [cs.LG])
    The symbolic music modality is nowadays mostly represented as discrete and used with sequential models such as Transformers for deep learning tasks. Recent research has focused on tokenization, i.e., the conversion of data into sequences of integers intelligible to such models. This can be achieved in many ways, as music can be composed of simultaneous tracks and of simultaneous notes with several attributes. Until now, the proposed tokenizations have been based on small vocabularies describing the note attributes and time events, resulting in fairly long token sequences. In this paper, we show how Byte Pair Encoding (BPE) can improve the results of deep learning models while improving their performance. We experiment on music generation and composer classification, study the impact of BPE on how models learn the embeddings, and show that it can help to increase their isotropy, i.e., the uniformity of the variance of their positions in the space.  ( 2 min )
    Arbitrarily Accurate Classification Applied to Specific Emitter Identification. (arXiv:2211.10379v2 [eess.SP] UPDATED)
    This article introduces a method of evaluating subsamples until any prescribed level of classification accuracy is attained, thus obtaining arbitrary accuracy. A logarithmic reduction in error rate is obtained with a linear increase in sample count. The technique is applied to specific emitter identification on a published dataset of physically recorded over-the-air signals from 16 ostensibly identical high-performance radios. The technique uses a multi-channel deep learning convolutional neural network acting on the bispectra of I/Q signal subsamples each consisting of 56 parts per million (ppm) of the original signal duration. High levels of accuracy are obtained with minimal computation time: in this application, each addition of eight samples decreases error by one order of magnitude.  ( 2 min )
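    The last sentence of the abstract above implies a simple back-of-the-envelope rule, sketched below: if every eight additional samples cut the error by a factor of ten, the error after n extra samples is roughly base_error * 10^(-n/8). The helper name and the example error rates are hypothetical and serve only to illustrate the arithmetic; they are not figures from the paper.

```python
import math


def subsamples_needed(base_error, target_error, samples_per_decade=8):
    """Extra subsamples needed to reduce base_error to target_error.

    Assumes the log-linear relationship stated in the abstract:
    each `samples_per_decade` additional samples divide the error by 10.
    """
    if not (0 < target_error <= base_error):
        raise ValueError("require 0 < target_error <= base_error")
    # Number of orders of magnitude by which the error must shrink.
    decades = math.log10(base_error / target_error)
    # Small tolerance guards against floating-point overshoot before rounding up.
    return max(0, math.ceil(decades * samples_per_decade - 1e-9))


# Hypothetical example: going from a 10% error rate to 0.1% is two decades.
print(subsamples_needed(0.1, 0.001))
```

    Under this reading, halving the target error costs only about 8 * log10(2), or roughly 2.4, additional samples, which is the sense in which the error reduction is logarithmic in the sample count.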
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample sizes.  ( 2 min )
    Skin Lesion Analysis: A Survey, Systematic Review, and Future Trends. (arXiv:2208.12232v2 [eess.IV] UPDATED)
    The Computer-aided Diagnosis or Detection (CAD) approach for skin lesion analysis is an emerging field of research that has the potential to alleviate the burden and cost of skin cancer screening. Researchers have recently indicated increasing interest in developing such CAD systems, with the intention of providing a user-friendly tool to dermatologists to reduce the challenges encountered or associated with manual inspection. This article aims to provide a comprehensive literature survey and review of a total of 594 publications (356 for skin lesion segmentation and 238 for skin lesion classification) published between 2011 and 2022. These articles are analyzed and summarized in a number of different ways to contribute vital information regarding the methods for the development of CAD systems. These ways include relevant and essential definitions and theories, input data (dataset utilization, preprocessing, augmentations, and fixing imbalance problems), method configuration (techniques, architectures, module frameworks, and losses), training tactics (hyperparameter settings), and evaluation criteria. We intend to investigate a variety of performance-enhancing approaches, including ensemble and post-processing. We also discuss these dimensions to reveal their current trends based on utilization frequencies. In addition, we highlight the primary difficulties associated with evaluating skin lesion segmentation and classification systems using minimal datasets, as well as the potential solutions to these difficulties. Findings, recommendations, and trends are disclosed to inform future research on developing an automated and robust CAD system for skin lesion analysis.  ( 2 min )
    Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions. (arXiv:2209.14390v4 [cs.LG] UPDATED)
    Decentralized learning over distributed datasets can have significantly different data distributions across the agents. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). This paper focuses on improving decentralized learning over non-IID data. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. Cross-gradients for a pair of neighboring agents are the derivatives of the model parameters of an agent with respect to the dataset of the other agent. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, model-variant cross-gradients (derivatives of the neighbors' parameters with respect to the local dataset), and data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints. Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by $32 \times$. We theoretically analyze the convergence rate of the proposed algorithm and demonstrate its efficiency over non-IID data sampled from various vision and language datasets. Our experiments demonstrate that NGC and CompNGC outperform (by $0-6\%$) the existing SoTA decentralized learning algorithm over non-IID data with significantly less compute and memory requirements. Further, our experiments show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by $1-35\%$ without additional communication cost.  ( 3 min )
    Principled Acceleration of Iterative Numerical Methods Using Machine Learning. (arXiv:2206.08594v2 [math.NA] UPDATED)
    Iterative methods are ubiquitous in large-scale scientific computing applications, and a number of approaches based on meta-learning have been recently proposed to accelerate them. However, a systematic study of these approaches and how they differ from meta-learning is lacking. In this paper, we propose a framework to analyze such learning-based acceleration approaches, where one can immediately identify a departure from classical meta-learning. We show that this departure may lead to arbitrary deterioration of model performance. Based on our analysis, we introduce a novel training method for learning-based acceleration of iterative methods. Furthermore, we theoretically prove that the proposed method improves upon the existing methods, and demonstrate its significant advantage and versatility through various numerical applications.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Supervision Complexity and its Role in Knowledge Distillation. (arXiv:2301.12245v1 [cs.LG])
    Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.  ( 2 min )
    Microstructural parameter estimation using spherical convolutional neural networks. (arXiv:2211.09887v2 [eess.IV] UPDATED)
    Diffusion-weighted magnetic resonance imaging is sensitive to the microstructural properties of brain tissue. However, estimating clinically and scientifically relevant microstructural properties from the measured signals remains a highly challenging inverse problem that deep learning may help solve. This study investigated if recently developed orientationally invariant spherical convolutional neural networks can improve microstructural parameter estimation. A spherical convolutional neural network was trained to predict the ground-truth parameter values from simulated noisy data and applied to imaging data acquired in a clinical setting to generate microstructural parameter maps. The spherical convolutional neural network was more accurate and less orientationally variant than the benchmark methods (multi-layer perceptrons and the spherical mean technique). Our results show that spherical convolutional neural networks can be a compelling alternative to predicting parameters from powder-averaged data (i.e., data averaged over the acquired diffusion encoding directions). While we focused on constrained two- and three-compartment models of neuronal tissue, the presented network and training pipeline are generalizable and can be used to estimate the parameters of other Gaussian compartment models.  ( 2 min )
    Flip Initial Features: Generalization of Neural Networks Under Sparse Features for Semi-supervised Node Classification. (arXiv:2211.15081v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been widely used under semi-supervised settings. Prior studies have mainly focused on finding appropriate graph filters (e.g., aggregation schemes) to generalize well for both homophilic and heterophilic graphs. Even though these approaches are essential and effective, they still suffer from the sparsity in initial node features inherent in the bag-of-words representation. This sparsity is common in semi-supervised learning, where the training samples often fail to cover all dimensions of the graph filters (hyperplanes), and it can precipitate over-fitting of specific dimensions in the first projection matrix. To deal with this problem, we suggest a simple and novel strategy: create additional space by flipping the initial features and hyperplane simultaneously. Training in both the original and the flip space can provide precise updates of learnable parameters. To the best of our knowledge, this is the first attempt that effectively moderates the overfitting problem in GNNs. Extensive experiments on real-world datasets demonstrate that the proposed technique improves the node classification accuracy up to 40.2 %.  ( 2 min )
    Neural Error Mitigation of Near-Term Quantum Simulations. (arXiv:2105.08086v2 [quant-ph] UPDATED)
    Near-term quantum computers provide a promising platform for finding ground states of quantum systems, which is an essential task in physics, chemistry, and materials science. Near-term approaches, however, are constrained by the effects of noise as well as the limited resources of near-term quantum hardware. We introduce "neural error mitigation," which uses neural networks to improve estimates of ground states and ground-state observables obtained using near-term quantum simulations. To demonstrate our method's broad applicability, we employ neural error mitigation to find the ground states of the H$_2$ and LiH molecular Hamiltonians, as well as the lattice Schwinger model, prepared via the variational quantum eigensolver (VQE). Our results show that neural error mitigation improves numerical and experimental VQE computations to yield low energy errors, high fidelities, and accurate estimations of more-complex observables like order parameters and entanglement entropy, without requiring additional quantum resources. Furthermore, neural error mitigation is agnostic with respect to the quantum state preparation algorithm used, the quantum hardware it is implemented on, and the particular noise channel affecting the experiment, contributing to its versatility as a tool for quantum simulation.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d. samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\mathcal{X}, \mu, c_{\mathcal{X}})$ and $(\mathcal{Y}, \nu, c_{\mathcal{Y}})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Cross-Subject Deep Transfer Models for Evoked Potentials in Brain-Computer Interface. (arXiv:2301.12322v1 [cs.LG])
    Brain Computer Interface (BCI) technologies have the potential to improve the lives of millions of people around the world, whether through assistive technologies or clinical diagnostic tools. Despite advancements in the field, however, at present consumer and clinical viability remains low. A key reason for this is that many of the existing BCI deployments require substantial data collection per end-user, which can be cumbersome, tedious, and error-prone to collect. We address this challenge via a deep learning model, which, when trained across sufficient data from multiple subjects, offers reasonable performance out-of-the-box, and can be customized to novel subjects via a transfer learning process. We demonstrate the fundamental viability of our approach by repurposing an older but well-curated electroencephalography (EEG) dataset and benchmarking against several common approaches/techniques. We then partition this dataset into a transfer learning benchmark and demonstrate that our approach significantly reduces data collection burden per-subject. This suggests that our model and methodology may yield improvements to BCI technologies and enhance their consumer/clinical viability.  ( 2 min )
    DALI: Dynamically Adjusted Label Importance for Noisy Partial Label Learning. (arXiv:2301.12077v1 [cs.CV])
    Noisy partial label learning (noisy PLL) is an important branch of weakly supervised learning. Unlike PLL, where the ground-truth label must reside in the candidate set, noisy PLL relaxes this constraint and allows the ground-truth label to be absent from the candidate set. To address this problem, existing works attempt to detect noisy samples and estimate the ground-truth label for each noisy sample. However, detection errors are inevitable, and these errors will accumulate during training and continuously affect model optimization. To address this challenge, we propose a novel framework for noisy PLL, called "Dynamically Adjusted Label Importance (DALI)". It aims to reduce the negative impact of detection errors by trading off the initial candidate set and model outputs with theoretical guarantees. Experimental results on multiple datasets demonstrate that DALI outperforms existing state-of-the-art approaches on noisy PLL. Our code will soon be publicly available.  ( 2 min )
    Gradient Shaping: Enhancing Backdoor Attack Against Reverse Engineering. (arXiv:2301.12318v1 [cs.CR])
    Most existing methods to detect backdoored machine learning (ML) models take one of two approaches: trigger inversion (a.k.a. reverse engineering) and weight analysis (a.k.a. model diagnosis). In particular, gradient-based trigger inversion is considered to be among the most effective backdoor detection techniques, as evidenced by the TrojAI competition, Trojan Detection Challenge and backdoorBench. However, little has been done to understand why this technique works so well and, more importantly, whether it raises the bar to the backdoor attack. In this paper, we report the first attempt to answer this question by analyzing the change rate of the backdoored model around its trigger-carrying inputs. Our study shows that existing attacks tend to inject a backdoor characterized by a low change rate around trigger-carrying inputs, which is easy to capture by gradient-based trigger inversion. In the meantime, we found that the low change rate is not necessary for a backdoor attack to succeed: we design a new attack enhancement called Gradient Shaping (GRASP), which follows the opposite direction of adversarial training to reduce the change rate of a backdoored model with regard to the trigger, without undermining its backdoor effect. Also, we provide a theoretical analysis to explain the effectiveness of this new technique and the fundamental weakness of gradient-based trigger inversion. Finally, we perform both theoretical and experimental analysis, showing that the GRASP enhancement does not reduce the effectiveness of the stealthy attacks against the backdoor detection methods based on weight analysis, as well as other backdoor mitigation methods without using detection.  ( 2 min )
    Deciphering the Projection Head: Representation Evaluation Self-supervised Learning. (arXiv:2301.12189v1 [cs.LG])
    Self-supervised learning (SSL) aims to learn intrinsic features without labels. Despite the diverse architectures of SSL methods, the projection head always plays an important role in improving the performance of the downstream task. In this work, we systematically investigate the role of the projection head in SSL. Specifically, the projection head targets the uniformity part of SSL, which pushes the dissimilar samples away from each other, thus enabling the encoder to focus on extracting semantic features. Based on this understanding, we propose a Representation Evaluation Design (RED) in SSL models in which a shortcut connection between the representation and the projection vectors is built. Extensive experiments with different architectures, including SimCLR, MoCo-V2, and SimSiam, on various datasets, demonstrate that the representation evaluation design can consistently improve the baseline models in the downstream tasks. The learned representation from the RED-SSL models shows superior robustness to unseen augmentations and out-of-distribution data.  ( 2 min )
    In-Distribution Barrier Functions: Self-Supervised Policy Filters that Avoid Out-of-Distribution States. (arXiv:2301.12012v1 [cs.RO])
    Learning-based control approaches have shown great promise in performing complex tasks directly from high-dimensional perception data for real robotic systems. Nonetheless, the learned controllers can behave unexpectedly if the trajectories of the system divert from the training data distribution, which can compromise safety. In this work, we propose a control filter that wraps any reference policy and effectively encourages the system to stay in-distribution with respect to offline-collected safe demonstrations. Our methodology is inspired by Control Barrier Functions (CBFs), which are model-based tools from the nonlinear control literature that can be used to construct minimally invasive safe policy filters. While existing methods based on CBFs require a known low-dimensional state representation, our proposed approach is directly applicable to systems that rely solely on high-dimensional visual observations by learning in a latent state-space. We demonstrate that our method is effective for two different visuomotor control tasks in simulation environments, including both top-down and egocentric view settings.  ( 2 min )
    Learning Optimal Features via Partial Invariance. (arXiv:2301.12067v1 [cs.LG])
    Learning models that are robust to test-time distribution shifts is a key concern in domain generalization, and in the wider context of their real-life applicability. Invariant Risk Minimization (IRM) is one particular framework that aims to learn deep invariant features from multiple domains and has subsequently led to further variants. A key assumption for the success of these methods requires that the underlying causal mechanisms/features remain invariant across domains and the true invariant features be sufficient to learn the optimal predictor. In practical problem settings, these assumptions are often not satisfied, which leads to IRM learning a sub-optimal predictor for that task. In this work, we propose the notion of partial invariance as a relaxation of the IRM framework. Under our problem setting, we first highlight the sub-optimality of the IRM solution. We then demonstrate how partitioning the training domains, assuming access to some meta-information about the domains, can help improve the performance of invariant models via partial invariance. Finally, we conduct several experiments, both in linear settings as well as with classification tasks in language and images with deep models, which verify our conclusions.  ( 2 min )
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Using uncertainty-aware machine learning models to study aerosol-cloud interactions. (arXiv:2301.11921v1 [physics.data-an])
    Aerosol-cloud interactions (ACI) include various effects that result from aerosols entering a cloud and affecting cloud properties. In general, an increase in aerosol concentration results in smaller droplet sizes, which leads to larger, brighter, longer-lasting clouds that reflect more sunlight and cool the Earth. The strength of the effect is, however, heterogeneous, meaning it depends on the surrounding environment, making ACI one of the most uncertain effects in our current climate models. In our work, we use causal machine learning to estimate ACI from satellite observations by reframing the problem as a treatment (aerosol) and outcome (change in droplet radius). We predict the causal effect of aerosol on clouds with uncertainty bounds depending on the unknown factors that may be influencing the impact of aerosol. Of the three climate models evaluated, we find that only one plausibly recreates the trend, lending more credence to its estimate of cooling due to ACI.  ( 2 min )
    Statistical whitening of neural populations with gain-modulating interneurons. (arXiv:2301.11955v1 [q-bio.NC])
    Statistical whitening transformations play a fundamental role in many computational systems, and may also play an important role in biological sensory systems. Individual neurons appear to rapidly and reversibly alter their input-output gains, approximately normalizing the variance of their responses. Populations of neurons appear to regulate their joint responses, reducing correlations between neural activities. It is natural to see whitening as the objective that guides these behaviors, but the mechanism for such joint changes is unknown, and direct adjustment of synaptic interactions would seem to be both too slow, and insufficiently reversible. Motivated by the extensive neuroscience literature on rapid gain modulation, we propose a recurrent network architecture in which joint whitening is achieved through modulation of gains within the circuit. Specifically, we derive an online statistical whitening algorithm that regulates the joint second-order statistics of a multi-dimensional input by adjusting the marginal variances of an overcomplete set of interneuron projections. The gains of these interneurons are adjusted individually, using only local signals, and feed back onto the primary neurons. The network converges to a state in which the responses of the primary neurons are whitened. We demonstrate through simulations that the behavior of the network is robust to poor conditioning or noise when the gains are sign-constrained, and can be generalized to achieve a form of local whitening in convolutional populations, such as those found throughout the visual or auditory system.  ( 2 min )
    Towards Learning Rubik's Cube with N-tuple-based Reinforcement Learning. (arXiv:2301.12167v1 [cs.LG])
    This work describes in detail how to learn and solve the Rubik's cube game (or puzzle) in the General Board Game (GBG) learning and playing framework. We cover the cube sizes 2x2x2 and 3x3x3. We describe in detail the cube's state representation and how to transform it with twists, whole-cube rotations and color transformations, and explain the use of symmetries in Rubik's cube. Next, we discuss different n-tuple representations for the cube, how we train the agents by reinforcement learning and how we improve the trained agents during evaluation by MCTS wrapping. We present results for agents that learn Rubik's cube from scratch, with and without MCTS wrapping, with and without symmetries, and show that both MCTS wrapping and symmetries increase computational costs but at the same time lead to much better results. We can solve the 2x2x2 cube completely, and the 3x3x3 cube in the majority of the cases for scrambled cubes up to p = 15 (QTM). We cannot yet reliably solve 3x3x3 cubes with more than 15 scrambling twists. Although our computational costs are higher with MCTS wrapping and with symmetries than without, they are still considerably lower than in the approaches of McAleer et al. (2018, 2019) and Agostinelli et al. (2019), who provide the best Rubik's cube learning agents so far.  ( 2 min )
  • Open

    Theoretical Perspectives on Deep Learning Methods in Inverse Problems. (arXiv:2206.14373v2 [stat.ML] UPDATED)
    In recent years, there have been significant advances in the use of deep learning methods in inverse problems such as denoising, compressive sensing, inpainting, and super-resolution. While this line of work has predominantly been driven by practical algorithms and experiments, it has also given rise to a variety of intriguing theoretical problems. In this paper, we survey some of the prominent theoretical developments in this line of work, focusing in particular on generative priors, untrained neural network priors, and unfolding algorithms. In addition to summarizing existing results in these topics, we highlight several ongoing challenges and open problems.  ( 2 min )
    Asymptotic Inference for Multi-Stage Stationary Treatment Policy with High Dimensional Features. (arXiv:2301.12553v1 [stat.ML])
    Dynamic treatment rules or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies for practice, namely multi-stage stationary treatment policies, prescribe treatment assignment probabilities using the same decision function over stages, where the decision is based on the same set of features consisting of both baseline variables (e.g., demographics) and time-evolving variables (e.g., routinely collected disease biomarkers). Although there has been extensive literature to construct valid inference for the value function associated with the dynamic treatment policies, little work has been done for the policies themselves, especially in the presence of high dimensional feature variables. We aim to fill in the gap in this work. Specifically, we first estimate the multistage stationary treatment policy based on an augmented inverse probability weighted estimator for the value function to increase the asymptotic efficiency, and further apply a penalty to select important feature variables. We then construct one-step improvement of the policy parameter estimators. Theoretically, we show that the improved estimators are asymptotically normal, even if nuisance parameters are estimated at a slow convergence rate and the dimension of the feature variables increases exponentially with the sample size. Our numerical studies demonstrate that the proposed method has satisfactory performance in small samples, and that the performance can be improved with a choice of the augmentation term that approximates the rewards or minimizes the variance of the value function.  ( 2 min )
    Solving high-dimensional Hamilton-Jacobi-Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. (arXiv:2005.05409v2 [math.OC] UPDATED)
    Optimal control of diffusion processes is intimately connected to the problem of solving certain Hamilton-Jacobi-Bellman equations. Building on recent machine learning inspired approaches towards high-dimensional PDEs, we investigate the potential of iterative diffusion optimisation techniques, in particular considering applications in importance sampling and rare event simulation, and focusing on problems without diffusion control, with linearly controlled drift and running costs that depend quadratically on the control. More generally, our methods apply to nonlinear parabolic PDEs with a certain shift invariance. The choice of an appropriate loss function being a central element in the algorithmic design, we develop a principled framework based on divergences between path measures, encompassing various existing methods. Motivated by connections to forward-backward SDEs, we propose and study the novel log-variance divergence, showing favourable properties of corresponding Monte Carlo estimators. The promise of the developed approach is exemplified by a range of high-dimensional and metastable numerical examples.  ( 2 min )
    Distributed Stochastic Optimization under a General Variance Condition. (arXiv:2301.12677v1 [math.OC])
    Distributed stochastic optimization has drawn great attention recently due to its effectiveness in solving large-scale machine learning problems. However, although numerous algorithms have been proposed with empirical success, their theoretical guarantees are restrictive and rely on certain boundedness conditions on the stochastic gradients, varying from uniform boundedness to the relaxed growth condition. In addition, how to characterize the data heterogeneity among the agents and its impact on the algorithmic performance remains challenging. In light of such motivations, we revisit the classical FedAvg algorithm for solving the distributed stochastic optimization problem and establish convergence results under only a mild variance condition on the stochastic gradients for smooth nonconvex objective functions. Almost sure convergence to a stationary point is also established under the condition. Moreover, we discuss a more informative measurement for data heterogeneity as well as its implications.  ( 2 min )
    Robust Stochastic Linear Contextual Bandits Under Adversarial Attacks. (arXiv:2106.02978v3 [stat.ML] UPDATED)
    Stochastic linear contextual bandit algorithms have substantial applications in practice, such as recommender systems, online advertising, and clinical trials. Recent works show that optimal bandit algorithms are vulnerable to adversarial attacks and can fail completely in the presence of attacks. Existing robust bandit algorithms only work for the non-contextual setting under attacks on rewards and cannot improve the robustness in the general and popular contextual bandit environment. In addition, none of the existing methods can defend against attacked context. In this work, we provide the first robust bandit algorithm for the stochastic linear contextual bandit setting under a fully adaptive and omniscient attack with sub-linear regret. Our algorithm works not only under attacks on rewards, but also under attacked context. Moreover, it does not need any information about the attack budget or the particular form of the attack. We provide theoretical guarantees for our proposed algorithm and show by experiments that our proposed algorithm improves the robustness against various kinds of popular attacks.  ( 2 min )
    Singularity-aware Reinforcement Learning. (arXiv:2301.13152v1 [stat.ML])
    Batch reinforcement learning (RL) aims at finding an optimal policy in a dynamic environment in order to maximize the expected total rewards by leveraging pre-collected data. A fundamental challenge behind this task is the distributional mismatch between the batch data generating process and the distribution induced by target policies. Nearly all existing algorithms rely on the absolute continuity assumption on the distribution induced by target policies with respect to the data distribution so that the batch data can be used to calibrate target policies via the change of measure. However, the absolute continuity assumption could be violated in practice, especially when the state-action space is large or continuous. In this paper, we propose a new batch RL algorithm without requiring absolute continuity in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable the power of model extrapolation. By leveraging the idea of pessimism and under some mild conditions, we derive a finite-sample regret guarantee for our proposed algorithm without imposing absolute continuity. Compared with existing algorithms, STEEL only requires a minimal data-coverage assumption and thus greatly enhances the applicability and robustness of batch RL. Extensive simulation studies and one real experiment on personalized pricing demonstrate the superior performance of our method when facing possible singularity in batch RL.  ( 2 min )
    Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning. (arXiv:2301.12593v1 [cs.LG])
    Many real-world domains require safe decision making in the presence of uncertainty. In this work, we propose a deep reinforcement learning framework for approaching this important problem. We consider a risk-averse perspective towards model uncertainty through the use of coherent distortion risk measures, and we show that our formulation is equivalent to a distributionally robust safe reinforcement learning problem with robustness guarantees on performance and safety. We propose an efficient implementation that only requires access to a single training environment, and we demonstrate that our framework produces robust, safe performance on a variety of continuous control tasks with safety constraints in the Real-World Reinforcement Learning Suite.  ( 2 min )
    MetaStackVis: Visually-Assisted Performance Evaluation of Metamodels. (arXiv:2212.03539v2 [cs.LG] UPDATED)
    Stacking (or stacked generalization) is an ensemble learning method with one main feature that distinguishes it from the rest: even though several base models are trained on the original data set, their predictions are further used as input data for one or more metamodels arranged in at least one extra layer. Composing a stack of models can produce high-performance outcomes, but it usually involves a trial-and-error process. Therefore, our previously developed visual analytics system, StackGenVis, was mainly designed to assist users in choosing a set of top-performing and diverse models by measuring their predictive performance. However, it only employs a single logistic regression metamodel. In this paper, we investigate the impact of alternative metamodels on the performance of stacking ensembles using a novel visualization tool, called MetaStackVis. Our interactive tool helps users to visually explore single metamodels and pairs of metamodels according to their predictive probabilities and multiple validation metrics, as well as their ability to predict specific problematic data instances. MetaStackVis was evaluated with a usage scenario based on a medical data set and via expert interviews.  ( 2 min )
    Deep Riemannian Networks for EEG Decoding. (arXiv:2212.10426v3 [cs.LG] UPDATED)
    State-of-the-art performance in electroencephalography (EEG) decoding tasks is currently often achieved with either Deep-Learning or Riemannian-Geometry-based decoders. Recently, there has been growing interest in Deep Riemannian Networks (DRNs), which may combine the advantages of both previous classes of methods. However, there is still a range of topics where additional insight is needed to pave the way for a more widespread application of DRNs in EEG. These include architecture design questions such as network size and end-to-end ability as well as model training questions. How these factors affect model performance has not been explored. Additionally, it is not clear how the data within these networks is transformed, and whether this would correlate with traditional EEG decoding. Our study aims to lay the groundwork in the area of these topics through the analysis of DRNs for EEG with a wide range of hyperparameters. Networks were tested on two public EEG datasets and compared with state-of-the-art ConvNets. Here we propose end-to-end EEG SPDNet (EE(G)-SPDNet), and we show that this wide, end-to-end DRN can outperform the ConvNets, and in doing so use physiologically plausible frequency regions. We also show that the end-to-end approach learns more complex filters than traditional band-pass filters targeting the classical alpha, beta, and gamma frequency bands of the EEG, and that performance can benefit from channel-specific filtering approaches. Additionally, architectural analysis revealed areas for further improvement due to the possible loss of Riemannian-specific information throughout the network. Our study thus shows how to design and train DRNs to infer task-related information from the raw EEG without the need for handcrafted filterbanks and highlights the potential of end-to-end DRNs such as EE(G)-SPDNet for high-performance EEG decoding.  ( 3 min )
    Thompson Sampling for High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2211.05964v2 [stat.ML] UPDATED)
    We consider the stochastic linear contextual bandit problem with high-dimensional features. We analyze the Thompson sampling algorithm using special classes of sparsity-inducing priors (e.g., spike-and-slab) to model the unknown parameter and provide a nearly optimal upper bound on the expected cumulative regret. To the best of our knowledge, this is the first work that provides theoretical guarantees of Thompson sampling in high-dimensional and sparse contextual bandits. For faster computation, we use variational inference instead of Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution. Extensive simulations demonstrate the improved performance of our proposed algorithm over existing ones.  ( 2 min )
    Inequality Constrained Stochastic Nonlinear Optimization via Active-Set Sequential Quadratic Programming. (arXiv:2109.11502v3 [math.OC] UPDATED)
    We study nonlinear optimization problems with a stochastic objective and deterministic equality and inequality constraints, which emerge in numerous applications including finance, manufacturing, power systems and, recently, deep neural networks. We propose an active-set stochastic sequential quadratic programming (StoSQP) algorithm that utilizes a differentiable exact augmented Lagrangian as the merit function. The algorithm adaptively selects the penalty parameters of the augmented Lagrangian and performs a stochastic line search to decide the stepsize. Global convergence is established: for any initialization, the KKT residuals converge to zero almost surely. Our algorithm and analysis further develop the prior work of Na et al. (2022). Specifically, we allow nonlinear inequality constraints without requiring the strict complementary condition; refine some of the designs in Na et al. (2022) such as the feasibility error condition and the monotonically increasing sample size; strengthen the global convergence guarantee; and improve the sample complexity on the objective Hessian. We demonstrate the performance of the designed algorithm on a subset of nonlinear problems collected in the CUTEst test set and on constrained logistic regression problems.  ( 2 min )
    Generalization on the Unseen, Logic Reasoning and Degree Curriculum. (arXiv:2301.13105v1 [cs.LG])
    This paper considers the learning of logical (Boolean) functions with a focus on the generalization on the unseen (GOTU) setting, a strong case of out-of-distribution generalization. This is motivated by the fact that the rich combinatorial nature of data in certain reasoning tasks (e.g., arithmetic/logic) makes representative data sampling challenging, and learning successfully under GOTU gives a first vignette of an 'extrapolating' or 'reasoning' learner. We then study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for a class of network models including instances of Transformers, random features models, and diagonal linear networks, a min-degree-interpolator (MDI) is learned on the unseen. We also provide evidence that other instances with larger learning rates or mean-field networks reach leaky MDIs. These findings lead to two implications: (1) we provide an explanation for the length generalization problem (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports.  ( 2 min )
    Mirror Sinkhorn: Fast Online Optimization on Transport Polytopes. (arXiv:2211.10420v2 [cs.LG] UPDATED)
    Optimal transport is an important tool in machine learning, allowing one to capture geometric properties of the data through a linear program on transport polytopes. We present a single-loop optimization algorithm for minimizing general convex objectives on these domains, utilizing the principles of Sinkhorn matrix scaling and mirror descent. The proposed algorithm is robust to noise, and can be used in an online setting. We provide theoretical guarantees for convex objectives and experimental results showcasing its effectiveness on both synthetic and real-world data.  ( 2 min )
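For orientation, the classical Sinkhorn matrix scaling that the abstract above builds on can be sketched as follows. This is the standard entropy-regularized scaling iteration, not the Mirror Sinkhorn algorithm of the paper; the function name `sinkhorn`, the regularization value, and the toy cost matrix are assumptions made for illustration.

```python
import math


def sinkhorn(cost, r, c, reg=0.5, n_iter=1000):
    """Classical Sinkhorn scaling toward the transport polytope U(r, c).

    cost: n x m list of lists; r, c: marginals that each sum to 1.
    Returns an n x m plan whose row sums are r and column sums are close to c.
    """
    n, m = len(r), len(c)
    # Gibbs kernel of the entropy-regularized problem
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # alternately rescale columns, then rows, to match the marginals
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.3, 0.7])
    for row in plan:
        print(row)
```

The two alternating rescalings are projections onto the row and column marginal constraints; smaller values of `reg` give plans closer to the unregularized optimum at the cost of slower convergence.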
    SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising. (arXiv:2105.07911v2 [cs.CL] UPDATED)
    In the text-to-SQL task, seq-to-seq models often lead to sub-optimal performance due to limitations in their architecture. In this paper, we present a simple yet effective approach that adapts a transformer-based seq-to-seq model to robust text-to-SQL generation. Instead of imposing constraints on the decoder or reformatting the task as slot-filling, we propose to train the seq-to-seq model with Schema-aware Denoising (SeaD), which consists of two denoising objectives that train the model to either recover the input or predict the output from two novel noises, erosion and shuffle. These denoising objectives act as auxiliary tasks for better modeling the structural data in S2S generation. In addition, we improve on execution-guided (EG) decoding and propose a clause-sensitive EG decoding strategy to overcome the limitations of EG decoding for generative models. The experiments show that the proposed method improves the performance of the seq-to-seq model in both schema linking and grammar correctness and establishes a new state of the art on the WikiSQL benchmark. The results indicate that the capacity of the vanilla seq-to-seq architecture for text-to-SQL may have been under-estimated.  ( 2 min )
    Jump Interval-Learning for Individualized Decision Making. (arXiv:2111.08885v2 [stat.ME] UPDATED)
    An individualized decision rule (IDR) is a decision function that assigns each individual a given treatment based on his/her observed characteristics. Most of the existing works in the literature consider settings with binary or finitely many treatment options. In this paper, we focus on the continuous treatment setting and propose jump interval-learning to develop an individualized interval-valued decision rule (I2DR) that maximizes the expected outcome. Unlike IDRs that recommend a single treatment, the proposed I2DR yields an interval of treatment options for each individual, making it more flexible to implement in practice. To derive an optimal I2DR, our jump interval-learning method estimates the conditional mean of the outcome given the treatment and the covariates via jump penalized regression, and derives the corresponding optimal I2DR based on the estimated outcome regression function. The regressor is allowed to be either linear for clear interpretation or a deep neural network to model complex treatment-covariate interactions. To implement jump interval-learning, we develop a searching algorithm based on dynamic programming that efficiently computes the outcome regression function. Statistical properties of the resulting I2DR are established when the outcome regression function is either a piecewise or continuous function over the treatment space. We further develop a procedure to infer the mean outcome under the (estimated) optimal policy. Extensive simulations and a real data application to a warfarin study are conducted to demonstrate the empirical validity of the proposed I2DR.  ( 2 min )
    Doubly Robust Interval Estimation for Optimal Policy Evaluation in Online Learning. (arXiv:2110.15501v3 [stat.ML] UPDATED)
    Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine and economics, to provide crucial instruction on the early-stop of the online experiment and timely feedback from the environment. Policy evaluation in online learning thus attracts increasing attention by inferring the mean outcome of the optimal policy (i.e., the value) in real-time. Yet, such a problem is particularly challenging due to the dependent data generated in the online environment, the unknown optimal policy, and the complex exploration and exploitation trade-off in the adaptive experiment. In this paper, we aim to overcome these difficulties in policy evaluation for online learning. We explicitly derive the probability of exploration that quantifies the probability of exploring the non-optimal actions under commonly used bandit algorithms. We use this probability to conduct valid inference on the online conditional mean estimator under each action and develop the doubly robust interval estimation (DREAM) method to infer the value under the estimated optimal policy in online learning. The proposed value estimator provides double protection on the consistency and is asymptotically normal with a Wald-type confidence interval provided. Extensive simulations and real data applications are conducted to demonstrate the empirical validity of the proposed DREAM method.  ( 2 min )
    Spherical Sliced-Wasserstein. (arXiv:2206.08780v2 [stat.ML] UPDATED)
    Many variants of the Wasserstein distance have been introduced to reduce its original computational burden. In particular the Sliced-Wasserstein distance (SW), which leverages one-dimensional projections for which a closed-form solution of the Wasserstein distance is available, has received a lot of interest. Yet, it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction is notably based on closed-form solutions of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: sampling on the sphere, density estimation on real earth data or hyperspherical auto-encoders.  ( 2 min )
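The closed-form one-dimensional solution that the abstract above refers to, and the Euclidean Sliced-Wasserstein distance built from it, can be sketched as below. This is the standard Euclidean construction estimated by Monte Carlo over random directions, not the spherical variant proposed in the paper; it assumes two empirical samples of equal size with uniform weights, and the function names are hypothetical.

```python
import math
import random


def wasserstein_1d(xs, ys, p=2):
    """Closed-form p-Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    xs, ys = sorted(xs), sorted(ys)
    # in 1D the optimal coupling matches the sorted samples in order
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1.0 / p)


def sliced_wasserstein(X, Y, n_proj=200, p=2, seed=0):
    """Monte Carlo estimate of the Euclidean Sliced-Wasserstein distance."""
    rng = random.Random(seed)
    d = len(X[0])
    total = 0.0
    for _ in range(n_proj):
        # draw a direction uniformly on the unit sphere via a normalized Gaussian
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in g))
        theta = [t / norm for t in g]
        # project both samples onto the direction and solve the 1D problem
        px = [sum(a * b for a, b in zip(x, theta)) for x in X]
        py = [sum(a * b for a, b in zip(y, theta)) for y in Y]
        total += wasserstein_1d(px, py, p) ** p
    return (total / n_proj) ** (1.0 / p)


if __name__ == "__main__":
    X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    Y = [[x + 2.0, y] for x, y in X]
    print(sliced_wasserstein(X, Y))
```

Each projection costs only a sort, which is the source of the computational savings over solving the full transport problem.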
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v3 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.  ( 2 min )
    A semi-agnostic ansatz with variable structure for quantum machine learning. (arXiv:2103.06712v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) offers a powerful, flexible paradigm for programming near-term quantum computers, with applications in chemistry, metrology, materials science, data science, and mathematics. Here, one trains an ansatz, in the form of a parameterized quantum circuit, to accomplish a task of interest. However, challenges have recently emerged suggesting that deep ansatzes are difficult to train, due to flat training landscapes caused by randomness or by hardware noise. This motivates our work, where we present a variable structure approach to build ansatzes for QML. Our approach, called VAns (Variable Ansatz), applies a set of rules to both grow and (crucially) remove quantum gates in an informed manner during the optimization. Consequently, VAns is ideally suited to mitigate trainability and noise-related issues by keeping the ansatz shallow. We employ VAns in the variational quantum eigensolver for condensed matter and quantum chemistry applications, in the quantum autoencoder for data compression and in unitary compilation problems showing successful results in all cases.  ( 2 min )
    Online Self-Concordant and Relatively Smooth Minimization, With Applications to Online Portfolio Selection and Learning Quantum States. (arXiv:2210.00997v2 [stat.ML] UPDATED)
    Consider an online convex optimization problem where the loss functions are self-concordant barriers, smooth relative to a convex function $h$, and possibly non-Lipschitz. We analyze the regret of online mirror descent with $h$. Then, based on the result, we prove the following in a unified manner. Denote by $T$ the time horizon and $d$ the parameter dimension. 1. For online portfolio selection, the regret of $\widetilde{\text{EG}}$, a variant of exponentiated gradient due to Helmbold et al., is $\tilde{O} ( T^{2/3} d^{1/3} )$ when $T > 4 d / \log d$. This improves on the original $\tilde{O} ( T^{3/4} d^{1/2} )$ regret bound for $\widetilde{\text{EG}}$. 2. For online portfolio selection, the regret of online mirror descent with the logarithmic barrier is $\tilde{O}(\sqrt{T d})$. The regret bound is the same as that of Soft-Bayes due to Orseau et al. up to logarithmic terms. 3. For online learning quantum states with the logarithmic loss, the regret of online mirror descent with the log-determinant function is also $\tilde{O} ( \sqrt{T d} )$. Its per-iteration time is shorter than all existing algorithms we know.  ( 2 min )
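As background for item 1 of the abstract above, the basic exponentiated gradient (EG) update for online portfolio selection can be sketched as follows. This is the plain EG rule, not the $\widetilde{\text{EG}}$ variant analyzed in the paper; the learning rate, the function names, and the toy price relatives are assumptions chosen for illustration.

```python
import math


def eg_update(w, x, eta=0.05):
    """One exponentiated-gradient step on the log-loss -log(w . x).

    w: current portfolio on the simplex; x: price relatives for this round.
    """
    wx = sum(wi * xi for wi, xi in zip(w, x))
    # multiplicative update using the gradient of log(w . x), which is x / (w . x)
    unnorm = [wi * math.exp(eta * xi / wx) for wi, xi in zip(w, x)]
    z = sum(unnorm)
    return [ui / z for ui in unnorm]  # renormalize back onto the simplex


def run_eg(price_relatives, eta=0.05):
    """Run EG from the uniform portfolio; return final weights and wealth."""
    d = len(price_relatives[0])
    w = [1.0 / d] * d
    wealth = 1.0
    for x in price_relatives:
        wealth *= sum(wi * xi for wi, xi in zip(w, x))
        w = eg_update(w, x, eta)
    return w, wealth


if __name__ == "__main__":
    # hypothetical market: asset 0 gains 10% and asset 1 loses 10% each round
    print(run_eg([[1.1, 0.9]] * 50))
```

The multiplicative form corresponds to online mirror descent with the negative entropy as the mirror map, which keeps the weights positive and on the simplex without an explicit projection.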
    Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities. (arXiv:2301.12985v1 [stat.ML])
    Observational studies require adjustment for confounding factors that are correlated with both the treatment and outcome. In the setting where the observed variables are tabular quantities such as average income in a neighborhood, tools have been developed for addressing such confounding. However, in many parts of the developing world, features about local communities may be scarce. In this context, satellite imagery can play an important role, serving as a proxy for the confounding variables otherwise unobserved. In this paper, we study confounder adjustment in this non-tabular setting, where patterns or objects found in satellite images contribute to the confounder bias. Using the evaluation of anti-poverty aid programs in Africa as our running example, we formalize the challenge of performing causal adjustment with such unstructured data -- what conditions are sufficient to identify causal effects, how to perform estimation, and how to quantify the ways in which certain aspects of the unstructured image object are most predictive of the treatment decision. Via simulation, we also explore the sensitivity of satellite image-based observational inference to image resolution and to misspecification of the image-associated confounder. Finally, we apply these tools in estimating the effect of anti-poverty interventions in African communities from satellite imagery.  ( 2 min )
    Lossy Image Compression with Conditional Diffusion Models. (arXiv:2209.06950v4 [eess.IV] UPDATED)
    This paper outlines an end-to-end optimized lossy image compression framework using diffusion generative models. The approach relies on the transform coding paradigm, where an image is mapped into a latent space for entropy coding and, from there, mapped back to the data space for reconstruction. In contrast to VAE-based neural compression, where the (mean) decoder is a deterministic neural network, our decoder is a conditional diffusion model. Our approach thus introduces an additional ``content'' latent variable on which the reverse diffusion process is conditioned and uses this variable to store information about the image. The remaining ``texture'' latent variables characterizing the diffusion process are synthesized (stochastically or deterministically) at decoding time. We show that the model's performance can be tuned toward perceptual metrics of interest. Our extensive experiments involving five datasets and sixteen image quality assessment metrics show that our approach yields the strongest reported FID scores while also yielding competitive performance with state-of-the-art models in several SIM-based reference metrics.  ( 2 min )
    Data Heterogeneity Differential Privacy: From Theory to Algorithm. (arXiv:2002.08578v2 [cs.LG] UPDATED)
    Traditionally, in the field of differential privacy (DP), random noise is injected equally when training with different data instances. In this paper, we first give sharper excess risk bounds for the DP stochastic gradient descent (SGD) method. Since most previous methods assume convexity, we use the Polyak-Łojasiewicz condition to relax this assumption. Then, after observing that different training data instances affect the machine learning model to different extents, we consider the heterogeneity of training data and attempt to improve the performance of DP-SGD from a new perspective. Specifically, by introducing the influence function (IF), we quantitatively measure the contributions of various training data on the final machine learning model. If the contribution made by a single data instance is so little that attackers cannot infer anything from the model, we do not add noise when training with it. Based on this observation, we design a 'Performance Improving' DP-SGD algorithm: PIDP-SGD. Theoretical and experimental results show that our proposed PIDP-SGD improves the performance significantly.  ( 2 min )
    Efficient functional estimation and the super-oracle phenomenon. (arXiv:1904.09347v2 [math.ST] UPDATED)
    We consider the estimation of two-sample integral functionals, of the type that occur naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in wide generality, a weighted nearest neighbour estimator is efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional, having asymptotically minimal width. One interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural `oracle' estimator, which is given access to the values of the unknown densities at the observations.  ( 2 min )
    Policy Targeting under Network Interference. (arXiv:1906.10258v13 [econ.EM] UPDATED)
    This paper studies the problem of optimally allocating treatments in the presence of spillover effects, using information from a (quasi-)experiment. I introduce a method that maximizes the sample analog of average social welfare when spillovers occur. I construct semi-parametric welfare estimators with known and unknown propensity scores and cast the optimization problem into a mixed-integer linear program, which can be solved using off-the-shelf algorithms. I derive a strong set of guarantees on regret, i.e., the difference between the maximum attainable welfare and the welfare evaluated at the estimated policy. The proposed method presents attractive features for applications: (i) it does not require network information of the target population; (ii) it exploits heterogeneity in treatment effects for targeting individuals; (iii) it does not rely on the correct specification of a particular structural model; and (iv) it accommodates constraints on the policy function. An application for targeting information on social networks illustrates the advantages of the method.  ( 2 min )
    Learning Mixtures of Markov Chains and MDPs. (arXiv:2211.09403v2 [stat.ML] UPDATED)
    We present an algorithm for learning mixtures of Markov chains and Markov decision processes (MDPs) from short unlabeled trajectories. Specifically, our method handles mixtures of Markov chains with optional control input by going through a multi-step process, involving (1) a subspace estimation step, (2) spectral clustering of trajectories using "pairwise distance estimators," along with refinement using the EM algorithm, (3) a model estimation step, and (4) a classification step for predicting labels of new trajectories. We provide end-to-end performance guarantees, where we only explicitly require the length of trajectories to be linear in the number of states and the number of trajectories to be linear in a mixing time parameter. Experimental results support these guarantees, where we attain 96.6% average accuracy on a mixture of two MDPs in gridworld, outperforming the EM algorithm with random initialization (73.2% average accuracy).  ( 2 min )
    Technical Reports Compilation: Detecting the Fire Drill Anti-pattern Using Source Code and Issue-Tracking Data. (arXiv:2104.15090v8 [cs.SE] UPDATED)
    Detecting the presence of project management anti-patterns (AP) currently requires experts on the matter and is an expensive endeavor. Worse, experts may introduce their individual subjectivity or bias. Using the Fire Drill AP, we first introduce a novel way to translate descriptions into detectable AP that are comprised of arbitrary metrics and events such as logged time or maintenance activities, which are mined from the underlying source code or issue-tracking data, thus making the description objective as it becomes data-based. Secondly, we demonstrate a novel method to quantify and score the deviations of real-world projects to data-based AP descriptions. Using fifteen real-world projects that exhibit a Fire Drill to some degree, we show how to further enhance the translated AP. The ground truth in these projects was extracted from two individual experts and consensus was found between them. We introduce a novel method called automatic calibration, that optimizes a pattern such that only necessary and important scores remain that suffice to confidently detect the degree to which the AP is present. Without automatic calibration, the proposed patterns show only weak potential for detecting the presence. Enriching the AP with data from real-world projects significantly improves the potential. We also introduce a no-pattern approach that exploits the ground truth for establishing a new, quantitative understanding of the phenomenon, as well as for finding gray-/black-box predictive models. We conclude that the presence detection and severity assessment of the Fire Drill anti-pattern, as well as some of its related and similar patterns, is certainly possible using some of the presented approaches.  ( 3 min )
    TOAST: Topological Algorithm for Singularity Tracking. (arXiv:2210.00069v2 [cs.LG] UPDATED)
    The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.  ( 2 min )
    Factor-augmented tree ensembles. (arXiv:2111.14000v4 [stat.ML] UPDATED)
    This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods. In doing so, this approach generalises time-series regression trees on two dimensions. First, it allows handling predictors that exhibit measurement error, non-stationary trends, seasonality and/or irregularities such as missing observations. Second, it gives a transparent way of using domain-specific theory to inform time-series regression trees. As a byproduct, this technique sets the foundations for structuring powerful ensembles. Their real-world applicability is studied through the lens of empirical macro-finance.  ( 2 min )
    Transfer learning for chemically accurate interatomic neural network potentials. (arXiv:2212.03916v2 [physics.comp-ph] UPDATED)
    Developing machine learning-based interatomic potentials from ab-initio electronic structure methods remains a challenging task for computational chemistry and materials science. This work studies the capability of transfer learning, in particular discriminative fine-tuning, for efficiently generating chemically accurate interatomic neural network potentials on organic molecules from the MD17 and ANI data sets. We show that pre-training the network parameters on data obtained from density functional calculations considerably improves the sample efficiency of models trained on more accurate ab-initio data. Additionally, we show that fine-tuning with energy labels alone can suffice to obtain accurate atomic forces and run large-scale atomistic simulations, provided a well-designed fine-tuning data set. We also investigate possible limitations of transfer learning, especially regarding the design and size of the pre-training and fine-tuning data sets. Finally, we provide GM-NN potentials pre-trained and fine-tuned on the ANI-1x and ANI-1ccx data sets, which can easily be fine-tuned on and applied to organic molecules.  ( 2 min )
    Robust empirical risk minimization via Newton's method. (arXiv:2301.13192v1 [stat.ML])
    We study a variant of Newton's method for empirical risk minimization, where at each iteration of the optimization algorithm, we replace the gradient and Hessian of the objective function by robust estimators taken from existing literature on robust mean estimation for multivariate data. After proving a general theorem about the convergence of successive iterates to a small ball around the population-level minimizer, we study consequences of our theory in generalized linear models, when data are generated from Huber's epsilon-contamination model and/or heavy-tailed distributions. We also propose an algorithm for obtaining robust Newton directions based on the conjugate gradient method, which may be more appropriate for high-dimensional settings, and provide conjectures about the convergence of the resulting algorithm. Compared to the robust gradient descent algorithm proposed by Prasad et al. (2020), our algorithm enjoys the faster rates of convergence for successive iterates often achieved by second-order algorithms for convex problems, i.e., quadratic convergence in a neighborhood of the optimum, with a stepsize that may be chosen adaptively via backtracking linesearch.  ( 2 min )
    A Novel Framework for Policy Mirror Descent with General Parametrization and Linear Convergence. (arXiv:2301.13139v1 [stat.ML])
    Modern policy optimization methods in applied reinforcement learning are often inspired by the trust region policy optimization algorithm, which can be interpreted as a particular instance of policy mirror descent. While theoretical guarantees have been established for this framework, particularly in the tabular setting, the use of a general parametrization scheme remains mostly unjustified. In this work, we introduce a novel framework for policy optimization based on mirror descent that naturally accommodates general parametrizations. The policy class induced by our scheme recovers known classes, e.g. tabular softmax, log-linear, and neural policies. It also generates new ones, depending on the choice of the mirror map. For a general mirror map and parametrization function, we establish the quasi-monotonicity of the updates in value function, global linear convergence rates, and we bound the total variation of the algorithm along its path. To showcase the ability of our framework to accommodate general parametrization schemes, we present a case study involving shallow neural networks.  ( 2 min )
    GFlowNets and variational inference. (arXiv:2210.00580v2 [cs.LG] UPDATED)
    This paper builds bridges between two families of probabilistic algorithms: (hierarchical) variational inference (VI), which is typically used to model distributions over continuous spaces, and generative flow networks (GFlowNets), which have been used for distributions over discrete structures such as graphs. We demonstrate that, in certain cases, VI algorithms are equivalent to special cases of GFlowNets in the sense of equality of expected gradients of their learning objectives. We then point out the differences between the two families and show how these differences emerge experimentally. Notably, GFlowNets, which borrow ideas from reinforcement learning, are more amenable than VI to off-policy training without the cost of high gradient variance induced by importance sampling. We argue that this property of GFlowNets can provide advantages for capturing diversity in multimodal target distributions.  ( 2 min )
    Selecting time-series hyperparameters with the artificial jackknife. (arXiv:2002.04697v5 [stat.ME] UPDATED)
    This article proposes a generalisation of the delete-$d$ jackknife to solve hyperparameter selection problems for time series. I call it artificial delete-$d$ jackknife to stress that this approach substitutes the classic removal step with a fictitious deletion, wherein observed datapoints are replaced with artificial missing values. This procedure keeps the data order intact and allows plain compatibility with time series. This manuscript justifies the use of this approach asymptotically and shows its finite-sample advantages through simulation studies. Besides, this article describes its real-world advantages by regulating high-dimensional forecasting models for foreign exchange rates.  ( 2 min )
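    The fictitious deletion described in the abstract above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `artificial_delete_d` and the choice to enumerate every subset of size d are assumptions made here for clarity, and a practical version would presumably draw a random subset of the combinations.

```python
import itertools
import math

def artificial_delete_d(series, d):
    """Yield (indices, masked_copy) pairs for a delete-d style subsample.

    Hypothetical helper: instead of removing d observations, it replaces
    them with NaN, so the length and the ordering of the series are kept.
    """
    n = len(series)
    for idx in itertools.combinations(range(n), d):
        masked = list(series)
        for i in idx:
            masked[i] = math.nan  # artificial missing value
        yield idx, masked

series = [1.0, 2.0, 3.0, 4.0]
subsamples = list(artificial_delete_d(series, d=2))
for idx, masked in subsamples:
    print(idx, masked)
```

    Each subsample can then be passed to any estimator that tolerates missing observations, and the resulting forecast errors averaged per hyperparameter candidate.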
    Scalable Spatiotemporally Varying Coefficient Modelling with Bayesian Kernelized Tensor Regression. (arXiv:2109.00046v3 [stat.ML] UPDATED)
    As a regression technique in spatial statistics, the spatiotemporally varying coefficient model (STVC) is an important tool for discovering nonstationary and interpretable response-covariate associations over both space and time. However, it is difficult to apply STVC for large-scale spatiotemporal analyses due to its high computational cost. To address this challenge, we summarize the spatiotemporally varying coefficients using a third-order tensor structure and propose to reformulate the spatiotemporally varying coefficient model as a special low-rank tensor regression problem. The low-rank decomposition can effectively model the global patterns of the large data sets with a substantially reduced number of parameters. To further incorporate the local spatiotemporal dependencies, we use Gaussian process (GP) priors on the spatial and temporal factor matrices. We refer to the overall framework as Bayesian Kernelized Tensor Regression (BKTR). For model inference, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm, which uses Gibbs sampling to update factor matrices and slice sampling to update kernel hyperparameters. We conduct extensive experiments on both synthetic and real-world data sets, and our results confirm the superior performance and efficiency of BKTR for model estimation and parameter inference.  ( 2 min )
    Gaussian Process Hydrodynamics. (arXiv:2209.10707v3 [physics.flu-dyn] UPDATED)
    We present a Gaussian Process (GP) approach (Gaussian Process Hydrodynamics, GPH) for approximating the solution of the Euler and Navier-Stokes equations. As in Smoothed Particle Hydrodynamics (SPH), GPH is a Lagrangian particle-based approach involving the tracking of a finite number of particles transported by the flow. However, these particles do not represent mollified particles of matter but carry discrete/partial information about the continuous flow. Closure is achieved by placing a divergence-free GP prior $\xi$ on the velocity field and conditioning on vorticity at particle locations. Known physics (e.g., the Richardson cascade and velocity-increments power laws) is incorporated into the GP prior through physics-informed additive kernels. This approach allows us to coarse-grain turbulence in a statistical manner rather than a deterministic one. By enforcing incompressibility and fluid/structure boundary conditions through the selection of the kernel, GPH requires far fewer particles than SPH. Since GPH has a natural probabilistic interpretation, numerical results come with uncertainty estimates, enabling their incorporation into a UQ pipeline and the adding/removing of particles (quanta of information) in an adapted manner. The proposed approach is amenable to analysis, it inherits the complexity of state-of-the-art solvers for dense kernel matrices, and it leads to a natural definition of turbulence as information loss. Numerical experiments support the importance of selecting physics-informed kernels and illustrate the major impact of such kernels on accuracy and stability. Since the proposed approach has a Bayesian interpretation, it naturally enables data assimilation and making predictions and estimations based on mixing simulation data with experimental data.  ( 2 min )
    Variational Neural Networks. (arXiv:2207.01524v3 [cs.LG] UPDATED)
    Bayesian Neural Networks (BNNs) provide a tool to estimate the uncertainty of a neural network by considering a distribution over weights and sampling different models for each input. In this paper, we propose a method for uncertainty estimation in neural networks which, instead of considering a distribution over weights, samples outputs of each layer from a corresponding Gaussian distribution, parametrized by the predictions of mean and variance sub-layers. In uncertainty quality estimation experiments, we show that the proposed method achieves better uncertainty quality than other single-bin Bayesian Model Averaging methods, such as Monte Carlo Dropout or Bayes By Backpropagation methods.  ( 2 min )
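    The layer described in the abstract above, whose output is sampled from a Gaussian parametrized by mean and variance sub-layers, might look roughly like the sketch below. Everything here is an illustrative assumption: the sub-layers are plain affine maps, the variance is made positive with a softplus, and the names `linear`, `softplus` and `variational_layer` are not taken from the paper.

```python
import math
import random

def linear(weights, bias, x):
    """Affine map: `weights` is a list of rows, `bias` and `x` are lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softplus(z):
    """Smooth positive transform (assumed here to keep the variance > 0)."""
    return math.log1p(math.exp(z))

def variational_layer(x, mean_params, var_params, rng):
    """Sample the layer output from N(mean_sublayer(x), var_sublayer(x))."""
    mu = linear(mean_params[0], mean_params[1], x)
    var = [softplus(v) for v in linear(var_params[0], var_params[1], x)]
    return [rng.gauss(m, math.sqrt(s)) for m, s in zip(mu, var)]

# Repeated stochastic forward passes give a predictive mean and spread.
rng = random.Random(0)
mean_params = ([[1.0, -1.0]], [0.5])   # weights, bias of the mean sub-layer
var_params = ([[0.0, 0.0]], [0.0])     # weights, bias of the variance sub-layer
x = [2.0, 1.0]
samples = [variational_layer(x, mean_params, var_params, rng)[0]
           for _ in range(2000)]
pred_mean = sum(samples) / len(samples)
pred_std = math.sqrt(sum((s - pred_mean) ** 2 for s in samples) / len(samples))
print(pred_mean, pred_std)
```

    The spread of the sampled outputs plays the role of the uncertainty estimate; with the toy parameters above the mean sub-layer outputs 1.5 and the variance sub-layer outputs softplus(0) = ln 2.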
    SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks. (arXiv:2206.05794v3 [cs.LG] UPDATED)
    In this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.  ( 2 min )
    Large-scale Model Personalization via Low Rank and Sparse decomposition. (arXiv:2210.03505v2 [cs.LG] UPDATED)
    Personalization of machine learning (ML) predictions for individual users/domains/enterprises is critical for practical recommendation-style systems. Standard personalization approaches involve learning a user/domain-specific embedding that is fed into a fixed global model, which can be limiting. On the other hand, personalizing/fine-tuning the model itself for each user/domain (a.k.a. meta-learning) has a high storage/infrastructure cost. We propose a novel meta-learning-style approach that models network weights as a sum of low-rank and sparse matrices. This captures common information from multiple individuals/users together in the low-rank part, while the sparse part captures user-specific idiosyncrasies. Furthermore, the framework is up to two orders of magnitude more scalable (in terms of storage/infrastructure cost) than user-specific fine-tuning of the model. We then study the framework in the linear setting, where the problem reduces to that of estimating the sum of a rank-$r$ and a $k$-column sparse matrix using a small number of linear measurements. We propose an alternating minimization method with iterative hard thresholding, AMHT-LRS, to learn the low-rank and sparse parts. For the realizable, Gaussian data setting, we show that AMHT-LRS solves the problem efficiently with nearly optimal sample complexity. A significant challenge in personalization is ensuring the privacy of each user's sensitive data. We alleviate this problem by proposing a differentially private variant of our method that is also equipped with strong generalization guarantees. Finally, on multiple standard recommendation datasets, we demonstrate that our approach allows personalized models to obtain superior performance in the sparse data regime.  ( 2 min )
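    The two ingredients named in the abstract above, a shared low-rank part plus a per-user sparse part and a hard-thresholding step, can be illustrated with a toy sketch. This is not AMHT-LRS: the rank-1 shared part, the entry-wise thresholding and the helper names `hard_threshold` and `personalized_weights` are simplifying assumptions made here.

```python
def hard_threshold(vec, k):
    """Keep the k largest-magnitude entries of `vec` and zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def personalized_weights(u, v, sparse):
    """Return u v^T + sparse: a shared rank-1 part plus a user-specific part."""
    return [[ui * vj + sparse[i][j] for j, vj in enumerate(v)]
            for i, ui in enumerate(u)]

# Shared factors are stored once; each user only stores a sparse correction.
u, v = [1.0, 2.0], [3.0, 4.0]
user_sparse = [[0.0, 0.0], [0.0, 1.0]]
print(personalized_weights(u, v, user_sparse))
print(hard_threshold([0.1, -3.0, 0.5, 2.0], k=2))
```

    The storage argument follows from this layout: the factors `u` and `v` are shared across users, so the per-user cost is only the handful of nonzero entries in the sparse correction.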
    Safe and Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm. (arXiv:2211.05495v2 [cs.LG] UPDATED)
    We consider the problem of decision-making under uncertainty in an environment with safety constraints. Many business and industrial applications rely on real-time optimization to improve key performance indicators. In the case of unknown characteristics, real-time optimization becomes challenging, particularly because of the satisfaction of safety constraints. We propose the ARTEO algorithm, where we cast multi-armed bandits as a mathematical programming problem subject to safety constraints and learn the unknown characteristics through exploration while optimizing the targets. We quantify the uncertainty in unknown characteristics by using Gaussian processes and incorporate it into the cost function as a contribution which drives exploration. We adaptively control the size of this contribution in accordance with the requirements of the environment. We guarantee the safety of our algorithm with a high probability through confidence bounds constructed under the regularity assumptions of Gaussian processes. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to a safe variant of upper confidence bound based algorithms. ARTEO achieves less cumulative regret with accurate and safe decisions.  ( 2 min )
    Improved High-Probability Regret for Adversarial Bandits with Time-Varying Feedback Graphs. (arXiv:2210.01376v2 [cs.LG] UPDATED)
    We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds. For general strongly observable graphs, we develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}((\sum_{t=1}^T\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t)$ with high probability, where $\alpha_t$ is the independence number of the feedback graph at round $t$. Compared to the best existing result [Neu, 2015] which only considers graphs with self-loops for all nodes, our result not only holds more generally, but importantly also removes any $\text{poly}(K)$ dependence that can be prohibitively large for applications such as contextual bandits. Furthermore, we also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs, which even improves the best expected regret bound of [Alon et al., 2015] by removing the $\mathcal{O}(\sqrt{KT})$ term with a refined analysis. Our algorithms are based on the online mirror descent framework, but importantly with an innovative combination of several techniques. Notably, while earlier works use optimistic biased loss estimators for achieving high-probability bounds, we find it important to use a pessimistic one for nodes without self-loop in a strongly observable graph.  ( 2 min )
    Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods. (arXiv:2301.13006v1 [cs.LG])
    Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.  ( 2 min )
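    For orientation, the classical Sinkhorn iteration that the abstract above uses as a point of comparison can be sketched in a few lines. This is the baseline, not the extragradient method proposed in the paper; the regularization strength, iteration count and toy cost matrix are arbitrary choices made here.

```python
import math

def sinkhorn(a, b, cost, reg, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn scaling)."""
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost, reg=0.5)
transport_cost = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
print(plan, transport_cost)
```

    Each iteration touches every entry of the kernel once, which is where the quadratic dependence on the dimension $n$ in such methods comes from.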
    Refined Regret for Adversarial MDPs with Linear Function Approximation. (arXiv:2301.12942v1 [cs.LG])
    We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order $\tilde{\mathcal O}(K^{2/3})$ (omitting all other dependencies), given access to a simulator. This paper provides two algorithms that improve the regret to $\tilde{\mathcal O}(\sqrt K)$ in the same setting. Our first algorithm makes use of a refined analysis of the Follow-the-Regularized-Leader (FTRL) algorithm with the log-barrier regularizer. This analysis allows the loss estimators to be arbitrarily negative and might be of independent interest. Our second algorithm develops a magnitude-reduced loss estimator, further removing the polynomial dependency on the number of actions in the first algorithm and leading to the optimal regret bound (up to logarithmic terms and dependency on the horizon). Moreover, we also extend the first algorithm to simulator-free linear MDPs, which achieves $\tilde{\mathcal O}(K^{8/9})$ regret and greatly improves over the best existing bound $\tilde{\mathcal O}(K^{14/15})$. This algorithm relies on a better alternative to the Matrix Geometric Resampling procedure by Neu & Olkhovskaya (2020), which could again be of independent interest.  ( 2 min )
    Prediction of Customer Churn in Banking Industry. (arXiv:2301.13099v1 [stat.ML])
    With the growing competition in the banking industry, banks are required to follow customer retention strategies while they are trying to increase their market share by acquiring new customers. This study compares the performance of six supervised classification techniques to suggest an efficient model to predict customer churn in the banking industry, given 10 demographic and personal attributes from 10000 customers of European banks. The effect of feature selection, class imbalance, and outliers is discussed for ANN and random forest as the two competing models. As shown, unlike random forest, ANN does not reveal any serious concern regarding overfitting and is also robust to noise. Therefore, an ANN structure with five nodes in a single hidden layer is recognized as the best-performing classifier.  ( 2 min )
    MixFlows: principled variational inference via mixed flows. (arXiv:2205.07475v3 [stat.ML] UPDATED)
    This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.  ( 2 min )
    Reversible Gromov-Monge Sampler for Simulation-Based Inference. (arXiv:2109.14090v4 [stat.ME] UPDATED)
    This paper introduces a new simulation-based inference procedure to model and sample from multi-dimensional probability distributions given access to i.i.d. samples, circumventing the usual approaches of explicitly modeling the density function or designing Markov chain Monte Carlo. Motivated by the seminal work on distance and isomorphism between metric measure spaces, we propose a new notion called the Reversible Gromov-Monge (RGM) distance and study how RGM can be used to design new transform samplers to perform simulation-based inference. Our RGM sampler can also estimate optimal alignments between two heterogeneous metric measure spaces $(\mathcal{X}, \mu, c_{\mathcal{X}})$ and $(\mathcal{Y}, \nu, c_{\mathcal{Y}})$ from empirical data sets, with estimated maps that approximately push forward one measure $\mu$ to the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that under mild conditions, RGM equals the classic Gromov-Wasserstein distance. Curiously, drawing a connection to Brenier's polar factorization, we show that the RGM sampler induces bias towards strong isomorphism with proper choices of $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$. Statistical rate of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples showcasing the effectiveness of the RGM sampler are also demonstrated.  ( 2 min )
    Better Uncertainty Calibration via Proper Scores for Classification and Beyond. (arXiv:2203.07835v3 [cs.LG] UPDATED)
    With model trustworthiness being crucial for sensitive real-world applications, practitioners are putting more and more focus on improving the uncertainty calibration of deep neural networks. Calibration errors are designed to quantify the reliability of probabilistic predictions but their estimators are usually biased and inconsistent. In this work, we introduce the framework of proper calibration errors, which relates every calibration error to a proper score and provides a respective upper bound with optimal estimation properties. This relationship can be used to reliably quantify the model calibration improvement. We theoretically and empirically demonstrate the shortcomings of commonly used estimators compared to our approach. Due to the wide applicability of proper scores, this gives a natural extension of recalibration beyond classification.  ( 2 min )
    Accelerating Kernel Classifiers Through Borders Mapping. (arXiv:1708.05917v6 [stat.ML] UPDATED)
    Support vector machines (SVM) and other kernel techniques represent a family of powerful statistical classification methods with high accuracy and broad applicability. Because they use all or a significant portion of the training data, however, they can be slow, especially for large problems. Piecewise linear classifiers are similarly versatile, yet have the additional advantages of simplicity, ease of interpretation and, if the number of component linear classifiers is not too large, speed. Here we show how a simple, piecewise linear classifier can be trained from a kernel-based classifier in order to improve the classification speed. The method works by finding the root of the difference in conditional probabilities between pairs of opposite classes to build up a representation of the decision boundary. When tested on 17 different datasets, it succeeded in improving the classification speed of an SVM for 12 of them by up to two orders of magnitude. Of these, two were less accurate than a simple, linear classifier. The method is best suited to problems with continuous feature data and smooth probability functions. Because the component linear classifiers are built up individually from an existing classifier, rather than through a simultaneous optimization procedure, the classifier is also fast to train.  ( 2 min )
    Stationary Kernels and Gaussian Processes on Lie Groups and their Homogeneous Spaces II: non-compact symmetric spaces. (arXiv:2301.13088v1 [stat.ME])
    Gaussian processes are arguably the most important class of spatiotemporal models within machine learning. They encode prior information about the modeled function and can be used for exact or approximate Bayesian learning. In many applications, particularly in physical sciences and engineering, but also in areas such as geostatistics and neuroscience, invariance to symmetries is one of the most fundamental forms of prior information one can consider. The invariance of a Gaussian process' covariance to such symmetries gives rise to the most natural generalization of the concept of stationarity to such spaces. In this work, we develop constructive and practical techniques for building stationary Gaussian processes on a very large class of non-Euclidean spaces arising in the context of symmetries. Our techniques make it possible to (i) calculate covariance kernels and (ii) sample from prior and posterior Gaussian processes defined on such spaces, both in a practical manner. This work is split into two parts, each involving different technical considerations: part I studies compact spaces, while part II studies non-compact spaces possessing certain structure. Our contributions make the non-Euclidean Gaussian process models we study compatible with well-understood computational techniques available in standard Gaussian process software packages, thereby making them accessible to practitioners.  ( 2 min )
    On the Sample Complexity of Actor-Critic Method for Reinforcement Learning with Function Approximation. (arXiv:1910.08412v3 [cs.LG] UPDATED)
    Reinforcement learning, mathematically described by Markov Decision Problems, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradient updates, only the asymptotic behavior of actor-critic is known, obtained by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of the stochastic gradient method for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and on the pendulum problem, which provide insight into the interplay between optimization and generalization in reinforcement learning.  ( 2 min )
    Benchmarking optimality of time series classification methods in distinguishing diffusions. (arXiv:2301.13112v1 [stat.ML])
    Performance benchmarking is a crucial component of time series classification (TSC) algorithm design, and a fast-growing number of datasets have been established for empirical benchmarking. However, the empirical benchmarks are costly and do not guarantee statistical optimality. This study proposes to benchmark the optimality of TSC algorithms in distinguishing diffusion processes by the likelihood ratio test (LRT). The LRT is optimal in the sense of the Neyman-Pearson lemma: it has the smallest false positive rate among classifiers with a controlled level of false negative rate. The LRT requires the likelihood ratio of the time series to be computable. The diffusion processes from stochastic differential equations provide such time series and are flexible in design for generating linear or nonlinear time series. We demonstrate the benchmarking with three scalable state-of-the-art TSC algorithms: random forest, ResNet, and ROCKET. Test results show that they can achieve LRT optimality for univariate time series and multivariate Gaussian processes. However, these model-agnostic algorithms are suboptimal in classifying nonlinear multivariate time series from high-dimensional stochastic interacting particle systems. Additionally, the LRT benchmark provides tools to analyze the dependence of classification accuracy on the time length, dimension, temporal sampling frequency, and randomness of the time series. Thus, the LRT with diffusion processes can systematically and efficiently benchmark the optimality of TSC algorithms and may guide their future improvements.  ( 2 min )
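    The likelihood ratio test described in the abstract above can be sketched for the simplest case: two Ornstein-Uhlenbeck processes that differ only in their mean-reversion rate. The sketch rests on assumptions made here, not on the paper: paths are simulated with an Euler-Maruyama scheme, the likelihood uses the matching Gaussian transition density, and the parameter values and function names are illustrative.

```python
import math
import random

def simulate_ou(theta, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama path of dX = -theta * X dt + sigma dW."""
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        path.append(x - theta * x * dt + noise)
    return path

def log_likelihood(path, theta, sigma, dt):
    """Log-likelihood of a path under the Euler Gaussian transition density."""
    var = sigma * sigma * dt
    total = 0.0
    for x, x_next in zip(path, path[1:]):
        resid = x_next - (x - theta * x * dt)
        total += -0.5 * math.log(2.0 * math.pi * var) - resid * resid / (2.0 * var)
    return total

def lrt_classify(path, theta0, theta1, sigma, dt, threshold=0.0):
    """Return 1 if the log-likelihood ratio favours theta1, else 0."""
    llr = (log_likelihood(path, theta1, sigma, dt)
           - log_likelihood(path, theta0, sigma, dt))
    return 1 if llr > threshold else 0

rng = random.Random(0)
theta0, theta1, sigma, dt, n_steps = 0.5, 5.0, 1.0, 0.01, 1000
correct = 0
for label, theta in [(0, theta0), (1, theta1)] * 20:
    path = simulate_ou(theta, sigma, 0.0, dt, n_steps, rng)
    correct += int(lrt_classify(path, theta0, theta1, sigma, dt) == label)
accuracy = correct / 40
print(accuracy)
```

    The accuracy obtained this way serves as the reference level against which a model-agnostic classifier trained on the same simulated paths could be compared; moving `threshold` trades false positives against false negatives.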
    Revisiting Over-smoothing and Over-squashing using Ollivier-Ricci Curvature. (arXiv:2211.15779v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have been shown to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues hinder the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature, while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.  ( 2 min )
    One-Shot Adaptation of GAN in Just One CLIP. (arXiv:2203.09301v4 [cs.CV] UPDATED)
    There are many recent research efforts to fine-tune a pre-trained generator with a few target images to generate images of a novel domain. Unfortunately, these methods often suffer from overfitting or under-fitting when fine-tuned with a single target image. To address this, here we present a novel single-shot GAN adaptation method through unified CLIP space manipulations. Specifically, our model employs a two-step training strategy: reference image search in the source generator using a CLIP-guided latent optimization, followed by generator fine-tuning with a novel loss function that imposes CLIP space consistency between the source and adapted generators. To further improve the adapted model to produce spatially consistent samples with respect to the source generator, we also propose contrastive regularization for patchwise relationships in the CLIP space. Experimental results show that our model generates diverse outputs with the target texture and outperforms the baseline models both qualitatively and quantitatively. Furthermore, we show that our CLIP space manipulation strategy allows more effective attribute editing.  ( 2 min )
    On student-teacher deviations in distillation: does it pay to disobey?. (arXiv:2301.12923v1 [cs.LG])
    Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that distillation consistently deviates in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Secondly, we find that deviations in the initial dynamics of training are not crucial -- simply switching to distillation loss in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives to understand the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace, and another as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) avoiding label noise assumptions, and (c) unifying several disjoint empirical and theoretical findings.  ( 2 min )
    Cause-Effect Inference in Location-Scale Noise Models: Maximum Likelihood vs. Independence Testing. (arXiv:2301.12930v1 [cs.LG])
    Location-scale noise models (LSNMs) are a class of heteroscedastic structural causal models with wide applicability, closely related to affine flow models. Recent likelihood-based methods designed for LSNMs that infer cause-effect relationships achieve state-of-the-art accuracy, when their assumptions are satisfied concerning the noise distributions. However, under misspecification their accuracy deteriorates sharply, especially when the conditional variance in the anti-causal direction is smaller than that in the causal direction. In this paper, we demonstrate the misspecification problem and analyze why and when it occurs. We show that residual independence testing is much more robust to misspecification than likelihood-based cause-effect inference. Our empirical evaluation includes 580 synthetic and 99 real-world datasets.  ( 2 min )
    Curvature Filtrations for Graph Generative Model Evaluation. (arXiv:2301.12906v1 [cs.LG])
    Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property of graphs, and has recently started to prove useful in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with cutting-edge methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.  ( 2 min )
    Massively Scaling Heteroscedastic Classifiers. (arXiv:2301.12860v1 [cs.LG])
    Heteroscedastic classifiers, which learn a multivariate Gaussian distribution over prediction logits, have been shown to perform well on image classification problems with hundreds to thousands of classes. However, compared to standard classifiers, they introduce extra parameters that scale linearly with the number of classes. This makes them infeasible to apply to larger-scale problems. In addition, heteroscedastic classifiers introduce a critical temperature hyperparameter which must be tuned. We propose HET-XL, a heteroscedastic classifier whose parameter count, when compared to a standard classifier, scales independently of the number of classes. In our large-scale settings, we show that we can remove the need to tune the temperature hyperparameter by directly learning it on the training data. On large image classification datasets with up to 4B images and 30k classes, our method requires 14X fewer additional parameters, does not require tuning the temperature on a held-out set, and performs consistently better than the baseline heteroscedastic classifier. HET-XL improves ImageNet 0-shot classification in a multimodal contrastive learning setup which can be viewed as a 3.5 billion class classification problem.  ( 2 min )
    How Powerful are Shallow Neural Networks with Bandlimited Random Weights?. (arXiv:2008.08427v2 [cs.LG] UPDATED)
    We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network where the hidden layer parameters are frozen with random assignment, and only the output layer parameters are trained by loss minimization. Using random weights for a hidden layer is an effective method to avoid non-convex optimization in standard gradient descent learning. It has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, in this study, we mathematically show that when hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new nontrivial approximation error lower bound. The proof utilizes the technique of ridgelet analysis, a harmonic analysis method designed for neural networks. This method is inspired by fundamental principles in classical signal processing, specifically the idea that signals with limited bandwidth may not always be able to perfectly recreate the original signal. We corroborate our theoretical results with various simulation studies, and generally, two main take-home messages are offered: (i) Not any distribution for selecting random weights is feasible to build a universal approximator; (ii) A suitable assignment of random weights exists but to some degree is associated with the complexity of the target function.  ( 2 min )
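    The random-net setup described above (hidden parameters frozen at random values, only the output layer fit by loss minimization) can be sketched as follows. This is a minimal illustration, not the paper's construction: the tanh activation, the uniform sampling range `weight_bound`, the target function, and the function names are all assumptions made for the example.

```python
import numpy as np

def fit_random_net(x, y, n_hidden=200, weight_bound=5.0, seed=0):
    """Depth-2 random net: frozen random hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    # Hidden parameters are drawn once from a bounded domain and never trained
    # (the uniform distribution and the bound are assumptions of this sketch).
    w = rng.uniform(-weight_bound, weight_bound, size=(x.shape[1], n_hidden))
    b = rng.uniform(-weight_bound, weight_bound, size=n_hidden)
    hidden = np.tanh(x @ w + b)
    # Only the output weights are fit; this is a convex least-squares problem.
    c, *_ = np.linalg.lstsq(hidden, y, rcond=None)
    return w, b, c

def predict(params, x):
    w, b, c = params
    return np.tanh(x @ w + b) @ c

# Toy regression target (hypothetical, chosen only for illustration).
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)
y = np.sin(3.0 * x[:, 0])
params = fit_random_net(x, y)
train_mse = float(np.mean((predict(params, x) - y) ** 2))
```

    Because the hidden layer is fixed, training reduces to a single linear solve, which is the sense in which random weights avoid non-convex optimization. How small the error can become depends on the sampling distribution, which is the question the abstract addresses.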
    Fair and Optimal Classification via Post-Processing Predictors. (arXiv:2211.01528v2 [cs.LG] UPDATED)
    To address the bias exhibited by machine learning models, fairness criteria impose statistical constraints for ensuring equal treatment to all demographic groups, but typically at a cost to model performance. Understanding this tradeoff, therefore, underlies the design of fair and effective algorithms. This paper completes the characterization of the inherent tradeoff of demographic parity on classification problems in the most general multigroup, multiclass, and noisy setting. Specifically, we show that the minimum error rate is given by the optimal value of a Wasserstein-barycenter problem. More practically, this reformulation leads to a simple procedure for post-processing any pre-trained predictors to satisfy demographic parity in the general setting, which, in particular, yields the optimal fair classifier when applied to the Bayes predictor. We provide suboptimality and finite sample analyses for our procedure, and demonstrate precise control of the tradeoff of error rate for fairness on real-world datasets provided sufficient data.  ( 2 min )
    Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation. (arXiv:2301.13087v1 [cs.LG])
    We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror-descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over the previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove it obtains state-of-the-art regret of $\widetilde O (K^{2/3})$.  ( 2 min )
    CAPITAL: Optimal Subgroup Identification via Constrained Policy Tree Search. (arXiv:2110.05636v3 [stat.ML] UPDATED)
    Personalized medicine, a paradigm of medicine tailored to a patient's characteristics, is an increasingly attractive field in health care. An important goal of personalized medicine is to identify a subgroup of patients, based on baseline covariates, that benefits more from the targeted treatment than other comparative treatments. Most of the current subgroup identification methods only focus on obtaining a subgroup with an enhanced treatment effect without paying attention to subgroup size. Yet, a clinically meaningful subgroup learning approach should identify the maximum number of patients who can benefit from the better treatment. In this paper, we present an optimal subgroup selection rule (SSR) that maximizes the number of selected patients, and in the meantime, achieves the pre-specified clinically meaningful mean outcome, such as the average treatment effect. We derive two equivalent theoretical forms of the optimal SSR based on the contrast function that describes the treatment-covariates interaction in the outcome. We further propose a ConstrAined PolIcy Tree seArch aLgorithm (CAPITAL) to find the optimal SSR within the interpretable decision tree class. The proposed method is flexible to handle multiple constraints that penalize the inclusion of patients with negative treatment effects, and to address time to event data using the restricted mean survival time as the clinically interesting mean outcome. Extensive simulations, comparison studies, and real data applications are conducted to demonstrate the validity and utility of our method.  ( 2 min )
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v3 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness, while predictors that perform well in the worst case tend to be overly conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of a predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm and provide: (i) a generalization bound for EQRM; and (ii) the conditions under which EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG and demonstrate that EQRM outperforms state-of-the-art baselines on datasets from WILDS and DomainBed.  ( 2 min )
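    To make the quantile objective concrete, the sketch below scores two hypothetical predictors by the empirical $\alpha$-quantile of their per-domain risks. It is not the EQRM algorithm from the paper; the risk values, the names `risks_a` and `risks_b`, and the use of a plain empirical quantile with NumPy's default interpolation are assumptions for illustration only.

```python
import numpy as np

def quantile_risk(per_domain_risks, alpha):
    """Empirical alpha-quantile of a predictor's risks across training domains."""
    risks = np.asarray(per_domain_risks, dtype=float)
    return float(np.quantile(risks, alpha))

# Hypothetical per-domain risks for two predictors.
risks_a = [0.10, 0.12, 0.11, 0.40]  # lower on most domains, poor on one
risks_b = [0.20, 0.21, 0.19, 0.22]  # uniformly mediocre

median_a = quantile_risk(risks_a, 0.5)
median_b = quantile_risk(risks_b, 0.5)
high_a = quantile_risk(risks_a, 0.9)
high_b = quantile_risk(risks_b, 0.9)
worst_a = quantile_risk(risks_a, 1.0)
```

    In this toy case, predictor A is preferred at the median while predictor B is preferred at $\alpha = 0.9$, and at $\alpha = 1$ the empirical quantile is simply the worst observed domain risk. This is the sense in which $\alpha$ interpolates between typical-case and worst-case behavior.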
    Likelihood-Free Frequentist Inference: Confidence Sets with Correct Conditional Coverage. (arXiv:2107.03920v5 [stat.ML] UPDATED)
    Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions of complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, particularly outside asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce confidence sets with correct conditional coverage for small sample sizes. This paper unifies classical statistics with modern machine learning to present (i) a practical procedure for the Neyman construction of confidence sets with finite-sample guarantees of nominal coverage, and (ii) diagnostics that estimate conditional coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that defines a test statistic, like the likelihood ratio, can leverage the LF2I machinery to create valid confidence sets and diagnostics without costly Monte Carlo samples at fixed parameter settings. We study the power of two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our paper discusses the benefits and challenges of LF2I, with a breakdown of the sources of errors in LF2I confidence sets.  ( 2 min )
    Scalable Set Encoding with Universal Mini-Batch Consistency and Unbiased Full Set Gradient Approximation. (arXiv:2208.12401v3 [cs.LG] UPDATED)
    Recent work on mini-batch consistency (MBC) for set functions has brought attention to the need for sequentially processing and aggregating chunks of a partitioned set while guaranteeing the same output for all partitions. However, existing constraints on MBC architectures lead to models with limited expressive power. Additionally, prior work has not addressed how to deal with large sets during training when the full set gradient is required. To address these issues, we propose a Universally MBC (UMBC) class of set functions which can be used in conjunction with arbitrary non-MBC components while still satisfying MBC, enabling a wider range of function classes to be used in MBC settings. Furthermore, we propose an efficient MBC training algorithm which gives an unbiased approximation of the full set gradient and has a constant memory overhead for any set size for both train- and test-time. We conduct extensive experiments including image completion, text classification, unsupervised clustering, and cancer detection on high-resolution images to verify the efficiency and efficacy of our scalable set encoding framework.  ( 2 min )
    Interpolating between BSDEs and PINNs: deep learning for elliptic and parabolic boundary value problems. (arXiv:2112.03749v2 [math.NA] UPDATED)
    Solving high-dimensional partial differential equations is a recurrent challenge in economics, science and engineering. In recent years, a great number of computational approaches have been developed, most of them relying on a combination of Monte Carlo sampling and deep learning based approximation. For elliptic and parabolic problems, existing methods can broadly be classified into those resting on reformulations in terms of $\textit{backward stochastic differential equations}$ (BSDEs) and those aiming to minimize a regression-type $L^2$-error ($\textit{physics-informed neural networks}$, PINNs). In this paper, we review the literature and suggest a methodology based on the novel $\textit{diffusion loss}$ that interpolates between BSDEs and PINNs. Our contribution opens the door towards a unified understanding of numerical approaches for high-dimensional PDEs, as well as for implementations that combine the strengths of BSDEs and PINNs. The diffusion loss furthermore bears close similarities to $\textit{(least squares) temporal difference}$ objectives found in reinforcement learning. We also discuss eigenvalue problems and perform extensive numerical studies, including calculations of the ground state for nonlinear Schrödinger operators and committor functions relevant in molecular dynamics.  ( 2 min )
    A theory of continuous generative flow networks. (arXiv:2301.12594v1 [cs.LG])
    Generative flow networks (GFlowNets) are amortized variational inference algorithms that are trained to sample from unnormalized target distributions over compositional objects. A key limitation of GFlowNets until this time has been that they are restricted to discrete spaces. We present a theory for generalized GFlowNets, which encompasses both existing discrete GFlowNets and ones with continuous or hybrid state spaces, and perform experiments with two goals in mind. First, we illustrate critical points of the theory and the importance of various assumptions. Second, we empirically demonstrate how observations about discrete GFlowNets transfer to the continuous case and show strong results compared to non-GFlowNet baselines on several previously studied tasks. This work greatly widens the perspectives for the application of GFlowNets in probabilistic inference and various modeling settings.  ( 2 min )
    Intrinsic Bayesian Optimisation on Complex Constrained Domain. (arXiv:2301.12581v1 [stat.ML])
    Motivated by the success of Bayesian optimisation algorithms in the Euclidean space, we propose a novel approach to construct Intrinsic Bayesian optimisation (In-BO) on manifolds with a primary focus on complex constrained domains or irregular-shaped spaces arising as submanifolds of $\mathbb{R}^2$, $\mathbb{R}^3$ and beyond. Data may be collected in a spatial domain but restricted to a complex or intricately structured region corresponding to a geographic feature, such as lakes. Traditional Bayesian Optimisation (Tra-BO) defined with a Radial basis function (RBF) kernel cannot accommodate these complex constrained conditions. The In-BO uses the Sparse Intrinsic Gaussian Processes (SIn-GP) surrogate model to take into account the geometric structure of the manifold. SIn-GPs are constructed using the heat kernel of the manifold, which is estimated as the transition density of the Brownian Motion on manifolds. The efficiency of In-BO is demonstrated through simulation studies on a U-shaped domain, a bitten torus, and a real dataset from the Aral Sea. Its performance is compared to that of traditional BO, which is defined in Euclidean space.  ( 2 min )
    Are Random Decompositions all we need in High Dimensional Bayesian Optimisation?. (arXiv:2301.12844v1 [cs.LG])
    Learning decompositions of expensive-to-evaluate black-box functions promises to scale Bayesian optimisation (BO) to high-dimensional problems. However, the success of these techniques depends on finding proper decompositions that accurately represent the black-box. While previous works learn those decompositions based on data, we investigate data-independent decomposition sampling rules in this paper. We find that data-driven learners of decompositions can be easily misled towards local decompositions that do not hold globally across the search space. Then, we formally show that a random tree-based decomposition sampler exhibits favourable theoretical guarantees that effectively trade off maximal information gain and functional mismatch between the actual black-box and its surrogate as provided by the decomposition. Those results motivate the development of the random decomposition upper-confidence bound algorithm (RDUCB) that is straightforward to implement - (almost) plug-and-play - and, surprisingly, yields significant empirical gains compared to the previous state-of-the-art on a comprehensive set of benchmarks. We also confirm the plug-and-play nature of our modelling component by integrating our method with HEBO, showing improved practical gains in the highest dimensional tasks from Bayesmark.  ( 2 min )
    PAC-Bayesian Soft Actor-Critic Learning. (arXiv:2301.12776v1 [cs.LG])
    Actor-critic algorithms address the dual goals of reinforcement learning, policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the approximation errors of the critic on the actor. We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that the online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. We observe our resulting algorithm to compare favorably to the state of the art on multiple classical control and locomotion tasks in both sample efficiency and asymptotic performance.  ( 2 min )
    Bagging Provides Assumption-free Stability. (arXiv:2301.12600v1 [stat.ML])
    Bagging is an important technique for stabilizing machine learning models. In this paper, we derive a finite-sample guarantee on the stability of bagging for any model with bounded outputs. Our result places no assumptions on the distribution of the data, on the properties of the base algorithm, or on the dimensionality of the covariates. Our guarantee applies to many variants of bagging and is optimal up to a constant.  ( 2 min )
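    As a reminder of the procedure being analyzed, the sketch below bags a base algorithm with bounded outputs by averaging its predictions over bootstrap resamples. It is a minimal illustration under stated assumptions: the 1-nearest-neighbour base learner, the toy data, and the names `bagged_predict` and `nearest_neighbor` are hypothetical and not taken from the paper.

```python
import numpy as np

def bagged_predict(base_fit_predict, x_train, y_train, x_test, n_bags=200, seed=0):
    """Average the base algorithm's predictions over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample n points with replacement
        preds.append(base_fit_predict(x_train[idx], y_train[idx], x_test))
    return np.mean(preds, axis=0)

def nearest_neighbor(x_train, y_train, x_test):
    """1-NN on scalar inputs: outputs lie in [0, 1] but can flip when one point changes."""
    dist = np.abs(x_test[:, None] - x_train[None, :])
    return y_train[np.argmin(dist, axis=1)]

# Toy data (hypothetical): binary labels with a threshold at 0.5.
data_rng = np.random.default_rng(1)
x_train = data_rng.uniform(0.0, 1.0, size=50)
y_train = (x_train > 0.5).astype(float)
x_test = np.linspace(0.0, 1.0, 11)
bagged = bagged_predict(nearest_neighbor, x_train, y_train, x_test)
```

    The averaged prediction stays within the same bounds as the base outputs. The abstract's guarantee concerns how little this average can change when a training point is altered, which the sketch does not attempt to verify.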
    Imbalanced Mixed Linear Regression. (arXiv:2301.12559v1 [stat.ML])
    We consider the problem of mixed linear regression (MLR), where each observed sample belongs to one of $K$ unknown linear models. In practical applications, the proportions of the $K$ components are often imbalanced. Unfortunately, most MLR methods do not perform well in such settings. Motivated by this practical challenge, in this work we propose Mix-IRLS, a novel, simple and fast algorithm for MLR with excellent performance on both balanced and imbalanced mixtures. In contrast to popular approaches that recover the $K$ models simultaneously, Mix-IRLS does it sequentially using tools from robust regression. Empirically, Mix-IRLS succeeds in a broad range of settings where other methods fail. These include imbalanced mixtures, small sample sizes, presence of outliers, and an unknown number of models $K$. In addition, Mix-IRLS outperforms competing methods on several real-world datasets, in some cases by a large margin. We complement our empirical results by deriving a recovery guarantee for Mix-IRLS, which highlights its advantage on imbalanced mixtures.  ( 2 min )
    Compression, Generalization and Learning. (arXiv:2301.12767v1 [cs.LG])
    A compression function is a map that slims down an observational set into a subset of reduced size, while preserving its informational content. In multiple applications, the condition that one new observation makes the compressed set change is interpreted as meaning that this observation brings in extra information and, in learning theory, this corresponds to misclassification, or misprediction. In this paper, we lay the foundations of a new theory that allows one to keep control of the probability of change of compression (called the "risk"). We identify conditions under which the cardinality of the compressed set is a consistent estimator for the risk (without any upper limit on the size of the compressed set) and prove unprecedentedly tight bounds to evaluate the risk under a generally applicable condition of preference. All results are usable in a fully agnostic setup, without requiring any a priori knowledge of the probability distribution of the observations. Not only do these results offer valid support for developing trust in observation-driven methodologies, they also play a fundamental role in learning techniques as a tool for hyper-parameter tuning.  ( 2 min )
    On Second-Order Scoring Rules for Epistemic Uncertainty Quantification. (arXiv:2301.12736v1 [cs.LG])
    It is well known that accurate probabilistic predictors can be trained through empirical risk minimisation with proper scoring rules as loss functions. While such learners capture so-called aleatoric uncertainty of predictions, various machine learning methods have recently been developed with the goal to let the learner also represent its epistemic uncertainty, i.e., the uncertainty caused by a lack of knowledge and data. An emerging branch of the literature proposes the use of a second-order learner that provides predictions in terms of distributions on probability distributions. However, recent work has revealed serious theoretical shortcomings for second-order predictors based on loss minimisation. In this paper, we generalise these findings and prove a more fundamental result: There seems to be no loss function that provides an incentive for a second-order learner to faithfully represent its epistemic uncertainty in the same manner as proper scoring rules do for standard (first-order) learners. As a main mathematical tool to prove this result, we introduce the generalised notion of second-order scoring rules.  ( 2 min )
    Machine Learning with High-Cardinality Categorical Features in Actuarial Applications. (arXiv:2301.12710v1 [stat.ML])
    High-cardinality categorical features are pervasive in actuarial data (e.g. occupation in commercial property insurance). Standard categorical encoding methods like one-hot encoding are inadequate in these settings. In this work, we present a novel _Generalised Linear Mixed Model Neural Network_ ("GLMMNet") approach to the modelling of high-cardinality categorical features. The GLMMNet integrates a generalised linear mixed model in a deep learning framework, offering the predictive power of neural networks and the transparency of random effects estimates, the latter of which cannot be obtained from the entity embedding models. Further, its flexibility to deal with any distribution in the exponential dispersion (ED) family makes it widely applicable to many actuarial contexts and beyond. We illustrate and compare the GLMMNet against existing approaches in a range of simulation experiments as well as in a real-life insurance case study. Notably, we find that the GLMMNet often outperforms or at least performs comparably with an entity embedded neural network, while providing the additional benefit of transparency, which is particularly valuable in practical applications. Importantly, while our model was motivated by actuarial applications, it can have wider applicability. The GLMMNet would suit any applications that involve high-cardinality categorical variables and where the response cannot be sufficiently modelled by a Gaussian distribution.  ( 2 min )
    Kernelized Cumulants: Beyond Kernel Mean Embeddings. (arXiv:2301.12466v1 [stat.ML])
    In $\mathbb R^d$, it is well-known that cumulants provide an alternative to moments that can achieve the same goals with numerous benefits such as lower variance estimators. In this paper we extend cumulants to reproducing kernel Hilbert spaces (RKHS) using tools from tensor algebras and show that they are computationally tractable by a kernel trick. These kernelized cumulants provide a new set of all-purpose statistics; the classical maximum mean discrepancy and Hilbert-Schmidt independence criterion arise as the degree one objects in our general construction. We argue both theoretically and empirically (on synthetic, environmental, and traffic data analysis) that going beyond degree one has several advantages and can be achieved with the same computational complexity and minimal overhead in our experiments.  ( 2 min )
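    For orientation, the degree-one object mentioned above, the maximum mean discrepancy (MMD), can be estimated with a kernel trick as sketched below. This illustrates only the classical MMD and not the higher-degree kernelized cumulants of the paper; the Gaussian kernel, the bandwidth, the biased (V-statistic) estimator, and the sample data are assumptions of this example.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)

# Hypothetical samples: two from the same Gaussian, one mean-shifted.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y_same = rng.normal(size=(200, 2))
y_shift = rng.normal(size=(200, 2)) + 3.0
mmd_same = mmd2_biased(x, y_same)
mmd_shift = mmd2_biased(x, y_shift)
```

    The estimate is close to zero for two samples from the same distribution and clearly larger for the shifted sample. Only kernel evaluations between sample points are needed, which is the kernel trick the abstract refers to.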
    Implicit Regularization for Group Sparsity. (arXiv:2301.12540v1 [stat.ML])
    We study the implicit regularization of gradient descent towards structured sparsity via a novel neural reparameterization, which we call a diagonally grouped linear neural network. We show the following intriguing property of our reparameterization: gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure. In contrast to many existing works in understanding implicit regularization, we prove that our training trajectory cannot be simulated by mirror descent. We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates. Compared to existing bounds for implicit sparse regularization using diagonal linear networks, our analysis with the new reparameterization shows improved sample complexity. In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression. Finally, we demonstrate the efficacy of our approach with several numerical experiments.  ( 2 min )
    Don't Play Favorites: Minority Guidance for Diffusion Models. (arXiv:2301.12334v1 [cs.LG])
    We explore the problem of generating minority samples using diffusion models. The minority samples are instances that lie on low-density regions of a data manifold. Generating sufficient numbers of such minority instances is important, since they often contain some unique attributes of the data. However, the conventional generation process of the diffusion models mostly yields majority samples (that lie on high-density regions of the manifold) due to their high likelihoods, making themselves highly ineffective and time-consuming for the task. In this work, we present a novel framework that can make the generation process of the diffusion models focus on the minority samples. We first provide a new insight on the majority-focused nature of the diffusion models: they denoise in favor of the majority samples. The observation motivates us to introduce a metric that describes the uniqueness of a given sample. To address the inherent preference of the diffusion models w.r.t. the majority samples, we further develop minority guidance, a sampling technique that can guide the generation process toward regions with desired likelihood levels. Experiments on benchmark real datasets demonstrate that our minority guidance can greatly improve the capability of generating the low-likelihood minority samples over existing generative frameworks including the standard diffusion sampler.  ( 2 min )
    On Enhancing Expressive Power via Compositions of Single Fixed-Size ReLU Network. (arXiv:2301.12353v1 [cs.LG])
    This paper studies the expressive power of deep neural networks from the perspective of function compositions. We show that repeated compositions of a single fixed-size ReLU network can produce super expressive power. In particular, we prove by construction that $\mathcal{L}_2\circ \boldsymbol{g}^{\circ r}\circ \boldsymbol{\mathcal{L}}_1$ can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(r^{-1/d})$, where $\boldsymbol{g}$ is realized by a fixed-size ReLU network, $\boldsymbol{\mathcal{L}}_1$ and $\mathcal{L}_2$ are two affine linear maps matching the dimensions, and $\boldsymbol{g}^{\circ r}$ means the $r$-times composition of $\boldsymbol{g}$. Furthermore, we extend such a result to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Our results reveal that a continuous-depth network generated via a dynamical system has good approximation power even if its dynamics function is time-independent and realized by a fixed-size ReLU network.  ( 2 min )
    3D Object Detection in LiDAR Point Clouds using Graph Neural Networks. (arXiv:2301.12519v1 [cs.CV])
    LiDAR (Light Detection and Ranging) is an advanced active remote sensing technique working on the principle of time of travel (ToT) for capturing highly accurate 3D information of the surroundings. LiDAR has gained wide attention in research and development, with the LiDAR industry expected to reach USD 2.8 billion by 2025. Although LiDAR datasets offer rich density and high spatial resolution, LiDAR data is challenging to process due to its inherent 3D geometry and massive volume. Nevertheless, such high-resolution datasets hold great potential in many applications, including 3D object detection and recognition. In this research, we propose a Graph Neural Network (GNN) based framework to learn and identify the objects in 3D LiDAR point clouds. GNNs are a class of deep learning models that learn patterns and objects based on the principle of graph learning, and they have shown success in various 3D computer vision tasks.  ( 2 min )
    SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits. (arXiv:2301.12357v1 [stat.ML])
    In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected cumulative reward it will obtain when executed in an environment formalized as a multi-armed bandit. We focus on the linear bandit setting with heteroscedastic reward noise. This is the first work that focuses on such an optimal data collection strategy for policy evaluation involving heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting that reduces the MSE of the target policy. We term this policy-weighted least squares estimation and use this formulation to derive the optimal behavior policy for data collection. We then propose a novel algorithm SPEED (Structured Policy Evaluation Experimental Design) that tracks the optimal behavior policy and derive its regret with respect to the optimal behavior policy. Finally, we empirically validate that SPEED leads to policy evaluation with mean squared error comparable to the oracle strategy and significantly lower than simply running the target policy.  ( 2 min )
    On Heterogeneous Treatment Effects in Heterogeneous Causal Graphs. (arXiv:2301.12383v1 [stat.ME])
    Heterogeneity and comorbidity are two interwoven challenges associated with various healthcare problems that have greatly hampered research on developing effective treatments and understanding the underlying neurobiological mechanisms. Very few studies have been conducted to investigate heterogeneous causal effects (HCEs) in graphical contexts due to the lack of statistical methods. To characterize this heterogeneity, we first conceptualize heterogeneous causal graphs (HCGs) by generalizing the causal graphical model with confounder-based interactions and multiple mediators. Such confounders with an interaction with the treatment are known as moderators. This allows us to flexibly produce HCGs given different moderators and explicitly characterize HCEs from the treatment or potential mediators on the outcome. We establish the theoretical forms of HCEs and derive their properties at the individual level in both linear and nonlinear models. An interactive structural learning method is developed to estimate the complex HCGs and HCEs, with confidence intervals provided. Our method is empirically justified by extensive simulations, and its practical usefulness is illustrated by exploring causality among psychiatric disorders for trauma survivors.  ( 2 min )
    On Learning Necessary and Sufficient Causal Graphs. (arXiv:2301.12389v1 [cs.LG])
    The causal revolution has spurred interest in understanding complex relationships in various fields. Most existing methods aim to discover causal relationships among all variables in a large-scale complex graph. However, in practice, only a small number of variables in the graph are relevant for the outcomes of interest. As a result, causal estimation with the full causal graph -- especially given limited data -- could lead to many falsely discovered, spurious variables that may be highly correlated with but have no causal impact on the target outcome. In this paper, we propose to learn a class of necessary and sufficient causal graphs (NSCG) that only contains causally relevant variables for an outcome of interest, which we term causal features. The key idea is to utilize probabilities of causation to systematically evaluate the importance of features in the causal graph, allowing us to identify a subgraph that is relevant to the outcome of interest. To learn NSCG from data, we develop a score-based necessary and sufficient causal structural learning (NSCSL) algorithm, by establishing theoretical relationships between probabilities of causation and causal effects of features. Across empirical studies of simulated and real data, we show that the proposed NSCSL algorithm outperforms existing algorithms and can reveal important yeast genes for target heritable traits of interest.  ( 2 min )
    Multi-task Highly Adaptive Lasso. (arXiv:2301.12029v1 [stat.ML])
    We propose a novel, fully nonparametric approach for multi-task learning, the Multi-task Highly Adaptive Lasso (MT-HAL). MT-HAL simultaneously learns features, samples and task associations important for the common model, while imposing a shared sparse structure among similar tasks. Given multiple tasks, our approach automatically finds a sparse sharing structure. The proposed MTL algorithm attains a powerful dimension-free convergence rate of $o_p(n^{-1/4})$ or better. We show that MT-HAL outperforms sparsity-based MTL competitors across a wide range of simulation studies, including settings with nonlinear and linear relationships, varying levels of sparsity and task correlations, and different numbers of covariates and sample sizes.  ( 2 min )
    Variational Latent Branching Model for Off-Policy Evaluation. (arXiv:2301.12056v1 [cs.LG])
    Model-based methods have recently shown great potential for off-policy evaluation (OPE); offline trajectories induced by behavioral policies are fitted to transitions of Markov decision processes (MDPs), which are used to roll out simulated trajectories and estimate the performance of policies. Model-based OPE methods face two key challenges. First, as offline trajectories are usually fixed, they tend to cover limited state and action space. Second, the performance of model-based methods can be sensitive to the initialization of their parameters. In this work, we propose the variational latent branching model (VLBM) to learn the transition function of MDPs by formulating the environmental dynamics as a compact latent space, from which the next states and rewards are then sampled. Specifically, VLBM leverages and extends the variational inference framework with recurrent state alignment (RSA), which is designed to capture as much of the information underlying the limited training data as possible by smoothing out the information flow between the variational (encoding) and generative (decoding) parts of VLBM. Moreover, we introduce a branching architecture to improve the model's robustness against randomly initialized model weights. The effectiveness of VLBM is evaluated on the deep OPE (DOPE) benchmark, whose training trajectories are designed to result in varied coverage of the state-action space. We show that VLBM generally outperforms existing state-of-the-art OPE methods.  ( 2 min )
    Decentralized Entropic Optimal Transport for Privacy-preserving Distributed Distribution Comparison. (arXiv:2301.12065v1 [cs.LG])
    Privacy-preserving distributed distribution comparison measures the distance between the distributions whose data are scattered across different agents in a distributed system and cannot be shared among the agents. In this study, we propose a novel decentralized entropic optimal transport (EOT) method, which provides a privacy-preserving and communication-efficient solution to this problem with theoretical guarantees. In particular, we design a mini-batch randomized block-coordinate descent (MRBCD) scheme to optimize the decentralized EOT distance in its dual form. The dual variables are scattered across different agents and updated locally and iteratively with limited communications among partial agents. The kernel matrix involved in the gradients of the dual variables is estimated by a distributed kernel approximation method, and each agent only needs to approximate and store a sub-kernel matrix by one-shot communication and without sharing raw data. We analyze our method's communication complexity and provide a theoretical bound for the approximation error caused by the convergence error, the approximated kernel, and the mismatch between the storage and communication protocols. Experiments on synthetic data and real-world distributed domain adaptation tasks demonstrate the effectiveness of our method.  ( 2 min )
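For readers unfamiliar with entropic optimal transport, here is a minimal sketch of the standard centralized Sinkhorn iteration on two small histograms. It is not the decentralized MRBCD scheme from the abstract above; the histograms, cost matrix, `epsilon`, and iteration count are all illustrative assumptions.

```python
import math


def sinkhorn(a, b, cost, epsilon=0.1, iterations=200):
    """Entropic OT plan between histograms a and b via Sinkhorn scaling."""
    n, m = len(a), len(b)
    # Gibbs kernel: small cost -> large kernel entry.
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale so the plan's row sums match a and column sums match b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Illustrative data: two bins, unit cost for moving mass between them.
a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost)
transport_cost = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
```

With a small `epsilon`, the plan moves roughly a quarter of the mass from the first bin to the second, which is the least that the two marginals allow.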
    Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. (arXiv:2301.12083v1 [cs.LG])
    Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes its exponential decay; it, therefore, is readily applicable to applications with slower mixing times. Nonetheless, it achieves a convergence rate comparable to the state-of-the-art AC algorithms. We experimentally show that these alleviated restrictions on the technical conditions required for stability translate to superior performance in practice for RL problems with sparse rewards.  ( 2 min )
    Inference on the Optimal Assortment in the Multinomial Logit Model. (arXiv:2301.12254v1 [stat.ML])
    Assortment optimization has received active explorations in the past few decades due to its practical importance. Despite the extensive literature dealing with optimization algorithms and latent score estimation, uncertainty quantification for the optimal assortment still needs to be explored and is of great practical significance. Instead of estimating and recovering the complete optimal offer set, decision makers may only be interested in testing whether a given property holds true for the optimal assortment, such as whether they should include several products of interest in the optimal set, or how many categories of products the optimal set should include. This paper proposes a novel inferential framework for testing such properties. We consider the widely adopted multinomial logit (MNL) model, where we assume that each customer will purchase an item within the offered products with a probability proportional to the underlying preference score associated with the product. We reduce inferring a general optimal assortment property to quantifying the uncertainty associated with the sign change point detection of the marginal revenue gaps. We show the asymptotic normality of the marginal revenue gap estimator, and construct a maximum statistic via the gap estimators to detect the sign change point. By approximating the distribution of the maximum statistic with multiplier bootstrap techniques, we propose a valid testing procedure. We also conduct numerical experiments to assess the performance of our method.  ( 2 min )
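The MNL choice rule in the abstract above can be sketched in a few lines. This is only an illustration: the no-purchase option with score `v0` is a common convention assumed here, not something the abstract states, and the product names, scores, and prices are made up.

```python
def mnl_choice_probs(scores, v0=1.0):
    """Purchase probabilities proportional to preference scores.

    scores: dict mapping each offered product to a positive preference score.
    v0: score of the no-purchase option (assumption; use 0.0 to force a purchase).
    """
    total = v0 + sum(scores.values())
    return {item: v / total for item, v in scores.items()}


def expected_revenue(scores, prices, v0=1.0):
    """Expected revenue when exactly the products in `scores` are offered."""
    probs = mnl_choice_probs(scores, v0)
    return sum(prices[item] * p for item, p in probs.items())


# Hypothetical assortment of two products.
offered = {"a": 2.0, "b": 1.0}
prices = {"a": 10.0, "b": 4.0}
probs = mnl_choice_probs(offered)
revenue = expected_revenue(offered, prices)
```

Comparing `expected_revenue` with and without a given product is the kind of marginal revenue gap whose sign the paper tests. Here the gap is computed from known scores rather than estimated.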
    ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts. (arXiv:2301.12171v1 [cs.CV])
    The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we present a cost-effective strategy using text-prompt learning that keeps the entire CLIP module frozen while fully leveraging its rich information. Specifically, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport, which allows each text prompt to efficiently focus on specific semantic attributes. Additionally, we propose Deep Local Feature Alignment (DLFA), which deeply aligns the text prompts with intermediate local features of the frozen image encoder layers and significantly boosts zero-shot segmentation performance. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance with 7x fewer parameters than previous SOTA approaches.  ( 2 min )
    STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning. (arXiv:2301.12038v1 [cs.LG])
    Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, seeks to do so by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions which prohibit its use in many practical instances. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal, which, under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: STEin information dirEcted exploration for model-based Reinforcement LearnING. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates of information-augmented MBRL, IDS included. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.  ( 2 min )
    Minimizing Trajectory Curvature of ODE-based Generative Models. (arXiv:2301.12003v1 [cs.LG])
    Recent ODE/SDE-based generative models, such as diffusion models and flow matching, define a generative process as a time reversal of a fixed forward process. Even though these models show impressive performance on large-scale datasets, numerical simulation requires multiple evaluations of a neural network, leading to a slow sampling speed. We attribute the reason to the high curvature of the learned generative trajectories, as it is directly related to the truncation error of a numerical solver. Based on the relationship between the forward process and the curvature, here we present an efficient method of training the forward process to minimize the curvature of generative trajectories without any ODE/SDE simulation. Experiments show that our method achieves a lower curvature than previous models and, therefore, decreased sampling costs while maintaining competitive performance. Code is available at https://github.com/sangyun884/fast-ode.  ( 2 min )
    Quantum Ridgelet Transform: Winning Lottery Ticket of Neural Networks with Quantum Computation. (arXiv:2301.11936v1 [quant-ph])
    Ridgelet transform has been a fundamental mathematical tool in the theoretical studies of neural networks. However, the practical applicability of ridgelet transform to conducting learning tasks was limited since its numerical implementation by conventional classical computation requires an exponential runtime $\exp(O(D))$ as data dimension $D$ increases. To address this problem, we develop a quantum ridgelet transform (QRT), which implements the ridgelet transform of a quantum state within a linear runtime $O(D)$ of quantum computation. As an application, we also show that one can use QRT as a fundamental subroutine for quantum machine learning (QML) to efficiently find a sparse trainable subnetwork of large shallow wide neural networks without conducting large-scale optimization of the original network. This application discovers an efficient way in this regime to demonstrate the lottery ticket hypothesis on finding such a sparse trainable neural network. These results open an avenue of QML for accelerating learning tasks with commonly used classical neural networks.  ( 2 min )
    Alignment with human representations supports robust few-shot learning. (arXiv:2301.11990v1 [cs.LG])
    Should we care whether AI systems have representations of the world that are similar to those of humans? We provide an information-theoretic analysis that suggests that there should be a U-shaped relationship between the degree of representational alignment with humans and performance on few-shot learning tasks. We confirm this prediction empirically, finding such a relationship in an analysis of the performance of 491 computer vision models. We also show that highly-aligned models are more robust to both adversarial attacks and domain shifts. Our results suggest that human-alignment is often a sufficient, but not necessary, condition for models to make effective use of limited data, be robust, and generalize well.  ( 2 min )
    Leveraging Importance Weights in Subset Selection. (arXiv:2301.12052v1 [cs.LG])
    We present a subset selection algorithm designed to work with arbitrary model families in a practical batch setting. In such a setting, an algorithm can sample examples one at a time but, in order to limit overhead costs, is only able to update its state (i.e. further train model weights) once a large enough batch of examples is selected. Our algorithm, IWeS, selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. IWeS admits significant performance improvement compared to other subset selection algorithms for seven publicly available datasets. Additionally, it is competitive in an active learning setting, where the label information is not available at selection time. We also provide an initial theoretical analysis to support our importance weighting approach, proving generalization and sampling rate bounds.  ( 2 min )
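A rough sketch of entropy-driven importance sampling in the spirit of the abstract above. The abstract does not give IWeS's exact sampling probability, so the normalized-entropy rule below is an assumption made for illustration, as are all function names and the toy predictions.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def sampling_probability(probs):
    """Normalized entropy in [0, 1] (assumed rule): uncertain examples are kept more often."""
    return entropy(probs) / math.log(len(probs))


def select_batch(predictions, batch_size, rng):
    """Scan examples one at a time and keep each with its sampling probability.

    Returns (index, importance_weight) pairs, where the weight is 1 / probability.
    """
    selected = []
    for idx, probs in enumerate(predictions):
        p = sampling_probability(probs)
        if p > 0 and rng.random() < p:
            selected.append((idx, 1.0 / p))
        if len(selected) == batch_size:
            break
    return selected


# Toy model outputs for four unlabeled examples (two classes each).
predictions = [[0.5, 0.5], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
batch = select_batch(predictions, batch_size=2, rng=random.Random(0))
```

The importance weights would then scale each selected example's loss when the model is retrained on the batch, so that rarely sampled examples are not under-represented.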
    Reduced-Order Autodifferentiable Ensemble Kalman Filters. (arXiv:2301.11961v1 [stat.ML])
    This paper introduces a computational framework to reconstruct and forecast a partially observed state that evolves according to an unknown or expensive-to-simulate dynamical system. Our reduced-order autodifferentiable ensemble Kalman filters (ROAD-EnKFs) learn a latent low-dimensional surrogate model for the dynamics and a decoder that maps from the latent space to the state space. The learned dynamics and decoder are then used within an ensemble Kalman filter to reconstruct and forecast the state. Numerical experiments show that if the state dynamics exhibit a hidden low-dimensional structure, ROAD-EnKFs achieve higher accuracy at lower computational cost compared to existing methods. If such structure is not expressed in the latent state dynamics, ROAD-EnKFs achieve similar accuracy at lower cost, making them a promising approach for surrogate state reconstruction and forecasting.  ( 2 min )

  • Open

    [R] Are there any Machine Learning Journals that accept Viewpoint Papers (~1500+ words)?
    Basically the title. I have a sequence of two papers - a viewpoint and a complete paper in the works - that I'm looking to submit, the viewpoint outlining the theoretical premise for the latter. I've currently had no luck finding any ML-specific journals that allow viewpoint submissions (with the exception of simply posting to arXiv), and was wondering if anyone here was familiar with any. Thanks :D submitted by /u/Adi-Dewan [link] [comments]  ( 42 min )
    [P] Python wrapper of [ New AI classifier for indicating AI-written text from openai tool ]
    OpenAI is developing a new tool to help distinguish between AI-written and human-written text. Here is an unofficial Python wrapper of the OpenAI model to detect whether a text was written by #chatgpt, #gpt3, #gpt, etc. GitHub: https://github.com/promptslab/openai-detector https://preview.redd.it/f45ggu45tgfa1.png?width=1122&format=png&auto=webp&s=4cb5ae70d7194cc3c070f3ad2dcbc968a804d4a3 submitted by /u/StoicBatman [link] [comments]  ( 42 min )
    Introducing NoRef-ER: A Multi-Language Referenceless ASR Metric (on HuggingFace) [R] [P]
    I am proud to announce the release of NoRefER, a multi-language referenceless ASR metric based on a fine-tuned language model, for public use on HuggingFace. This metric allows for evaluating the outputs of ASR models without needing a reference transcript, making it a valuable tool for a/b testing multiple ASR models or model versions, or even ensembling their outputs. ASR is an important technology with various applications, but the quality of ASR systems can vary greatly. It's important to accurately evaluate and compare the performance of different ASR models, traditionally done using reference-based ASR quality evaluation metrics. However, obtaining those ground-truth transcriptions from human annotators is time-consuming and costly. Referenceless quality evaluation is becoming impo…  ( 44 min )
    [N] Vincent Warmerdam: Calmcode, Explosion, Open Source and Data Science | Learning From Machine Learning #2
    https://www.youtube.com/watch?v=yvgxRzqx1Jg Contents 00:00 Learning from Machine Learning Intro 00:21 Vincent Warmerdam Intro 01:18 Career Journey 03:25 What roles have you played? 05:44 Academic Background: Operations Research and Design 06:52 Operations Research 08:13 Mathematics 09:19 What attracted you to Machine Learning? 10:40 Calmcode 14:08 Calmcode, Do you use it? 15:22 Calmcode, *args, **kwargs 16:23 If there were no constraints, what would you do to improve calmcode? 18:10 Open Source Projects: bulk, embetter, human-learn 19:10 Open Source: evol, scikit-lego 20:00 Rasa: Chatbots, Benchmarking 20:47 Unit Tests 21:42 Open Source: Creating Packages 24:10 Bulk, human-learn 26:20 27:03 Bulk in a notebook, bulk as a webapp 27:45 Human in the loop 29:03 Understanding the problem; Beans, Beef and Bread 32:56 Algorithm on the wrong problem 34:55 Module Improvement vs System Improvement 37:20 Does your answer make sense? 39:04 What's an important question that you believe remains unanswered in ML? 41:48 How do you view the gap between the hype and reality of AI? 46:28 Generative Models vs. Predictive Models 49:18 Jumping to solutions 50:08 Model vs. System 50:48 51:10 Who has influenced you in the field? 55:18 Humble, Caring Presenters 56:38 What's one piece of advice that you've received that's helped you? 01:00:18 Advice for people just starting in the field 01:03:15 What has a career in machine learning taught you about life? 01:05:16 SpaCy 01:06:10 Data-Centric Approach 01:06:50 Wrap-up 01:07:15 Follow, Explosion 01:07:48 Outro submitted by /u/NLPnerd [link] [comments]  ( 43 min )
    [D] Generative Model For Facts Extraction
    Is it possible to fine-tune a generative model (like T5) to do something like this: { inputs: "XYZ XYZ was born in ABC. They now live in DEF.", targets: "XYZ born in ABC XYZ lives in DEF" }, like the transformer model from this paper? If so, how should I go about approaching the problem? Is this task as simple as feeding it the inputs and targets, or do you think there is more to it? submitted by /u/Zetsu-Eiyu-O [link] [comments]  ( 43 min )
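One way to picture the data side of the question above: a minimal sketch that turns (text, facts) pairs into text-to-text records. The task prefix, the " | " separator between facts, and the JSON Lines layout are hypothetical choices, not requirements of T5 or of any particular training script.

```python
import json


def make_example(text, facts, prefix="extract facts: "):
    """Build one text-to-text training record.

    The prefix and the " | " separator are illustrative assumptions.
    """
    return {"inputs": prefix + text, "targets": " | ".join(facts)}


def to_jsonl(examples):
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


example = make_example(
    "XYZ was born in ABC. They now live in DEF.",
    ["XYZ born in ABC", "XYZ lives in DEF"],
)
jsonl = to_jsonl([example])
```

An explicit separator between facts gives the model an unambiguous boundary to learn and makes the generated string easy to split back into individual facts.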
    [D] Open-source auto-ml services
    What are some good open-source AutoML services, mainly for image classification, that are similar to Google's Vertex AI? submitted by /u/binaryshrey [link] [comments]  ( 42 min )
    [N] Monitor OpenAI API Latency, Tokens, Rate Limits, and More with Graphsignal
    Relying on hosted inference with LLMs in production, such as via the OpenAI API, has some challenges. The use of APIs should be designed around unstable latency, rate limits, token counts, costs, etc. To make this observable, we've built tracing and monitoring specifically for AI apps. For example, the OpenAI Python library is monitored automatically, with no need to do anything. We'll be adding support for more libraries. Here is a blog post with more info and screenshots: Monitor OpenAI API Latency, Tokens, Rate Limits, and More. And the GitHub repo. submitted by /u/l0g1cs [link] [comments]  ( 42 min )
    [P] Fine Tuning Whisper in another language
    Hi all, I'm trying to fine-tune Whisper to transcribe Albanian speech to text, but I don't know what the dataset for training a Whisper model should look like. I already have audio recordings and the transcript for each audio file, but I need to know how to reformat them into a valid dataset for training Whisper. Thanks in advance! submitted by /u/ruizard [link] [comments]  ( 43 min )
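One common way to pair audio files with transcripts is a metadata file next to the clips. The sketch below assumes the Hugging Face `datasets` "audiofolder" layout, which reads a `metadata.csv` with a `file_name` column. The `transcription` column name, the clip names, and the sample sentences are hypothetical, and other training pipelines may expect a different format.

```python
import csv
import io


def build_metadata(pairs):
    """Return CSV text with one row per (audio file, transcript) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_name", "transcription"])
    for file_name, transcript in pairs:
        writer.writerow([file_name, transcript.strip()])
    return buffer.getvalue()


# Hypothetical clip names and Albanian transcripts.
pairs = [
    ("clip_0001.wav", "Përshëndetje, si jeni?"),
    ("clip_0002.wav", "Faleminderit shumë."),
]
metadata_csv = build_metadata(pairs)
```

Writing `metadata_csv` to a `metadata.csv` file in the same folder as the clips would give a directory that an audiofolder-style loader can read as (audio, text) examples.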
    [D] Have researchers given up on traditional machine learning methods?
    This may be a silly question for those familiar with the field, but do machine learning researchers no longer see prospects in traditional methods (by "traditional" I mean anything other than deep learning)? I feel that most of the time when people talk about machine learning today, they are referring to deep learning, but is it the same in the academic world? Have people who were studying traditional methods switched to neural networks? I know that many researchers are excited about deep learning, but I am wondering what they think about other methods. submitted by /u/fujidaiti [link] [comments]  ( 52 min )
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Is it possible to find a specific group of layers where vanishing gradients start and cascade?
    I was curious to know whether particular layers could be causing gradients to vanish, with these vanishing gradients cascading down to subsequent lower layers, how to find these layers, and whether any research has been done on this. submitted by /u/V1bicycle [link] [comments]  ( 41 min )
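A common diagnostic for the question above is to record the gradient magnitude at every layer after a backward pass and look for where it collapses. Below is a toy sketch on a chain of scalar sigmoid layers with hand-written backpropagation. The network, the weights, and the `1e-6` threshold are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_gradients(weights, x):
    """Return |d output / d w_k| for each layer of a chain of scalar sigmoid layers."""
    # Forward pass: remember each layer's input and pre-activation.
    inputs, pre_acts = [], []
    a = x
    for w in weights:
        inputs.append(a)
        z = w * a
        pre_acts.append(z)
        a = sigmoid(z)
    # Backward pass: walk from the output layer toward the input layer.
    grads = [0.0] * len(weights)
    upstream = 1.0  # d output / d activation of the current layer
    for k in reversed(range(len(weights))):
        s = sigmoid(pre_acts[k])
        local = s * (1.0 - s)  # sigmoid derivative, never larger than 0.25
        grads[k] = abs(upstream * local * inputs[k])
        upstream *= local * weights[k]
    return grads


def first_vanishing_layer(grads, threshold=1e-6):
    """Scanning from the output backward, return the index of the first layer
    whose gradient magnitude falls below `threshold`, or None."""
    for k in reversed(range(len(grads))):
        if grads[k] < threshold:
            return k
    return None


grads = layer_gradients(weights=[1.0] * 12, x=0.5)
culprit = first_vanishing_layer(grads)
```

In a real framework the same idea applies to per-layer gradient norms logged during training: a sharp drop between two adjacent layers points at the layer where the shrinkage starts, and every layer below it inherits the small gradient.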
    Feature selection
    Hello, hoping I can get a simple question answered. How are features selected for neural nets where each instance in time has too much data to take in? Let’s say a 4K movie is going to play and we want to know if the next 3 seconds will contain X. We also don’t have a computer strong enough to take in all data at all times. Let’s say 20% of data is maxing our system. I have watched a ton of videos about how layers work but how does the NN take samples to put those samples into its filter? Is it up to the programmer to find clever ways of filtering down the data into certain “indicators” or something? submitted by /u/Joebone87 [link] [comments]  ( 41 min )
    Biologically plausible neural networks?
    So I'm wondering if you could possibly design an ANN to simulate the action potentials of neurons and train it to do various tasks. I would hope that you can accurately simulate specific patterns of neural activity when running a task. I'm sure this has already been done, but I'm wondering how big of a task it is to accomplish. Thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 42 min )
    Recurrent neural network in python (keras) error: ValueError: `logits` and `labels` must have the same shape, received ((None, 90, 1) vs (None,))
    I'm developing a recurrent neural network in python using keras to do binary classification on roulette wheel data. I'm trying to compile my code but it's crashing, could you help me fix the code please? Here is my code:
        from keras.models import Sequential
        from keras.layers import Dense, Dropout
        from sklearn.preprocessing import MinMaxScaler
        import numpy as np
        import pandas as pd
        columns = ['data', 'resultado']
        base = pd.read_csv("blaze_values_27_01_2023_VERMELHO_1.csv", header = None, names = columns)
        base = base.dropna()
        base_treinamento = base.iloc[:, 1:2]
        normalizador = MinMaxScaler(feature_range=[0,1])
        base_treinamento_normalizada = normalizador.fit_transform(base_treinamento)
        previsores = []
        saida_real = []
        for i in range (90,1809):
            previsores.append(base_treinamento_normalizada[i-9…  ( 42 min )
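The model definition in the post above is cut off, so the cause can only be guessed. One frequent source of this exact shape mismatch is applying `Dense` layers directly to 3-D windows: Keras `Dense` acts on the last axis only, so the time axis survives to the output. The sketch below reproduces just the shape arithmetic in plain Python; the helper names and the 100-unit hidden layer are hypothetical.

```python
def dense_output_shape(input_shape, units):
    """Dense acts on the last axis only, so leading axes pass through."""
    return input_shape[:-1] + (units,)


def flatten_output_shape(input_shape):
    """Flatten keeps the batch axis and merges every remaining axis."""
    size = 1
    for dim in input_shape[1:]:
        size *= dim
    return (input_shape[0], size)


windows = (None, 90, 1)  # (batch, time steps, features), as in the error message

# Dense layers applied directly to the 3-D windows keep the time axis.
direct = dense_output_shape(dense_output_shape(windows, 100), 1)

# Flattening first yields a single prediction per window.
flattened = dense_output_shape(flatten_output_shape(windows), 1)
```

If that is what is happening here, collapsing the time axis before the final `Dense(1)` gives one output per window. That can be done with a `Flatten` layer or with a recurrent layer that returns only its last state, and the result lines up with one label per window.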
  • Open

    Snoop Dogg Giving a speech about Zombies | AI Animation
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Accurate and Explainable Image-based Prediction Using a Lightweight Generative Model"
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    100 000 cells simulated, how many will we need to form something that looks like consciousness?
    submitted by /u/blob_evol_sim [link] [comments]  ( 41 min )
    NeuralShare - make any AI model accessible for free from any device, regardless of its hardware resources
    Hello humans! A project is about to start (https://github.com/neuralshare) that will allow any device, even one with a low-end CPU, to use and interact with AI models it otherwise could not run because they require so many hardware resources. The project is called NeuralShare and will use the Stellar network to achieve its goal, as you can read in the README on GitHub at the link in this post. More specifically, it will use the Futurenet network and the NEUR token, also on Futurenet, which was created a little while ago, has no real value and no speculative objective, and serves only the purpose of the project. We plan to distribute this token as an airdrop, totally free of course, and anyone who holds it will be able to use and interact with GPT-3 or other AI models for free simply by sending a NEUR transaction with a text memo attached containing the prompt. In return, you will receive a transaction with a text memo containing the response. However, this will only be possible once there are enough nodes (see GitHub to understand why); the precise number of nodes that must be active before a response can be received is still to be defined. For more details and updates, you can join the Discord; the links are in the README file on GitHub, along with more details on how this method works, which will also allow you to use GPT-3 for free and without an API key. submitted by /u/0ut0flin3 [link] [comments]  ( 43 min )
    AI Related Newsletter
    I do not even remember signing up for an AI newsletter but I got an issue today from “AINow” that was actually pretty insightful/informative. Thought I’d share it as I see frequent posts asking where ppl get their info/news from. Newsletter Here submitted by /u/iwjahshehbs [link] [comments]  ( 40 min )
    AI Still Feels Artificial. What Are We Missing?
    submitted by /u/jrowley [link] [comments]  ( 40 min )
    Stable Diffusion + Dream Fusion + Text-to-Motion. This animation has been made in 5 minutes with the AI-Game Development platform I'm building. No coding or design skills needed, just text prompt engineering. Assets exportable in Unity. Seeking alpha testers
    submitted by /u/SpeaKrLipSync [link] [comments]  ( 41 min )
    This ended super convincing: O Captain! My Captain by Benedict Cumberbatch (ElevenLabs)
    submitted by /u/citizentim [link] [comments]  ( 41 min )
    OpenAI releases AI text detector for ChatGPT and other models
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    Anthropic's Claude: Ex-OpenAI Employees Launch ChatGPT Rival
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    What is the open-source SOTA for voice conversion and/or voice cloning for the Indic languages?
    Lots of ok-ish voice cloning and conversion tools are available on the market, but most have an American English tone to them. What is the open-source SOTA for voice conversion and/or voice cloning for the Indic languages? submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    [Searchcolab] BREAKING!! An update to ChatGPT just launched.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What tech jobs will be safe from AI over the next decade?
    I'm studying computer networking but I'm certain that AI can easily set up virtual networks and cloud computing solutions on their own. Should I have followed my dream to become a bricklayer? submitted by /u/black_linux_guy [link] [comments]  ( 41 min )
    AI Music Videos
    Does anybody know how people on Instagram and TikTok are creating these AI music videos? I see them everywhere but they gatekeep whatever they’re using. submitted by /u/mynameisbob1011 [link] [comments]  ( 40 min )
    📌[Searchcolab] "Gotham during Recession" Link in comments.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Where do you go for news updates and insights? Not Including Here
    Hello all! I am new to this subreddit, but I have begun the fascinating descent into learning about all things AI and its ilk. I would like your help. I want to turn on the firehose and consume all the content I can. - What are your favorite news sources and media creators? - How do you find the latest tools and research papers? - How do you see where the conversation is going and stay updated with new opinions? This would be incredibly helpful, and I hope I posted right. submitted by /u/WaffleHouseBaby [link] [comments]  ( 41 min )
    AI Certifications
    Hi All -- I've been in IT for about 15 years. I started as a sysadmin, segued into dev work, and have been specializing in systems & data integration and business process automation for the past 5ish years. I also studied Computer Science for 4 years but never graduated (that's a whole other sob story -- won't bore anyone with that). I've been considering the direction I want to go for the next leg of my career and I feel like AI/ML is the logical next step. I've started building a portfolio of my AI/ML projects and I'd like to pick up a cert or two to complement them. My ultimate goal is solutions architecture, but I love the engineering side of things and expect to start there. What I want to know is what certifications everyone would recommend -- I've been eyeing the IBM-sponsored certificate tracks on Coursera but I've seen a few others and they all have their merits. I'm just not sure how much value they carry in the job market. Are there any worth staying well away from? submitted by /u/am_i_the_rabbit [link] [comments]  ( 41 min )
    DHT(BitTorrent) network that replaces websites/dns
    Soon we will have very powerful assistants. They can interact with databases directly. E.g., you ask for news articles from the BBC. Assistants don't need a human-readable URL; they can run SQL queries against a database at 192.168.1.1 with ease and then present the data in an appropriate format. The DHT is just for discovery. submitted by /u/nikitastaf1996 [link] [comments]  ( 41 min )
    ChatGPT Could Destroy Google In A Few Years, According To Gmail’s Creator
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Princeton computer science professor says don’t panic over ‘bullshit generator’ ChatGPT
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    A tool to run GPT3 responses on your offline data and custom data sources
    At DocuStack, our goal is to help organisations turn their internal knowledge into an answering bot, saving time and increasing efficiency for teams. If you're familiar with the challenges of managing internal knowledge and answering repetitive questions, I believe you will find DocuStack to be a valuable tool. With features such as a searchable database, customisable answers, multi-language support, and offline file upload capabilities, DocuStack makes it easy to manage your internal knowledge and provide quick, accurate answers to your team. Here is the link: https://www.docustack.ai/chat This is our first launch of a wider product for customer support. We would love to hear your thoughts, and do not forget to join the waitlist if you want to try the beta version. submitted by /u/titansaurabh [link] [comments]  ( 41 min )
    Science and Engineering Fair Project regarding AI
    Hello everyone. I'm passionate about AI and just computer science in general. I would like to participate in a Science and Engineering Fair, but I want to have an idea of what I'm going to do first. What can I do involving AI and do very well in the fair? If not AI, then computer science in general. Thank you submitted by /u/Sufficient-Ad-8881 [link] [comments]  ( 42 min )
    The generative AI revolution has begun—how did we get here? | "A new class of incredibly powerful AI models has made recent breakthroughs possible."
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    [D] I have been working in UI development for around 12 years having worked in pretty much everything that is there in UI. But lately I feel AI is most definitely something that has the max scope and potential in the future. So my question is how do I get into the field of AI? what would be roadmap?
    submitted by /u/anuratya [link] [comments]  ( 43 min )
  • Open

    Learning an action using PPO reinforcement learning that is also a negative reward?
    I’m doing RL on a problem where I learn 2 actions and my reward = action 1 - action 2. Since action 2 is subtracted, the agent learns to output a value of 0.0 for action 2 (the action space for both actions is between 0.0 and 0.1). Can someone please advise how I can make the agent explore non-zero values for action 2? submitted by /u/HonestScratch1827 [link] [comments]  ( 41 min )
    Multi-Agent RL for Ranged Army Combat Micro-Management (Like Dragon PvP Fight in StarCraft)
    I would like to invite interested people to collaborate on this hobby project of mine. It is still at an early stage, and I believe it can be significantly improved together. The GitHub repository link is here: https://github.com/kayuksel/multi-rl-crowd-sim Note: The difference from StarCraft is that Dragons can hide behind each other. Their hitting strength also decreases in proportion to their loss of health. https://preview.redd.it/wrpcaz782dfa1.png?width=640&format=png&auto=webp&s=1dede69acb78e874a80bd532af85b269c7117f9f submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Autotuned temperature for SAC
    Has anyone ever monitored the behavior of alpha in autotuned SAC ? I implemented it and it seems to work, but I would be interested in seeing a commented graph of the evolution of alpha during the learning process, and I could not find a contribution including this. submitted by /u/Scrimbibete [link] [comments]  ( 41 min )
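For readers who want to produce such a curve themselves, here is a minimal sketch of a temperature update of the kind commonly used in autotuned SAC, with alpha recorded after every step. It assumes the usual objective J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)], optimized over log(alpha) by plain gradient descent. The function names, the learning rate, and the toy log-probabilities are hypothetical and are not taken from the post.

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on loss = -log_alpha * mean(log_prob + target_entropy)."""
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -mean_term  # derivative of the loss with respect to log_alpha
    return log_alpha - lr * grad


def track_alpha(batches, target_entropy, log_alpha=0.0, lr=0.01):
    """Run the update over a sequence of log-probability batches and record alpha."""
    history = [math.exp(log_alpha)]
    for log_probs in batches:
        log_alpha = update_log_alpha(log_alpha, log_probs, target_entropy, lr)
        history.append(math.exp(log_alpha))
    return history


if __name__ == "__main__":
    # Toy data (hypothetical): policy entropy (-mean log_prob) is below the target,
    # so alpha should rise step by step.
    toy_batches = [[2.0, 2.0]] * 5
    for step, alpha in enumerate(track_alpha(toy_batches, target_entropy=-1.0)):
        print(step, round(alpha, 4))
```

Plotting the returned history against the training step gives the alpha curve. In this formulation alpha grows while the policy entropy is below the target and shrinks once it exceeds it.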
    Odd Reward behavior
    Hi all, I'm training an agent (to control a platform to maintain attitude), but I'm having problems understanding the following behavior. With R = A - penalty, I thought adding 1.0 would increase the cumulative reward, but that's not the case: with R1 = A - penalty + 1.0, R1 ends up being less than R. In light of this, I multiplied penalty by 10 to see what happens: R2 = A - 10.0*penalty. This increases the cumulative reward (R2 > R). Note that 'A' and 'penalty' are always positive values. Any idea what this means (and how to go about shaping R)? submitted by /u/XecutionStyle [link] [comments]  ( 46 min )
  • Open

    Cyberpunk 2077 Brings a Taste of the Future With DLSS
    Analyst reports. Academic papers. Ph.D. programs. There are a lot of places you can go to get a glimpse of the future. But the best place might just be El Coyote Cojo, a whiskey-soaked dive bar that doesn’t exist in real life. Fire up Cyberpunk 2077 and you’ll see much more than the watering hole’s […]  ( 6 min )
    Broadcaster ‘Nilson1489’ Shares Livestreaming Techniques and More This Week ‘In the NVIDIA Studio’
    Broadcasters have an arsenal of new features and technologies at their disposal: the eighth-generation NVIDIA video encoder on RTX 40 Series GPUs with support for the open AV1 video-coding format; new NVIDIA Broadcast app effects like Eye Contact and Vignette; and support for AV1 streaming in Discord.  ( 7 min )
  • Open

    New AI classifier for indicating AI-written text
    We’re launching a classifier trained to distinguish between AI-written and human-written text. We’ve trained a classifier to distinguish between text written by a human and text written by AIs from a variety of providers. While it is impossible to reliably detect all AI-written text, we believe  ( 3 min )
  • Open

    DSC Weekly 31 January 2023 – Data Models for the Weather
    With January coming to an end, we here in the Northeast let out a collective sigh of relief as the month ends without any major snowstorms that tend to happen in the first month of the year. Weather forecasting is a centuries-old practice that has its roots in divination… The post DSC Weekly 31 January 2023 – Data Models for the Weather appeared first on Data Science Central.  ( 19 min )
    Explaining FAIR Data to Aunt Doris
    I’m sure you’ve run into this situation yourself. You’re at a family gathering, and someone at the table asks you exactly what you do for a living. Maybe it’s your uncle, a grandparent, or a child. You try to describe in simple terms what you do, but they get a mystified expression on their face.… The post Explaining FAIR Data to Aunt Doris appeared first on Data Science Central.  ( 21 min )
    Java in Cloud Native Environment: All You Need To Know
    Java has been a prevalent programming language. Even today, it remains one of the top three most-used languages for developing enterprise software. New cloud-native Java runtimes must provide developers with the following four significant benefits. It helps build cloud-native, microservices, and serverless Java applications: Traditional Java applications run as containers on hardware servers that control… The post Java in Cloud Native Environment: All You Need To Know appeared first on Data Science Central.  ( 20 min )
    NIST Artificial Intelligence Risk Management Framework
    The National Institute of Standards and Technology (NIST) has released its Artificial Intelligence Risk Management Framework (AI RMF 1.0), a guidance document for voluntary use by organizations designing, developing, deploying, or using AI systems to help manage the many risks of AI technologies. The NIST AI Risk standards provide a practical and bipartisan perspective for… The post NIST Artificial Intelligence Risk Management Framework appeared first on Data Science Central.  ( 19 min )
    Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II
    In the previous article, we looked at two Ever-Successful NFL teams, the Kansas City Chiefs and the San Francisco 49ers, who seem to be able to win consistently even while things change around them and players and coaches come and go. Then, we looked at two Never-Successful teams, the Arizona Cardinals and the Cleveland Browns,… The post Exploding vs. Imploding: What the NFL Has to Teach Us About Managing Agile Enterprises, Part II appeared first on Data Science Central.  ( 26 min )
  • Open

    Avoid having to integrate by parts twice
    Suppose f(x) and g(x) are functions that are each proportional to their second derivative. These include exponential, circular, and hyperbolic functions. Then the integral of f(x) g(x) can be computed in closed form with a moderate amount of work. The first time you see how such integrals are computed, it’s an interesting trick. I wrote […] Avoid having to integrate by parts twice first appeared on John D. Cook.  ( 5 min )
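The closed form can be written out as a short derivation. This is a sketch under the assumption that f'' = a f and g'' = b g for constants a ≠ b. The symbols a and b are introduced here for illustration and may differ from the notation used in the linked post.

```latex
\begin{align*}
\frac{d}{dx}\bigl(f g' - f' g\bigr) &= f g'' - f'' g = (b - a)\, f g, \\
\int f(x)\, g(x)\, dx &= \frac{f g' - f' g}{b - a} + C .
\end{align*}
```

For example, with f(x) = e^x (a = 1) and g(x) = sin x (b = -1), the formula gives (e^x cos x - e^x sin x) / (-2) = e^x (sin x - cos x) / 2, which matches the result of integrating by parts twice.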
  • Open

    Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion. (arXiv:2301.11757v1 [cs.CL])
    The recent surge in popularity of diffusion models for image generation has brought new attention to the potential of these models in other areas of media synthesis. One area that has yet to be fully explored is the application of diffusion models to music generation. Music generation requires handling multiple aspects, including the temporal dimension, long-term structure, multiple layers of overlapping sounds, and nuances that only trained listeners can detect. In our work, we investigate the potential of diffusion models for text-conditional music generation. We develop a cascading latent diffusion approach that can generate multiple minutes of high-quality stereo music at 48kHz from textual descriptions. For each model, we make an effort to maintain reasonable inference speed, targeting real-time on a single consumer GPU. In addition to trained models, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: - Music samples for this paper: https://bit.ly/anonymous-mousai - All music samples for all models: https://bit.ly/audio-diffusion - Code: https://github.com/archinetai/audio-diffusion-pytorch  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology coined FedPop by recasting personalised FL into the population modeling paradigm where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performances on various personalised federated learning tasks.  ( 2 min )
    Input Perturbation Reduces Exposure Bias in Diffusion Models. (arXiv:2301.11706v1 [cs.LG])
    Denoising Diffusion Probabilistic Models have shown an impressive generation quality, although their long sampling chain leads to high computational costs. In this paper, we observe that a long sampling chain also leads to an error accumulation phenomenon, which is similar to the exposure bias problem in autoregressive text generation. Specifically, we note that there is a discrepancy between training and testing, since the former is conditioned on the ground truth samples, while the latter is conditioned on the previously generated results. To alleviate this problem, we propose a very simple but effective training regularization, which consists of perturbing the ground truth samples to simulate the inference time prediction errors. We empirically show that the proposed input perturbation leads to a significant improvement of the sample quality while reducing both the training and the inference times. For instance, on CelebA 64×64, we achieve a new state-of-the-art FID score of 1.27, while saving 37.5% of the training time.  ( 2 min )
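A minimal sketch of how such a regularizer could look in a DDPM-style training step. It assumes the perturbation is applied as extra Gaussian noise, scaled by a hypothetical strength `gamma`, when forming the network input, while the regression target stays the unperturbed noise `eps`. The abstract does not spell out these details, so the function name, the parameter names, and the default `gamma=0.1` are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def perturbed_training_pair(x0, alpha_bar_t, gamma=0.1, rng=random):
    """Return (network_input, target_noise) for one clean sample x0 (a list of floats).

    alpha_bar_t is the cumulative noise-schedule product at the sampled timestep.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # noise the network is trained to predict
    xi = [rng.gauss(0.0, 1.0) for _ in x0]   # extra perturbation (assumed form)
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    y_t = [signal * x + noise * (e + gamma * p) for x, e, p in zip(x0, eps, xi)]
    return y_t, eps


if __name__ == "__main__":
    y_t, eps = perturbed_training_pair([0.5, -1.0, 2.0], alpha_bar_t=0.7)
    print(y_t, eps)
```

With `gamma=0` the function reduces to the standard DDPM forward-noising step, which makes the perturbation easy to switch off for comparison.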
    Learning Visual Representations for Transfer Learning by Suppressing Texture. (arXiv:2011.01901v3 [cs.CV] UPDATED)
    Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information. In self-supervised learning in particular, texture as a low-level cue may provide shortcuts that prevent the network from learning higher level representations. To address these problems we propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture. This simple method helps retain important edge information and suppress texture at the same time. We empirically show that our method achieves state-of-the-art results on object detection and image classification with eight diverse datasets in either supervised or self-supervised learning tasks such as MoCoV2 and Jigsaw. Our method is particularly effective for transfer learning tasks and we observed improved performance on five standard transfer learning datasets. The large improvements (up to 11.49%) on the Sketch-ImageNet dataset, DTD dataset and additional visual analyses with saliency maps suggest that our approach helps in learning better representations that better transfer.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underlying fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Normality-Guided Distributional Reinforcement Learning for Continuous Control. (arXiv:2208.13125v2 [cs.LG] UPDATED)
    Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) methods instead model the value distribution, which has been shown to improve performance in many settings. In this paper, we model the value distribution as approximately normal using the Markov Chain central limit theorem. We analytically compute quantile bars to provide a new DRL target that is informed by the decrease in standard deviation that occurs over the course of an episode. In addition, we propose a policy update strategy based on uncertainty as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds and show that our methods produce performance improvements in continuous control tasks.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues that are particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Challenging Common Assumptions in Convex Reinforcement Learning. (arXiv:2202.01511v3 [cs.LG] UPDATED)
    The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives can be proved to coincide and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in the convex RL setting. In particular, we show that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.  ( 2 min )
    Bayesian Self-Supervised Contrastive Learning. (arXiv:2301.11673v1 [cs.LG])
    Recent years have witnessed many successful applications of contrastive learning in diverse domains, yet its self-supervised version still poses many exciting challenges. As the negative samples are drawn from unlabeled datasets, a randomly selected sample may actually be a false negative to an anchor, leading to incorrect encoder training. This paper proposes a new self-supervised contrastive loss called the BCL loss that still uses random samples from the unlabeled data while correcting the resulting bias with importance weights. The key idea is to design the desired sampling distribution for sampling hard true negative samples under the Bayesian framework. The prominent advantage is that the desired sampling distribution has a parametric structure, with a location parameter for debiasing false negatives and a concentration parameter for mining hard negatives. Experiments validate the effectiveness and superiority of the BCL loss.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter ρ → ∞. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rate regimes. We validate our analysis via a thorough empirical study, also showing that, unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.  ( 2 min )
    Can We Faithfully Represent Absence States to Compute Shapley Values on a DNN?. (arXiv:2105.10719v3 [cs.LG] UPDATED)
    Although many methods have been proposed to estimate attributions of input variables, there still exists a significant theoretical flaw in masking-based attribution methods, i.e., it is hard to examine whether the masking method faithfully represents the absence of input variables. Specifically, for masking-based attributions, setting an input variable to the baseline value is a typical way of representing the absence of the variable. However, there are no studies investigating how to represent the absence of input variables and verify the faithfulness of baseline values. Therefore, we revisit the feature representation of a DNN in terms of causality, and propose to use causal patterns to examine whether the masking method faithfully removes information encoded in input variables. More crucially, it is proven that the causality can be explained as the elementary rationale of the Shapley value. Furthermore, we define the optimal baseline value from the perspective of causality, and we propose a method to learn the optimal baseline value. Experimental results have demonstrated the effectiveness of our method.  ( 2 min )
    Theoretical Analysis of Offline Imitation With Supplementary Dataset. (arXiv:2301.11687v1 [cs.LG])
    Behavioral cloning (BC) can recover a good policy from abundant expert data, but may fail when expert data is insufficient. This paper considers a situation where, besides the small amount of expert data, a supplementary dataset is available, which can be collected cheaply from sub-optimal policies. Imitation learning with a supplementary dataset is an emergent practical framework, but its theoretical foundation remains under-developed. To advance understanding, we first investigate a direct extension of BC, called NBCU, that learns from the union of all available data. Our analysis shows that, although NBCU suffers an imitation gap that is larger than BC in the worst case, there exist special cases where NBCU performs as well as or better than BC. This discovery implies that noisy data can also be helpful if utilized carefully. Therefore, we further introduce a discriminator-based importance sampling technique to re-weight the supplementary data, proposing the WBCU method. With our newly developed landscape-based analysis, we prove that WBCU can outperform BC under mild conditions. Empirical studies show that WBCU simultaneously achieves the best performance on two challenging tasks where prior state-of-the-art methods fail.  ( 2 min )
    Interpreting learning in biological neural networks as zero-order optimization method. (arXiv:2301.11777v1 [cs.LG])
    Recently, significant progress has been made regarding the statistical understanding of artificial neural networks (ANNs). ANNs are motivated by the functioning of the brain, but differ in several crucial aspects. In particular, it is biologically implausible that the learning of the brain is based on gradient descent. In this work we look at the brain as a statistical method for supervised learning. The main contribution is to relate the local updating rule of the connection parameters in biological neural networks (BNNs) to a zero-order optimization method.  ( 2 min )
    Reinforcement Learning from Diverse Human Preferences. (arXiv:2301.11774v1 [cs.LG])
    The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models, yet we have still to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), no conclusive study has quantified this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (the hyperparameters and inputs used to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.
    CADet: Fully Self-Supervised Out-Of-Distribution Detection With Contrastive Learning. (arXiv:2210.01742v2 [cs.LG] UPDATED)
    Handling out-of-distribution (OOD) samples has become a major stake in the real-world deployment of machine learning systems. This work explores the application of self-supervised contrastive learning to the simultaneous detection of two types of OOD samples: unseen classes and adversarial perturbations. Since in practice the distribution of such samples is not known in advance, we do not assume access to OOD examples. We first show that similarity functions trained with contrastive learning can be leveraged with the maximum mean discrepancy (MMD) two-sample test to verify whether two independent sets of samples are drawn from the same distribution. Inspired by this approach, we introduce CADet (Contrastive Anomaly Detection), a method based on contrastive transformations to perform anomaly detection on single samples. CADet compares favorably to adversarial detection methods to detect adversarially perturbed samples on ImageNet. Simultaneously, it achieves comparable performance to unseen label detection methods on two challenging benchmarks: ImageNet-O and iNaturalist. CADet is fully self-supervised and requires neither labels for in-distribution samples nor access to OOD examples.
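    As a rough illustration of the MMD two-sample statistic mentioned above, the following minimal sketch computes a biased estimate of squared MMD between two sample sets. It assumes a fixed Gaussian (RBF) kernel in place of the learned contrastive similarity function, and the names `rbf_kernel`, `mmd_squared`, and `bandwidth` are hypothetical; it is not CADet's actual procedure.

```python
import math

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    k_xx = sum(kernel(a, b) for a in xs for b in xs) / (len(xs) ** 2)
    k_yy = sum(kernel(a, b) for a in ys for b in ys) / (len(ys) ** 2)
    k_xy = sum(kernel(a, b) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy

# Two samples from visibly different 1-D distributions give a large statistic.
near = [[0.0], [1.0]]
far = [[5.0], [6.0]]
stat_same = mmd_squared(near, near)
stat_diff = mmd_squared(near, far)
```

    A full test would compare the statistic against a threshold, for example one obtained by permuting the pooled samples; that step is omitted here.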
    Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization. (arXiv:2212.10445v2 [cs.LG] UPDATED)
    Foundation models are redefining how AI systems are built. Practitioners now follow a standard procedure to build their machine learning solutions: from a pre-trained foundation model, they fine-tune the weights on the target task of interest. So, the Internet is swarmed by a handful of foundation models fine-tuned on many diverse tasks: these individual fine-tunings exist in isolation without benefiting from each other. In our opinion, this is a missed opportunity, as these specialized models contain rich and diverse features. In this paper, we thus propose model ratatouille, a new strategy to recycle the multiple fine-tunings of the same foundation model on diverse auxiliary tasks. Specifically, we repurpose these auxiliary weights as initializations for multiple parallel fine-tunings on the target task; then, we average all fine-tuned weights to obtain the final model. This recycling strategy aims at maximizing the diversity in weights by leveraging the diversity in auxiliary tasks. Empirically, it improves the state of the art on the reference DomainBed benchmark for out-of-distribution generalization. Looking forward, this work contributes to the emerging paradigm of updatable machine learning where, akin to open-source software development, the community collaborates to reliably update machine learning models.
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
    Adapting Step-size: A Unified Perspective to Analyze and Improve Gradient-based Methods for Adversarial Attacks. (arXiv:2301.11546v1 [cs.LG])
    Learning adversarial examples can be formulated as an optimization problem of maximizing the loss function with some box-constraints. However, for solving this induced optimization problem, the state-of-the-art gradient-based methods such as FGSM, I-FGSM and MI-FGSM look different from their original methods especially in updating the direction, which makes it difficult to understand them and then leaves some theoretical issues to be addressed in viewpoint of optimization. In this paper, from the perspective of adapting step-size, we provide a unified theoretical interpretation of these gradient-based adversarial learning methods. We show that each of these algorithms is in fact a specific reformulation of their original gradient methods but using the step-size rules with only current gradient information. Motivated by such analysis, we present a broad class of adaptive gradient-based algorithms based on the regular gradient methods, in which the step-size strategy utilizing information of the accumulated gradients is integrated. Such adaptive step-size strategies directly normalize the scale of the gradients rather than use some empirical operations. The important benefit is that convergence for the iterative algorithms is guaranteed and then the whole optimization process can be stabilized. The experiments demonstrate that our AdaI-FGM consistently outperforms I-FGSM and AdaMI-FGM remains competitive with MI-FGSM for black-box attacks.
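    For readers unfamiliar with the baselines named above, the sign-gradient update behind FGSM and I-FGSM can be sketched as follows. This is a minimal pure-Python illustration of the commonly described update under an L-infinity budget, not the paper's adaptive AdaI-FGM; the function names, the `grad_fn` callback, and the [0, 1] box are assumptions made for the example.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm_step(x, grad, step_size, x_orig, epsilon, lo=0.0, hi=1.0):
    """One sign-gradient ascent step, clipped to the epsilon-ball and the box."""
    out = []
    for xi, gi, x0 in zip(x, grad, x_orig):
        xi = xi + step_size * sign(gi)                 # move along the gradient sign
        xi = max(x0 - epsilon, min(x0 + epsilon, xi))  # stay within epsilon of x_orig
        xi = max(lo, min(hi, xi))                      # respect the box constraint
        out.append(xi)
    return out

def i_fgsm(x_orig, grad_fn, epsilon, step_size, num_steps):
    """Iterated variant; FGSM is the case step_size=epsilon, num_steps=1."""
    x = list(x_orig)
    for _ in range(num_steps):
        x = fgsm_step(x, grad_fn(x), step_size, x_orig, epsilon)
    return x

# Toy loss L(x) = w . x, whose gradient with respect to x is the constant w.
w = [1.0, -2.0, 0.0]
x_adv = i_fgsm([0.5, 0.5, 0.5], lambda x: w, epsilon=0.1, step_size=0.05, num_steps=4)
```

    Because only the sign of each gradient coordinate is used, every coordinate moves by the same fixed amount per step regardless of gradient magnitude, which is the step-size behavior the abstract sets out to reinterpret.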
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.
    Uplink Scheduling in Federated Learning: an Importance-Aware Approach via Graph Representation Learning. (arXiv:2301.11903v1 [cs.NI])
    Federated Learning (FL) has emerged as a promising framework for distributed training of AI-based services, applications, and network procedures in 6G. One of the major challenges affecting the performance and efficiency of 6G wireless FL systems is the massive scheduling of user devices over resource-constrained channels. In this work, we argue that the uplink scheduling of FL client devices is a problem with a rich relational structure. To address this challenge, we propose a novel, energy-efficient, and importance-aware metric for client scheduling in FL applications by leveraging Unsupervised Graph Representation Learning (UGRL). Our proposed approach introduces a relational inductive bias in the scheduling process and does not require the collection of training feedback information from client devices, unlike state-of-the-art importance-aware mechanisms. We evaluate our proposed solution against baseline scheduling algorithms based on recently proposed metrics in the literature. Results show that, when considering scenarios of nodes exhibiting spatial relations, our approach can achieve an average gain of up to 10% in model accuracy and up to 17 times in energy efficiency compared to state-of-the-art importance-aware policies.
    Incorporating Background Knowledge in Symbolic Regression using a Computer Algebra System. (arXiv:2301.11919v1 [cs.LG])
    Symbolic Regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for more human understanding of the structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov-chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from experimental, historical datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order of magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search of expressions. We find that incorporating these constraints in Bayesian SR (as the Bayesian prior) works better than modifying the fitness function in the GA.
    Synth-by-Reg (SbR): Contrastive learning for synthesis-based registration of paired images. (arXiv:2107.14449v3 [cs.CV] UPDATED)
    Nonlinear inter-modality registration is often challenging due to the lack of objective functions that are good proxies for alignment. Here we propose a synthesis-by-registration method to convert this problem into an easier intra-modality task. We introduce a registration loss for weakly supervised image translation between domains that does not require perfectly aligned training data. This loss capitalises on a registration U-Net with frozen weights, to drive a synthesis CNN towards the desired translation. We complement this loss with a structure preserving constraint based on contrastive learning, which prevents blurring and content shifts due to overfitting. We apply this method to the registration of histological sections to MRI slices, a key step in 3D histology reconstruction. Results on two different public datasets show improvements over registration based on mutual information (13% reduction in landmark error) and synthesis-based algorithms such as CycleGAN (11% reduction), and are comparable to a registration CNN with label supervision. Code and data are publicly available at https://github.com/acasamitjana/SynthByReg.
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stake decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and does neither require reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.
    Communication-Efficient Learning of Deep Networks from Decentralized Data. (arXiv:1602.05629v4 [cs.LG] UPDATED)
    Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning. We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10-100x as compared to synchronized stochastic gradient descent.
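    The iterative model averaging described above can be sketched in a few lines. This is a minimal illustration, assuming each client update is weighted by its local example count and using a hypothetical one-parameter least-squares model for the local step; the names `federated_average`, `local_sgd`, and `fed_avg_round` are invented for the example and do not come from the paper.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local example counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(dim)
    ]

def local_sgd(weights, data, lr, epochs):
    """Hypothetical local update: SGD on a 1-D least-squares model y ~ w * x."""
    w = weights[0]
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return [w]

def fed_avg_round(global_weights, client_datasets, lr=0.05, epochs=1):
    """One communication round: local training on each client, then averaging."""
    updates = [local_sgd(global_weights, d, lr, epochs) for d in client_datasets]
    sizes = [len(d) for d in client_datasets]
    return federated_average(updates, sizes)

# Two clients with unbalanced data drawn from y = 2x; raw data never leaves a client.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
weights = [0.0]
for _ in range(20):
    weights = fed_avg_round(weights, clients)
```

    Only the locally computed parameters are exchanged each round, so running more local epochs per round trades extra on-device computation for fewer communication rounds.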
    OccRob: Efficient SMT-Based Occlusion Robustness Verification of Deep Neural Networks. (arXiv:2301.11912v1 [cs.LG])
    Occlusion is a prevalent and easily realizable semantic perturbation to deep neural networks (DNNs). It can fool a DNN into misclassifying an input image by occluding some segments, possibly resulting in severe errors. Therefore, DNNs planted in safety-critical systems should be verified to be robust against occlusions prior to deployment. However, most existing robustness verification approaches for DNNs are focused on non-semantic perturbations and are not suited to the occlusion case. In this paper, we propose the first efficient, SMT-based approach for formally verifying the occlusion robustness of DNNs. We formulate the occlusion robustness verification problem and prove it is NP-complete. Then, we devise a novel approach for encoding occlusions as a part of neural networks and introduce two acceleration techniques so that the extended neural networks can be efficiently verified using off-the-shelf, SMT-based neural network verification tools. We implement our approach in a prototype called OccRob and extensively evaluate its performance on benchmark datasets with various occlusion variants. The experimental results demonstrate our approach's effectiveness and efficiency in verifying DNNs' robustness against various occlusions, and its ability to generate counterexamples when these DNNs are not robust.
    Projected Subnetworks Scale Adaptation. (arXiv:2301.11487v1 [cs.LG])
    Large models support great zero-shot and few-shot capabilities. However, updating these models on new tasks can break performance on previously seen tasks and on their zero/few-shot unseen tasks. Our work explores how to update zero/few-shot learners such that they can maintain performance on the seen and unseen tasks of previous tasks as well as on new tasks. By manipulating the parameter updates of a gradient-based meta learner as the projected task-specific subnetworks, we show improvements for large models to retain seen and zero/few-shot task performance in online settings.
    Streaming LifeLong Learning With Any-Time Inference. (arXiv:2301.11892v1 [cs.LG])
    Despite rapid advancements in lifelong learning (LLL) research, a large body of research mainly focuses on improving the performance in the existing static continual learning (CL) setups. These methods lack the ability to succeed in a rapidly changing dynamic environment, where an AI agent needs to quickly learn new instances in a 'single pass' from the non-i.i.d. (also possibly temporally contiguous/coherent) data streams without suffering from catastrophic forgetting. For practical applicability, we propose a novel lifelong learning approach, which is streaming, i.e., a single input sample arrives in each time step, single pass, class-incremental, and subject to be evaluated at any moment. To address this challenging setup and various evaluation protocols, we propose a Bayesian framework, that enables fast parameter update, given a single training example, and enables any-time inference. We additionally propose an implicit regularizer in the form of snap-shot self-distillation, which effectively minimizes the forgetting further. We further propose an effective method that efficiently selects a subset of samples for online memory rehearsal and employs a new replay buffer management scheme that significantly boosts the overall performance. Our empirical evaluations and ablations demonstrate that the proposed method outperforms the prior works by large margins.
    Graph-Free Learning in Graph-Structured Data: A More Efficient and Accurate Spatiotemporal Learning Perspective. (arXiv:2301.11742v1 [cs.LG])
    Spatiotemporal learning, which aims at extracting spatiotemporal correlations from collected spatiotemporal data, has been a research hotspot in recent years. Considering the inherent graph structure of spatiotemporal data, recent works focus on capturing spatial dependencies by utilizing Graph Convolutional Networks (GCNs) to aggregate vertex features with the guidance of adjacency matrices. In this paper, with extensive and in-depth experiments, we comprehensively analyze existing spatiotemporal graph learning models and reveal that extracting adjacency matrices with carefully designed strategies, which is viewed as the key to enhancing performance in graph learning, is largely ineffective. Meanwhile, based on these experiments, we also discover that the aggregation itself is more important than the way in which vertices are aggregated. Building on these preliminary findings, we propose a novel, efficient Graph-Free Spatial (GFS) learning module based on layer normalization for capturing spatial correlations in spatiotemporal graph learning. The proposed GFS module can be easily plugged into existing models to replace all graph convolution components. Rigorous theoretical proof demonstrates that the time complexity of GFS is significantly better than that of the graph convolution operation. Extensive experiments verify the superiority of GFS in terms of both efficiency and learning effect when processing graph-structured data, especially extremely large-scale graph data.
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.
    Element selection for functional materials discovery by integrated machine learning of elemental contributions to properties. (arXiv:2202.01051v2 [cond-mat.mtrl-sci] UPDATED)
    Fundamental differences between materials originate from the unique nature of their constituent chemical elements. Before specific differences emerge according to the precise ratios of elements in a given crystal structure, a material can be represented by the set of its constituent chemical elements. By working at the level of the periodic table, assessment of materials at the level of their phase fields reduces the combinatorial complexity to accelerate screening, and circumvents the challenges associated with composition-level approaches such as poor extrapolation within phase fields, and the impossibility of exhaustive sampling. This early stage discrimination combined with evaluation of novelty of phase fields aligns with the outstanding experimental challenge of identifying new areas of chemistry to investigate, by prioritising which elements to combine in a reaction. Here, we demonstrate that phase fields can be assessed with respect to the maximum expected value of a target functional property and ranked according to chemical novelty. We develop and present PhaseSelect, an end-to-end machine learning model that combines the representation, classification, regression and ranking of phase fields. First, PhaseSelect constructs elemental characteristics from the co-occurrence of chemical elements in computationally and experimentally reported materials, then it employs attention mechanisms to learn representation for phase fields and assess their functional performance. At the level of the periodic table, PhaseSelect quantifies the probability of observing a functional property, estimates its value within a phase field and also ranks a phase field novelty, which we demonstrate with significant accuracy for three avenues of materials applications for high-temperature superconductivity, high-temperature magnetism, and targeted bandgap energy.
    Is TinyML Sustainable? Assessing the Environmental Impacts of Machine Learning on Microcontrollers. (arXiv:2301.11899v1 [cs.LG])
    The sustained growth of carbon emissions and global waste elicits significant sustainability concerns for our environment's future. The growing Internet of Things (IoT) has the potential to exacerbate this issue. However, an emerging area known as Tiny Machine Learning (TinyML) has the opportunity to help address these environmental challenges through sustainable computing practices. TinyML, the deployment of machine learning (ML) algorithms onto low-cost, low-power microcontroller systems, enables on-device sensor analytics that unlocks numerous always-on ML applications. This article discusses the potential of these TinyML applications to address critical sustainability challenges. Moreover, the footprint of this emerging technology is assessed through a complete life cycle analysis of TinyML systems. From this analysis, TinyML presents opportunities to offset its carbon emissions by enabling applications that reduce the emissions of other sectors. Nevertheless, when globally scaled, the carbon footprint of TinyML systems is not negligible, necessitating that designers factor in environmental impact when formulating new devices. Finally, research directions for enabling further opportunities for TinyML to contribute to a sustainable future are outlined.
    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient. (arXiv:2301.11913v1 [cs.DC])
    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.
    Policy-Value Alignment and Robustness in Search-based Multi-Agent Learning. (arXiv:2301.11857v1 [cs.AI])
    Large-scale AI systems that combine search and learning have reached super-human levels of performance in game-playing, but have also been shown to fail in surprising ways. The brittleness of such models limits their efficacy and trustworthiness in real-world deployments. In this work, we systematically study one such algorithm, AlphaZero, and identify two phenomena related to the nature of exploration. First, we find evidence of policy-value misalignment -- for many states, AlphaZero's policy and value predictions contradict each other, revealing a tension between accurate move-selection and value estimation in AlphaZero's objective. Further, we find inconsistency within AlphaZero's value function, which causes it to generalize poorly, despite its policy playing an optimal strategy. From these insights we derive VISA-VIS: a novel method that improves policy-value alignment and value robustness in AlphaZero. Experimentally, we show that our method reduces policy-value misalignment by up to 76%, reduces value generalization error by up to 50%, and reduces average value error by up to 55%.
    Automatic Modulation Classification with Deep Neural Networks. (arXiv:2301.11773v1 [cs.LG])
    Automatic modulation classification is a desired feature in many modern software-defined radios. In recent years, a number of convolutional deep learning architectures have been proposed for automatically classifying the modulation used on observed signal bursts. However, a comprehensive analysis of these differing architectures and importance of each design element has not been carried out. Thus it is unclear what tradeoffs the differing designs of these convolutional neural networks might have. In this research, we investigate numerous architectures for automatic modulation classification and perform a comprehensive ablation study to investigate the impacts of varying hyperparameters and design elements on automatic modulation classification performance. We show that a new state of the art in performance can be achieved using a subset of the studied design elements. In particular, we show that a combination of dilated convolutions, statistics pooling, and squeeze-and-excitation units results in the strongest performing classifier. We further investigate this best performer according to various other criteria, including short signal bursts, common misclassifications, and performance across differing modulation categories and modes.
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits nor overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.
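    For context, the baseline calibration step that the abstract contrasts with, calibrating on independent hold-out data, is commonly done with split conformal prediction. The sketch below shows that standard procedure for regression with absolute-residual scores; it is not the paper's conformalized early stopping method, and the names `conformal_quantile`, `prediction_interval`, and `predict` are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def prediction_interval(predict, calib_data, x_new, alpha=0.1):
    """Split-conformal interval around predict(x_new) using absolute residuals."""
    scores = [abs(y - predict(x)) for x, y in calib_data]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q

# A deliberately biased model y_hat = 2x evaluated on hold-out data with y = 3x.
calibration = [(float(i), 3.0 * i) for i in range(1, 10)]
low, high = prediction_interval(lambda x: 2.0 * x, calibration, 10.0, alpha=0.5)
```

    The coverage guarantee of this baseline relies on the calibration points being independent of model fitting and selection, which is why reusing the early-stopping hold-out set needs the extra care the abstract describes.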
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.
    Meta-Learning Mini-Batch Risk Functionals. (arXiv:2301.11724v1 [cs.LG])
    Supervised learning typically optimizes the expected value risk functional of the loss, but in many cases, we want to optimize for other risk functionals. In full-batch gradient descent, this is done by taking gradients of a risk functional of interest, such as the Conditional Value at Risk (CVaR) which ignores some quantile of extreme losses. However, deep learning must almost always use mini-batch gradient descent, and the lack of unbiased estimators of various risk functionals makes the right optimization procedure unclear. In this work, we introduce a meta-learning-based method of learning an interpretable mini-batch risk functional during model training, in a single shot. When optimizing for various risk functionals, the learned mini-batch risk functions lead to risk reduction of up to 10% over hand-engineered mini-batch risk functionals. Then in a setting where the right risk functional is unknown a priori, our method improves over baseline by 14% relative (~9% absolute). We analyze the learned mini-batch risk functionals at different points through training, and find that they learn a curriculum (including warm-up periods), and that their final form can be surprisingly different from the underlying risk functional that they optimize for.
    Diverse Weight Averaging for Out-of-Distribution Generalization. (arXiv:2205.09739v2 [cs.CV] UPDATED)
    Standard neural networks struggle to generalize under distribution shifts in computer vision. Fortunately, combining multiple networks can consistently improve out-of-distribution generalization. In particular, weight averaging (WA) strategies were shown to perform best on the competitive DomainBed benchmark; they directly average the weights of multiple networks despite their nonlinearities. In this paper, we propose Diverse Weight Averaging (DiWA), a new WA strategy whose main motivation is to increase the functional diversity across averaged models. To this end, DiWA averages weights obtained from several independent training runs: indeed, models obtained from different runs are more diverse than those collected along a single run thanks to differences in hyperparameters and training procedures. We motivate the need for diversity by a new bias-variance-covariance-locality decomposition of the expected error, exploiting similarities between WA and standard functional ensembling. Moreover, this decomposition highlights that WA succeeds when the variance term dominates, which we show occurs when the marginal distribution changes at test time. Experimentally, DiWA consistently improves the state of the art on DomainBed without inference overhead.
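To make the weight-averaging idea above concrete, here is a minimal sketch of uniformly averaging the parameters of several independently trained copies of one architecture. It is not the authors' implementation: the helper `average_weights`, the dictionary layout, and the toy parameter values are assumptions made purely for illustration.

```python
import numpy as np

def average_weights(state_dicts):
    """Uniformly average parameter arrays that share names and shapes."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

# Three hypothetical training runs of the same tiny linear model (toy values).
runs = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 0.0]), "b": np.array([0.3])},
    {"w": np.array([2.0, 1.0]), "b": np.array([0.6])},
]

# A single set of weights: inference costs the same as for one model.
averaged = average_weights(runs)
print(averaged)
```

Unlike a functional ensemble, which would keep all three models and average their predictions, the averaged dictionary describes one model, which is why no inference overhead is incurred.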
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the Depth Prediction Transformer, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (a 10-class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3% to 88.0% on ImageNette and 84.1% to 87.0% on ImageNet-C.
    Hyperbolic VAE via Latent Gaussian Distributions. (arXiv:2209.15217v2 [cs.LG] UPDATED)
    We propose a Gaussian manifold variational auto-encoder (GM-VAE) whose latent space consists of a set of Gaussian distributions. It is known that the set of the univariate Gaussian distributions with the Fisher information metric form a hyperbolic space, which we call a Gaussian manifold. To learn the VAE endowed with the Gaussian manifolds, we propose a pseudo-Gaussian manifold normal distribution based on the Kullback-Leibler divergence, a local approximation of the squared Fisher-Rao distance, to define a density over the latent space. In experiments, we demonstrate the efficacy of GM-VAE on two different tasks: density estimation of image datasets and environment modeling in model-based reinforcement learning. GM-VAE outperforms the other variants of hyperbolic- and Euclidean-VAEs on density estimation tasks and shows competitive performance in model-based reinforcement learning. We observe that our model provides strong numerical stability, addressing a common limitation reported in previous hyperbolic-VAEs.
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed Lévy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.
    Provably Efficient Causal Model-Based Reinforcement Learning for Systematic Generalization. (arXiv:2202.06545v2 [cs.LG] UPDATED)
    In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another that is shared. Consider a set of environments that share the laws of motion as an example. In this setting, the agent can take a finite amount of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide a tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that guarantees any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
    Deep Multi-modal Fusion of Image and Non-image Data in Disease Diagnosis and Prognosis: A Review. (arXiv:2203.15588v3 [cs.LG] UPDATED)
    The rapid development of diagnostic technologies in healthcare is leading to higher requirements for physicians to handle and integrate the heterogeneous, yet complementary data that are produced during routine practice. For instance, the personalized diagnosis and treatment planning for a single cancer patient relies on the various images (e.g., radiological, pathological, and camera images) and non-image data (e.g., clinical data and genomic data). However, such decision-making procedures can be subjective, qualitative, and have large inter-subject variabilities. With the recent advances in multi-modal deep learning technologies, an increasingly large number of efforts have been devoted to a key question: how do we extract and aggregate multi-modal information to ultimately provide more objective, quantitative computer-aided clinical decision making? This paper reviews the recent studies on dealing with such a question. Briefly, this review will include the (1) overview of current multi-modal learning workflows, (2) summarization of multi-modal fusion methods, (3) discussion of the performance, (4) applications in disease diagnosis and prognosis, and (5) challenges and future directions.
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We integrate the above two categories into a unified clustering model, starting with the integer program formulation of the standard K-means. As the two categories provide different useful information, utilizing both allows for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: the first equivalently replaces the binary constraints with the intersection of two continuous constraints; the second transforms the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by the Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate that (1) when utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) when utilizing both categories of constraints jointly, the proposed model shows better performance than in the case of a single category. The experiments show that the proposed method exploits the constraints to improve clustering performance by 2%-5% in classical clustering metrics (e.g., the Adjusted Rand, Mirkin's, and Huber's indices), outperforming other methods.
    Deep Clustering Survival Machines with Interpretable Expert Distributions. (arXiv:2301.11826v1 [cs.LG])
    Conventional survival analysis methods are typically ineffective at characterizing heterogeneity in the population, even though such information can be used to assist predictive modeling. In this study, we propose a hybrid survival analysis method, referred to as deep clustering survival machines, that combines the discriminative and generative mechanisms. Similar to mixture models, we assume that the timing information of survival data is generatively described by a mixture of a certain number of parametric distributions, i.e., expert distributions. We learn weights of the expert distributions for individual instances according to their features discriminatively such that each instance's survival information can be characterized by a weighted combination of the learned constant expert distributions. This method also facilitates interpretable subgrouping/clustering of all instances according to their associated expert distributions. Extensive experiments on both real and synthetic datasets have demonstrated that the method is capable of obtaining promising clustering results and competitive time-to-event prediction performance.
    A Game-Theoretic Framework for Managing Risk in Multi-Agent Systems. (arXiv:2205.15434v2 [cs.LG] UPDATED)
    In order for agents in multi-agent systems (MAS) to be safe, they need to take into account the risks posed by the actions of other agents. However, the dominant paradigm in game theory (GT) assumes that agents are not affected by risk from other agents and only strive to maximise their expected utility. For example, in hybrid human-AI driving systems, it is necessary to limit large deviations in reward resulting from car crashes. Although there are equilibrium concepts in game theory that take into account risk aversion, they either assume that agents are risk-neutral with respect to the uncertainty caused by the actions of other agents, or they are not guaranteed to exist. We introduce a new GT-based Risk-Averse Equilibrium (RAE) that always produces a solution that minimises the potential variance in reward accounting for the strategy of other agents. Theoretically and empirically, we show RAE shares many properties with a Nash Equilibrium (NE), establishing convergence properties and generalising to risk-dominant NE in certain cases. To tackle large-scale problems, we extend RAE to the PSRO multi-agent reinforcement learning (MARL) framework. We empirically demonstrate the minimum reward variance benefits of RAE in matrix games with high-risk outcomes. Results on MARL experiments show RAE generalises to risk-dominant NE in a trust dilemma game and that it reduces instances of crashing by 7x in an autonomous driving setting versus the best performing baseline.
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.
    Sparse Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v6 [cs.CV] UPDATED)
    Human visual perception can easily generalize to out-of-distribution visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment with the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.
    PECAN: A Deterministic Certified Defense Against Backdoor Attacks. (arXiv:2301.11824v1 [cs.CR])
    Neural networks are vulnerable to backdoor poisoning attacks, where the attackers maliciously poison the training set and insert triggers into the test input to change the prediction of the victim model. Existing defenses for backdoor attacks either provide no formal guarantees or come with expensive-to-compute and ineffective probabilistic guarantees. We present PECAN, an efficient and certified approach for defending against backdoor attacks. The key insight powering PECAN is to apply off-the-shelf test-time evasion certification techniques on a set of neural networks trained on disjoint partitions of the data. We evaluate PECAN on image classification and malware detection datasets. Our results demonstrate that PECAN can (1) significantly outperform the state-of-the-art certified backdoor defense, both in defense strength and efficiency, and (2) on real backdoor attacks, reduce the attack success rate by an order of magnitude when compared to a range of baselines from the literature.
    Naive Few-Shot Learning: Uncovering the fluid intelligence of machines. (arXiv:2205.12013v3 [cs.AI] UPDATED)
    In this paper, we aimed to help bridge the gap between human fluid intelligence - the ability to solve novel tasks without prior training - and the performance of deep neural networks, which typically require extensive prior training. An essential cognitive component for solving intelligence tests, which in humans are used to measure fluid intelligence, is the ability to identify regularities in sequences. This motivated us to construct a benchmark task, which we term sequence consistency evaluation (SCE), whose solution requires the ability to identify regularities in sequences. Given the proven capabilities of deep networks, their ability to solve such tasks after extensive training is expected. Surprisingly, however, we show that naive (randomly initialized) deep learning models that are trained on a single SCE with a single optimization step can still solve non-trivial versions of the task relatively well. We extend our findings to solve, without any prior training, real-world anomaly detection tasks in the visual and auditory modalities. These results demonstrate the fluid-intelligent computational capabilities of deep networks. We discuss the implications of our work for constructing fluid-intelligent machines.
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of bi-stochastically normalized graph Laplacian to manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $ O( n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)} $. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.
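As a rough illustration of the Sinkhorn-Knopp (SK) iterations mentioned above, the sketch below scales a Gaussian kernel matrix until its rows and columns each sum to one. The point cloud, the bandwidth value 0.5, the iteration count, and the function name `sinkhorn_knopp` are all assumptions chosen for the example; the paper's approximate, early-terminated variant is not reproduced here.

```python
import numpy as np

def sinkhorn_knopp(K, n_iter=500):
    """Scale a positive matrix K to diag(r) @ K @ diag(c) with unit row and column sums."""
    r = np.ones(K.shape[0])
    c = np.ones(K.shape[1])
    for _ in range(n_iter):
        r = 1.0 / (K @ c)      # fix the row sums given the current column scaling
        c = 1.0 / (K.T @ r)    # fix the column sums given the new row scaling
    return r[:, None] * K * c[None, :]

# Toy data: 20 points in the unit square and a Gaussian affinity matrix.
rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 0.5)   # assumed bandwidth, for illustration only

W = sinkhorn_knopp(K)
print(W.sum(axis=0)[:3], W.sum(axis=1)[:3])
```

Stopping the loop after only a few passes gives a matrix that is approximately bi-stochastic, which is the kind of early termination the abstract alludes to.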
    Accelerating Domain-aware Deep Learning Models with Distributed Training. (arXiv:2301.11787v1 [cs.LG])
    Recent advances in data-generating techniques have led to an explosive growth of geo-spatiotemporal data. In domains such as hydrology, ecology, and transportation, interpreting the complex underlying patterns of spatiotemporal interactions with the help of deep learning techniques has therefore become a pressing need. However, applying deep learning techniques without domain-specific knowledge tends to provide sub-optimal prediction performance. Moreover, training such models on large-scale data requires extensive computational resources. To address these challenges, we present a novel distributed domain-aware spatiotemporal network that utilizes domain-specific knowledge with improved model performance. Our network consists of a pixel-contribution block, a distributed multiheaded multichannel convolutional (CNN) spatial block, and a recurrent temporal block. We choose flood prediction in hydrology as a use case to test our proposed method. From our analysis, the network effectively predicts high peaks in discharge measurements at watershed outlets with up to 4.1x speedup and increased prediction performance of up to 93%. Our approach achieved a 12.6x overall speedup and increased the mean prediction performance by 16%. We perform extensive experiments on a dataset of 23 watersheds in a northern state of the U.S. and present our findings.
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.
    Feature Selection on Quantum Computers. (arXiv:2203.13261v2 [quant-ph] UPDATED)
    In machine learning, fewer features reduce model complexity. Carefully assessing the influence of each input feature on the model quality is therefore a crucial preprocessing step. We propose a novel feature selection algorithm based on a quadratic unconstrained binary optimization (QUBO) problem, which allows selecting a specified number of features based on their importance and redundancy. In contrast to iterative or greedy methods, our direct approach yields higher-quality solutions. QUBO problems are particularly interesting because they can be solved on quantum hardware. To evaluate our proposed algorithm, we conduct a series of numerical experiments using a classical computer, a quantum gate computer and a quantum annealer. Our evaluation compares our method to a range of standard methods on various benchmark datasets. We observe competitive performance.
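To show how importance and redundancy can be traded off in a quadratic binary objective, here is a small sketch that is solved by brute force on a classical machine. The exact matrix used in the paper is not given in the abstract, so the construction in `build_qubo`, the toy scores, and the choice to enumerate only subsets of size k (instead of folding the cardinality requirement into the objective, as a truly unconstrained formulation would) are all assumptions.

```python
import itertools
import numpy as np

def build_qubo(importance, redundancy):
    """Hypothetical QUBO matrix: -importance on the diagonal, redundancy above it."""
    n = len(importance)
    Q = np.triu(np.asarray(redundancy, dtype=float), k=1)
    Q[np.arange(n), np.arange(n)] = -np.asarray(importance, dtype=float)
    return Q

def best_subset(Q, k):
    """Enumerate all binary vectors with exactly k ones and minimize x^T Q x."""
    n = Q.shape[0]
    best, best_energy = None, np.inf
    for subset in itertools.combinations(range(n), k):
        x = np.zeros(n)
        x[list(subset)] = 1.0
        energy = x @ Q @ x
        if energy < best_energy:
            best, best_energy = subset, energy
    return best, best_energy

# Toy scores: features 0 and 1 are both important but highly redundant.
importance = [0.9, 0.8, 0.1, 0.7]
redundancy = np.full((4, 4), 0.1)
redundancy[0, 1] = redundancy[1, 0] = 0.9
np.fill_diagonal(redundancy, 0.0)

selected, energy = best_subset(build_qubo(importance, redundancy), k=2)
print(selected, energy)
```

With these toy numbers the search prefers features 0 and 3 over the two individually strongest features 0 and 1, because the redundancy penalty outweighs the small gain in importance.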
    Constrained Monotonic Neural Networks. (arXiv:2205.11775v2 [cs.LG] UPDATED)
    Deep neural networks are becoming increasingly popular in approximating arbitrary functions from noisy data. But wider adoption is being hindered by the need to explain such models and to impose additional constraints on them. The monotonicity constraint is one of the most requested properties in real-world scenarios and is the focus of this paper. One of the oldest ways to construct a monotonic fully connected neural network is to constrain its weights to be non-negative while employing a monotonic activation function. Unfortunately, this construction does not work with popular non-saturated activation functions such as ReLU, ELU, SELU, etc., as it can only approximate convex functions. We show this shortcoming can be fixed by employing the original activation function for a part of the neurons in the layer, and employing its point reflection for the other part. Our experiments show this approach of building monotonic deep neural networks has matching or better accuracy when compared to other state-of-the-art methods such as deep lattice networks or monotonic networks obtained by heuristic regularization. This method is the simplest one in the sense of having the least number of parameters and not requiring any modifications to the learning procedure or post-learning steps.
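The construction described above can be sketched in a few lines: weights are made non-negative, some hidden units use ReLU, and the rest use its point reflection about the origin, -ReLU(-x). This is only an illustrative forward pass; the layer sizes, the use of an absolute value to enforce non-negativity, and the names `monotonic_layer` and `n_reflected` are assumptions rather than details from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def reflected_relu(x):
    # Point reflection of ReLU about the origin: concave but still non-decreasing.
    return -relu(-x)

def monotonic_layer(x, W, b, n_reflected):
    """Dense layer with non-negative weights; the last n_reflected units use the reflected activation."""
    z = x @ np.abs(W) + b
    split = z.shape[-1] - n_reflected
    return np.concatenate([relu(z[..., :split]), reflected_relu(z[..., split:])], axis=-1)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(1, 8)), rng.normal(size=8)
W2 = rng.normal(size=(8, 1))

xs = np.linspace(-3.0, 3.0, 101)[:, None]
hidden = monotonic_layer(xs, W1, b1, n_reflected=4)
ys = hidden @ np.abs(W2)  # non-negative output weights preserve monotonicity

print(bool(np.all(np.diff(ys[:, 0]) >= 0)))
```

Mixing convex (ReLU) and concave (reflected ReLU) units is what lets the network represent monotone functions that are neither convex nor concave, while every path from input to output remains non-decreasing.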
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as self-consistency. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recover the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has been shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.
    Image Restoration with Mean-Reverting Stochastic Differential Equations. (arXiv:2301.11699v1 [cs.LG])
    This paper presents a stochastic differential equation (SDE) approach for general-purpose image restoration. The key construction consists in a mean-reverting SDE that transforms a high-quality image into a degraded counterpart as a mean state with fixed Gaussian noise. Then, by simulating the corresponding reverse-time SDE, we are able to restore the origin of the low-quality image without relying on any task-specific prior knowledge. Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing us to compute the ground truth time-dependent score and learn it with a neural network. Moreover, we propose a maximum likelihood objective to learn an optimal reverse trajectory which stabilizes the training and improves the restoration results. In the experiments, we show that our proposed method achieves highly competitive performance in quantitative comparisons on image deraining, deblurring, and denoising, setting a new state-of-the-art on two deraining datasets. Finally, the general applicability of our approach is further demonstrated via qualitative results on image super-resolution, inpainting, and dehazing. Code is available at https://github.com/Algolzw/image-restoration-sde.
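To illustrate what a closed-form solution buys, consider the simplest mean-reverting SDE, the Ornstein-Uhlenbeck process dx = theta * (mu - x) dt + sigma dW with constant coefficients. Its state at time t is Gaussian with a known mean and variance, so a noisy state can be sampled directly without simulating the trajectory. The constant coefficients, the two-pixel "images", and the function names below are assumptions for illustration; the paper's own parameterization may differ.

```python
import numpy as np

def ou_mean(x0, mu, theta, t):
    """Mean of x(t): decays exponentially from the clean state x0 toward mu."""
    return mu + (x0 - mu) * np.exp(-theta * t)

def ou_variance(sigma, theta, t):
    """Variance of x(t): grows from 0 and saturates at sigma^2 / (2 * theta)."""
    return sigma**2 / (2.0 * theta) * (1.0 - np.exp(-2.0 * theta * t))

def sample_forward(x0, mu, theta, sigma, t, rng):
    """Draw x(t) in one step from its Gaussian marginal."""
    std = np.sqrt(ou_variance(sigma, theta, t))
    return ou_mean(x0, mu, theta, t) + std * rng.standard_normal(np.shape(x0))

x0 = np.array([1.0, 0.0])   # toy "high-quality" pixels
mu = np.array([0.5, 0.5])   # toy "degraded" pixels acting as the mean state
theta, sigma = 1.0, 0.2

rng = np.random.default_rng(0)
x_t = sample_forward(x0, mu, theta, sigma, t=2.0, rng=rng)
print(ou_mean(x0, mu, theta, 2.0), ou_variance(sigma, theta, 2.0), x_t)
```

Because the marginal is Gaussian with these known moments, its score with respect to x(t) is available analytically, which is the kind of ground-truth target a score network can be trained against.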
    Efficiently predicting high resolution mass spectra with graph neural networks. (arXiv:2301.11419v1 [cs.LG])
    Identifying a small molecule from its mass spectrum is the primary open problem in computational metabolomics. This is typically cast as information retrieval: an unknown spectrum is matched against spectra predicted computationally from a large database of chemical structures. However, current approaches to spectrum prediction model the output space in ways that force a tradeoff between capturing high resolution mass information and tractable learning. We resolve this tradeoff by casting spectrum prediction as a mapping from an input molecular graph to a probability distribution over molecular formulas. We discover that a large corpus of mass spectra can be closely approximated using a fixed vocabulary constituting only 2% of all observed formulas. This enables efficient spectrum prediction using an architecture similar to graph classification - GrAFF-MS - achieving significantly lower prediction error and orders-of-magnitude faster runtime than state-of-the-art methods.
    Constrained Submodular Optimization for Vaccine Design. (arXiv:2206.08336v2 [q-bio.QM] UPDATED)
    Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
    Regret Analysis of Learning-Based MPC with Partially-Unknown Cost Function. (arXiv:2108.02307v2 [math.OC] UPDATED)
    The exploration/exploitation trade-off is an inherent challenge in data-driven adaptive control. Though this trade-off has been studied for multi-armed bandits (MABs) and reinforcement learning for linear systems, it is less well-studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that there is no explicit characterization of an optimal controller for a given set of cost and system parameters. We propose the use of a finite-horizon oracle controller with full knowledge of parameters as a reasonable surrogate for the optimal controller. This allows us to develop policies in the context of learning-based MPC and MABs and conduct a control-theoretic analysis using techniques from MPC and optimization theory to show these policies achieve low regret with respect to this finite-horizon oracle. Our simulations exhibit the low regret of our policy on a heating, ventilation, and air-conditioning model with a partially-unknown cost function.
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.
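    The idea of controlling a risk online can be illustrated with a simple threshold update in the spirit of adaptive conformal methods. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `online_risk_control`, the symmetric interval, the miscoverage loss, the step size, and the synthetic residual stream are all hypothetical choices made for illustration.

```python
import random

def online_risk_control(residuals, target_risk, step_size, theta0=0.5):
    """Track a threshold theta so the running miscoverage rate approaches target_risk.

    residuals: iterable of |y_t - prediction_t| values, revealed one at a time.
    Returns the list of 0/1 losses (1 means the interval missed y_t).
    """
    theta = theta0
    losses = []
    for r in residuals:
        # The interval [pred - theta, pred + theta] misses y_t when the residual exceeds theta.
        loss = 1.0 if r > theta else 0.0
        losses.append(loss)
        # Widen the interval after a miss, shrink it slightly after a hit.
        theta += step_size * (loss - target_risk)
    return losses

rng = random.Random(0)
# Hypothetical residual stream whose scale shifts halfway through.
residuals = [rng.random() * (0.3 if t < 1000 else 1.0) for t in range(2000)]
losses = online_risk_control(residuals, target_risk=0.1, step_size=0.05)
average_risk = sum(losses) / len(losses)
```

    Because the sum of `loss - target_risk` telescopes into the change in `theta` divided by the step size, and `theta` stays bounded when residuals are bounded, the average loss in this sketch stays close to the target even though the residual distribution shifts.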
    Safe Posterior Sampling for Constrained MDPs with Bounded Constraint Violation. (arXiv:2301.11547v1 [cs.LG])
    Constrained Markov decision processes (CMDPs) model scenarios of sequential decision making with multiple objectives that are increasingly important in many applications. However, the model is often unknown and must be learned online while still ensuring the constraint is met, or at least the violation is bounded with time. Some recent papers have made progress on this very challenging problem but either need unsatisfactory assumptions such as knowledge of a safe policy, or have high cumulative regret. We propose the Safe PSRL (posterior sampling-based RL) algorithm that does not need such assumptions and yet performs very well, both in terms of theoretical regret bounds and empirically. The algorithm achieves an efficient tradeoff between exploration and exploitation by use of the posterior sampling principle, and provably suffers only bounded constraint violation by leveraging the idea of pessimism. Our method is based on a primal-dual approach. We establish a sub-linear $\tilde{\mathcal{O}}\left(H^{2.5} \sqrt{|\mathcal{S}|^2 |\mathcal{A}| K} \right)$ upper bound on the Bayesian reward objective regret along with a bounded, i.e., $\tilde{\mathcal{O}}\left(1\right)$ constraint violation regret over $K$ episodes for an $|\mathcal{S}|$-state, $|\mathcal{A}|$-action and horizon $H$ CMDP.  ( 2 min )
    Targeted Attacks on Timeseries Forecasting. (arXiv:2301.11544v1 [cs.LG])
    Real-world deep learning models developed for Time Series Forecasting are used in several critical applications ranging from medical devices to the security domain. Many previous works have shown how deep learning models are prone to adversarial attacks and studied their vulnerabilities. However, the vulnerabilities of time series models for forecasting due to adversarial inputs are not extensively explored. While the attack on a forecasting model might aim to deteriorate the performance of the model, it is more effective if the attack is focused on a specific impact on the model's output. In this paper, we propose a novel formulation of Directional, Amplitudinal, and Temporal targeted adversarial attacks on time series forecasting models. These targeted attacks create a specific impact on the amplitude and direction of the output prediction. We use the existing adversarial attack techniques from the computer vision domain and adapt them for time series. Additionally, we propose a modified version of the Auto Projected Gradient Descent attack for targeted attacks. We examine the impact of the proposed targeted attacks versus untargeted attacks. We use KS-Tests to statistically demonstrate the impact of the attack. Our experimental results show how targeted attacks on time series models are viable and are more powerful in terms of statistical similarity. They are hence difficult to detect through statistical methods. We believe that this work opens a new paradigm in the time series forecasting domain and represents an important consideration for developing better defenses.  ( 2 min )
    Decentralized Online Bandit Optimization on Directed Graphs with Regret Bounds. (arXiv:2301.11802v1 [cs.LG])
    We consider a decentralized multiplayer game, played over $T$ rounds, with a leader-follower hierarchy described by a directed acyclic graph. For each round, the graph structure dictates the order of the players and how players observe the actions of one another. By the end of each round, all players receive a joint bandit-reward based on their joint action that is used to update the player strategies towards the goal of minimizing the joint pseudo-regret. We present a learning algorithm inspired by the single-player multi-armed bandit problem and show that it achieves sub-linear joint pseudo-regret in the number of rounds for both adversarial and stochastic bandit rewards. Furthermore, we quantify the cost incurred due to the decentralized nature of our problem compared to the centralized setting.
    Multimodal and Explainable Internet Meme Classification. (arXiv:2212.05612v2 [cs.AI] UPDATED)
    Warning: this paper contains content that may be offensive or upsetting. In the current context where online platforms have been effectively weaponized in a variety of geo-political events and social issues, Internet memes make fair content moderation at scale even more difficult. Existing work on meme classification and tracking has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. In this paper, we pursue a modular and explainable architecture for Internet meme understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging both textual and visual SOTA models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.  ( 2 min )
    Certified Invertibility in Neural Networks via Mixed-Integer Programming. (arXiv:2301.11783v1 [cs.LG])
    Neural networks are notoriously vulnerable to adversarial attacks -- small imperceptible perturbations that can change the network's output drastically. In the reverse direction, there may exist large, meaningful perturbations that leave the network's decision unchanged (excessive invariance, noninvertibility). We study the latter phenomenon in two contexts: (a) discrete-time dynamical system identification, as well as (b) calibration of the output of one neural network to the output of another (neural network matching). For ReLU networks and $L_p$ norms ($p=1,2,\infty$), we formulate these optimization problems as mixed-integer programs (MIPs) that apply to neural network approximators of dynamical systems. We also discuss the applicability of our results to invertibility certification in transformations between neural networks (e.g. at different levels of pruning).
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace, i.e., all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignment mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    Big portfolio selection is important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty of the classical high-dimensional GARCH-type methods. Moreover, the QCM method tolerates some mis-specification in modeling conditional quantiles, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria used to sort the stocks and carry out the portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are composed of conditional variance, skewness, and kurtosis.
    Collaborative Regret Minimization in Multi-Armed Bandits. (arXiv:2301.11442v1 [cs.LG])
    In this paper, we study the collaborative learning model, which concerns the tradeoff between parallelism and communication overhead in multi-agent reinforcement learning. For a fundamental problem in bandit theory, regret minimization in multi-armed bandits, we present the first and almost tight tradeoffs between the number of rounds of communication between the agents and the regret of the collaborative learning process.  ( 2 min )
    Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence. (arXiv:2301.11476v1 [cs.LG])
    Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence -- called the Tsallis KL divergence -- which uses the $q$-logarithm in the definition. The approach is a strict generalization, as $q = 1$ corresponds to the standard KL divergence; $q > 1$ provides a range of new options. We characterize the types of policies learned under the Tsallis KL, and motivate when $q > 1$ could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporate KL regularization. We show that this generalized MVI($q$) obtains significant improvements over the standard MVI($q = 1$) across 35 Atari games.
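    The claim that $q = 1$ recovers the standard KL divergence can be checked numerically. The sketch below assumes one common convention for the $q$-logarithm, $\ln_q x = (x^{1-q} - 1)/(1 - q)$, and defines the divergence as $-\sum_i p_i \ln_q(r_i / p_i)$; the paper's own parameterization may differ, and the function names `q_log` and `tsallis_kl` are hypothetical.

```python
import math

def q_log(x, q):
    """q-logarithm under the assumed convention ln_q(x) = (x**(1-q) - 1) / (1 - q)."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)  # the limit q -> 1 is the natural logarithm
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(p, r, q):
    """D_q(p || r) = -sum_i p_i * ln_q(r_i / p_i), skipping entries with p_i == 0."""
    return -sum(pi * q_log(ri / pi, q) for pi, ri in zip(p, r) if pi > 0)

# Two hypothetical discrete policies over three actions.
p = [0.7, 0.2, 0.1]
r = [0.5, 0.3, 0.2]
kl = tsallis_kl(p, r, q=1.0)    # standard KL divergence
chi2 = tsallis_kl(p, r, q=2.0)  # under this convention, sum(p_i**2 / r_i) - 1
```

    Under this convention, `q=2.0` yields the chi-squared divergence, which penalizes large ratios `p_i / r_i` more heavily than the standard KL does.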
    Neural Abstractions. (arXiv:2301.11683v1 [cs.LO])
    We present a novel method for the safety verification of nonlinear dynamical models that uses neural networks to represent abstractions of their dynamics. Neural networks have extensively been used before as approximators; in this work, we go a step further and use them for the first time as abstractions. For a given dynamical model, our method synthesises a neural network that overapproximates its dynamics by ensuring an arbitrarily tight, formally certified bound on the approximation error. For this purpose, we employ a counterexample-guided inductive synthesis procedure. We show that this produces a neural ODE with non-deterministic disturbances that constitutes a formal abstraction of the concrete model under analysis. This guarantees a fundamental property: if the abstract model is safe, i.e., free from any initialised trajectory that reaches an undesirable state, then the concrete model is also safe. By using neural ODEs with ReLU activation functions as abstractions, we cast the safety verification problem for nonlinear dynamical models into that of hybrid automata with affine dynamics, which we verify using SpaceEx. We demonstrate that our approach performs comparably to the mature tool Flow* on existing benchmark nonlinear models. We additionally demonstrate that it is effective on models that do not exhibit local Lipschitz continuity, which are out of reach of existing technologies.
    Semi-Parametric Video-Grounded Text Generation. (arXiv:2301.11507v1 [cs.CV])
    Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames. Parametric approaches such as the attention mechanism may not be ideal since their computational cost increases quadratically with the video length. Rather, previous studies have relied on offline feature extraction or frame sampling to represent the video efficiently, focusing on cross-modal modeling in short video clips. In this paper, we propose a semi-parametric video-grounded text generation model, SeViT, a novel perspective on scalable video-language modeling toward long untrimmed videos. Treating a video as an external data store, SeViT includes a non-parametric frame retriever to select a few query-relevant frames from the data store for a given query and a parametric generator to effectively aggregate the frames with the query via late fusion methods. Experimental results demonstrate our method has a significant advantage in longer videos and causal video understanding. Moreover, our model achieves the new state of the art on four video-language datasets, iVQA (+4.8), Next-QA (+6.9), and Activitynet-QA (+4.8) in accuracy, and MSRVTT-Caption (+3.6) in CIDEr.
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) is proposed to improve the worst performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most previous work resorts to the model-based approach, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits its application to environments with simple dynamics. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., no repeated sampling from a state. Based on the standard Q-learning, we propose distributionally robust Q-learning with a single trajectory (DRQ) and its average-reward variant named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in the perturbed environments against the non-robust ones.
    Outcome-directed Reinforcement Learning by Uncertainty & Temporal Distance-Aware Curriculum Goal Generation. (arXiv:2301.11741v1 [cs.LG])
    Current reinforcement learning (RL) often suffers when solving a challenging exploration problem where the desired outcomes or high rewards are rarely observed. Even though curriculum RL, a framework that solves complex tasks by proposing a sequence of surrogate tasks, shows reasonable results, most previous works still have difficulty in proposing a curriculum due to the absence of a mechanism for obtaining calibrated guidance to the desired outcome state without any prior domain knowledge. To alleviate this, we propose an uncertainty & temporal distance-aware curriculum goal generation method for outcome-directed RL via solving a bipartite matching problem. It not only provides precisely calibrated guidance of the curriculum to the desired outcome states but also brings much better sample efficiency and geometry-agnostic curriculum goal proposal capability compared to previous curriculum RL methods. We demonstrate that our algorithm significantly outperforms these prior methods in a variety of challenging navigation tasks and robotic manipulation tasks in a quantitative and qualitative way.
    Semi-Supervised Machine Learning: a Homological Approach. (arXiv:2301.11658v1 [cs.LG])
    In this paper we describe the mathematical foundations of a new approach to semi-supervised Machine Learning. Using techniques of Symbolic Computation and Computer Algebra, we apply the concept of persistent homology to obtain a new semi-supervised learning method.
    Modeling human road crossing decisions as reward maximization with visual perception limitations. (arXiv:2301.11737v1 [cs.LG])
    Understanding the interaction between different road users is critical for road safety and automated vehicles (AVs). Existing mathematical models on this topic have been proposed based mostly on either cognitive or machine learning (ML) approaches. However, current cognitive models are incapable of simulating road user trajectories in general scenarios, and ML models lack a focus on the mechanisms generating the behavior and take a high-level perspective which can cause failures to capture important human-like behaviors. Here, we develop a model of human pedestrian crossing decisions based on computational rationality, an approach using deep reinforcement learning (RL) to learn boundedly optimal behavior policies given human constraints, in our case a model of the limited human visual system. We show that the proposed combined cognitive-RL model captures human-like patterns of gap acceptance and crossing initiation time. Interestingly, our model's decisions are sensitive to not only the time gap, but also the speed of the approaching vehicle, something which has been described as a "bias" in human gap acceptance behavior. However, our results suggest that this is instead a rational adaption to human perceptual limitations. Moreover, we demonstrate an approach to accounting for individual differences in computational rationality models, by conditioning the RL policy on the parameters of the human constraints. Our results demonstrate the feasibility of generating more human-like road user behavior by combining RL with cognitive models.
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.
    Synopsis: Sequential Decision Problems with Weak Feedback. (arXiv:2212.11599v2 [cs.LG] UPDATED)
    This thesis considers sequential decision problems, where the loss/reward incurred by selecting an action may not be inferred from observed feedback. A major part of this thesis focuses on the unsupervised sequential selection problem, where one cannot infer the loss incurred for selecting an action from observed feedback. We also introduce a new setup named Censored Semi Bandits, where the loss incurred for selecting an action can be observed under certain conditions. Finally, we study the channel selection problem in communication networks, where the reward for an action is only observed when no other player selects that action to play in the round. These problems find applications in many fields such as healthcare, crowd-sourcing, security, and adaptive resource allocation, among many others. This thesis aims to address the above-described sequential decision problems by exploiting specific structures these problems exhibit. We develop provably optimal algorithms for each of these setups with weak feedback and validate their empirical performance on different problem instances derived from synthetic and real datasets.  ( 2 min )
    SLCNN: Sentence-Level Convolutional Neural Network for Text Classification. (arXiv:2301.11696v1 [cs.CL])
    Text classification is a fundamental task in natural language processing (NLP). Several recent studies show the success of deep learning on text processing. Convolutional neural network (CNN), as a popular deep learning model, has shown remarkable success in the task of text classification. In this paper, new baseline models have been studied for text classification using CNN. In these models, documents are fed to the network as a three-dimensional tensor representation to provide sentence-level analysis. Applying such a method enables the models to take advantage of the positional information of the sentences in the text. Besides, analysing adjacent sentences allows extracting additional features. The proposed models have been compared with the state-of-the-art models using several datasets. The results have shown that the proposed models have better performance, particularly in the longer documents.
    Uplifting Message Passing Neural Network with Graph Original Information. (arXiv:2210.05382v2 [cs.LG] UPDATED)
    Message passing neural networks (MPNNs) learn the representation of graph-structured data based on graph original information, including node features and graph structures, and have shown astonishing improvement in node classification tasks. However, the expressive power of MPNNs is upper bounded by the first-order Weisfeiler-Leman test and their accuracy still has room for improvement. This work studies how to improve MPNNs' expressiveness and generalizability by fully exploiting graph original information both theoretically and empirically. It further proposes a new GNN model called INGNN (INformation-enhanced Graph Neural Network) that leverages the insights to improve node classification performance. Extensive experiments on both synthetic and real datasets demonstrate the superiority (average rank 1.78) of our INGNN compared with state-of-the-art methods.  ( 2 min )
    Soft Labels for Rapid Satellite Object Detection. (arXiv:2212.00585v3 [cs.CV] UPDATED)
    Soft labels in image classification are vector representations of an image's true classification. In this paper, we investigate soft labels in the context of satellite object detection. We propose using detections as the basis for a new dataset of soft labels. Much of the effort in creating a high-quality model is gathering and annotating the training data. If we could use a model to generate a dataset for us, we could not only rapidly create datasets, but also supplement existing open-source datasets. Using a subset of the xView dataset, we train a YOLOv5 model to detect cars, planes, and ships. We then use that model to generate soft labels for the second training set which we then train and compare to the original model. We show that soft labels can be used to train a model that is almost as accurate as a model trained on the original data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build, for the first time to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows, for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.
    A Robust Optimisation Perspective on Counterexample-Guided Repair of Neural Networks. (arXiv:2301.11342v1 [cs.LG])
    Counterexample-guided repair aims at creating neural networks with mathematical safety guarantees, facilitating the application of neural networks in safety-critical domains. However, whether counterexample-guided repair is guaranteed to terminate remains an open question. We approach this question by showing that counterexample-guided repair can be viewed as a robust optimisation algorithm. While termination guarantees for neural network repair itself remain beyond our reach, we prove termination for more restrained machine learning models and disprove termination in a general setting. We empirically study the practical implications of our theoretical results, demonstrating the suitability of common verifiers and falsifiers for repair despite a disadvantageous theoretical result. Additionally, we use our theoretical insights to devise a novel algorithm for repairing linear regression models, surpassing existing approaches.  ( 2 min )
    MLExchange: A web-based platform enabling exchangeable machine learning workflows for scientific studies. (arXiv:2208.09751v4 [cs.LG] UPDATED)
    Machine learning (ML) algorithms are increasingly helping scientific communities across different disciplines and institutions to address large and diverse data problems. However, many available ML tools are programmatically demanding and computationally costly. The MLExchange project aims to build a collaborative platform equipped with enabling tools that allow scientists and facility users who do not have a profound ML background to use ML and computational resources in scientific discovery. At the high level, we are targeting a full user experience where managing and exchanging ML algorithms, workflows, and data are readily available through web applications. Since each component is an independent container, the whole platform or its individual service(s) can be easily deployed at servers of different scales, ranging from a personal device (laptop, smart phone, etc.) to high performance clusters (HPC) accessed (simultaneously) by many users. Thus, MLExchange supports flexible usage scenarios -- users could either access the services and resources from a remote server or run the whole platform or its individual service(s) within their local network.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, their finite-time convergence is still poorly understood in their most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analysis cannot conclude the convergence for such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Phy-Q as a measure for physical reasoning intelligence. (arXiv:2108.13696v3 [cs.AI] UPDATED)
    Humans are well-versed in reasoning about the behaviors of physical objects and choosing actions accordingly to accomplish tasks, while it remains a major challenge for AI. To facilitate research addressing this problem, we propose a new testbed that requires an agent to reason about physical scenarios and take an action appropriately. Inspired by the physical knowledge acquired in infancy and the capabilities required for robots to operate in real-world environments, we identify 15 essential physical scenarios. We create a wide variety of distinct task templates, and we ensure all the task templates within the same scenario can be solved by using one specific strategic physical rule. By having such a design, we evaluate two distinct levels of generalization, namely the local generalization and the broad generalization. We conduct an extensive evaluation with human players, learning agents with varying input types and architectures, and heuristic agents with different strategies. Inspired by how human IQ is calculated, we define the physical reasoning quotient (Phy-Q score) that reflects the physical reasoning intelligence of an agent using the physical scenarios we considered. Our evaluation shows that 1) all agents are far below human performance, and 2) learning agents, even with good local generalization ability, struggle to learn the underlying physical reasoning rules and fail to generalize broadly. We encourage the development of intelligent agents that can reach the human level Phy-Q score. Website: https://github.com/phy-q/benchmark  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.  ( 2 min )
    Optimized Sparse Matrix Operations for Reverse Mode Automatic Differentiation. (arXiv:2212.05159v2 [cs.LG] UPDATED)
    Sparse matrix representations are ubiquitous in computational science and machine learning, leading to significant reductions in compute time, in comparison to dense representation, for problems that have local connectivity. The adoption of sparse representation in leading ML frameworks such as PyTorch is incomplete, however, with support for both automatic differentiation and GPU acceleration missing. In this work, we present an implementation of a CSR-based sparse matrix wrapper for PyTorch with CUDA acceleration for basic matrix operations, as well as automatic differentiability. We also present several applications of the resulting sparse kernels to optimization problems, demonstrating ease of implementation and performance measurements versus their dense counterparts.  ( 2 min )
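    As a rough illustration of what a CSR (compressed sparse row) product and its reverse-mode gradients involve, here is a minimal pure-Python sketch. It is not the paper's PyTorch/CUDA implementation; the function names `csr_matvec` and `csr_matvec_backward` are hypothetical and chosen only for this example.

```python
def csr_matvec(data, indices, indptr, x):
    """Forward pass y = A @ x, with A stored in CSR form.

    data    : nonzero values, row by row
    indices : column index of each nonzero
    indptr  : indptr[i]..indptr[i+1] delimits the nonzeros of row i
    """
    y = []
    for i in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def csr_matvec_backward(data, indices, indptr, x, grad_y):
    """Reverse-mode gradients of y = A @ x given the upstream gradient grad_y.

    Only the stored nonzeros receive a gradient, so grad_data has the
    same sparsity pattern as A, and grad_x equals A^T @ grad_y.
    """
    grad_data = [0.0] * len(data)
    grad_x = [0.0] * len(x)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            grad_data[k] = grad_y[i] * x[j]
            grad_x[j] += data[k] * grad_y[i]
    return grad_data, grad_x


# A = [[1, 0, 2],
#      [0, 3, 0]]
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
x = [1.0, 2.0, 3.0]
y = csr_matvec(data, indices, indptr, x)
grad_data, grad_x = csr_matvec_backward(data, indices, indptr, x, [1.0, 1.0])
```

    The backward pass touches each stored nonzero once, which is why the cost of differentiation can stay proportional to the number of nonzeros rather than to the dense matrix size.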
    On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2211.13208v2 [cs.LG] UPDATED)
    Sample-efficient offline reinforcement learning (RL) with linear function approximation has recently been studied extensively. Much of prior work has yielded the minimax-optimal bound of $\tilde{\mathcal{O}}(\frac{1}{\sqrt{K}})$, with $K$ being the number of episodes in the offline data. In this work, we seek to understand instance-dependent bounds for offline RL with function approximation. We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI), which leverages data bootstrapping and constrained optimization on top of pessimism. We show that under a partial data coverage assumption, that of concentrability with respect to an optimal policy, the proposed algorithm yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions, even when the offline data were adaptively collected. Moreover, when the linear features of the optimal actions in the states reachable by an optimal policy span those reachable by the behavior policy and the optimal actions are unique, offline RL achieves absolute zero sub-optimality error when $K$ exceeds a (finite) instance-dependent threshold. To the best of our knowledge, these are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data with partial coverage. We also provide instance-agnostic and instance-dependent information-theoretical lower bounds to complement our upper bounds.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach, which we call LMMNN, is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Generalizability of Adversarial Robustness Under Distribution Shifts. (arXiv:2209.15042v2 [cs.LG] UPDATED)
    Recent progress in empirical and certified robustness promises to deliver reliable and deployable Deep Neural Networks (DNNs). Despite that success, most existing evaluations of DNN robustness have been done on images sampled from the same distribution on which the model was trained. However, in the real world, DNNs may be deployed in dynamic environments that exhibit significant distribution shifts. In this work, we take a first step towards thoroughly investigating the interplay between empirical and certified adversarial robustness on one hand and domain generalization on another. To do so, we train robust models on multiple domains and evaluate their accuracy and robustness on an unseen domain. We observe that: (1) both empirical and certified robustness generalize to unseen domains, and (2) the level of generalizability does not correlate well with input visual similarity, measured by the FID between source and target domains. We also extend our study to cover a real-world medical application, in which adversarial augmentation significantly boosts the generalization of robustness with minimal effect on clean data accuracy.  ( 2 min )
    Commonsense Knowledge Salience Evaluation with a Benchmark Dataset in E-commerce. (arXiv:2205.10843v2 [cs.CL] CROSS LISTED)
    In e-commerce, the salience of commonsense knowledge (CSK) is beneficial for widespread applications such as product search and recommendation. For example, when users search for "running" in e-commerce, they would like to find products highly related to running, such as "running shoes" rather than "shoes". Nevertheless, many existing CSK collections rank statements solely by confidence scores, and there is no information about which ones are salient from a human perspective. In this work, we define the task of supervised salience evaluation, where given a CSK triple, the model is required to learn whether the triple is salient or not. In addition to formulating the new task, we also release a new Benchmark dataset of Salience Evaluation in E-commerce (BSEE) and hope to promote related research on commonsense knowledge salience evaluation. We conduct experiments on the dataset with several representative baseline models. The experimental results show that salience evaluation is a challenging task where models perform poorly on our evaluation set. We further propose a simple but effective approach, PMI-tuning, which shows promise for solving this novel problem. Code is available at https://github.com/OpenBGBenchmark/OpenBG-CSK.  ( 2 min )
    Task-Agnostic Graph Neural Network Evaluation via Adversarial Collaboration. (arXiv:2301.11517v1 [cs.LG])
    It has become increasingly important to develop reliable Graph Neural Network (GNN) evaluation methods to quantify the progress of rapidly expanding GNN research. Existing GNN benchmarking methods focus on comparing GNNs with respect to their performance on some node/graph classification/regression tasks in certain datasets. A principled, task-agnostic method to directly compare two GNNs is lacking. Moreover, most of the existing graph self-supervised learning (SSL) works incorporate handcrafted augmentations to the graph, which poses several severe difficulties due to the unique characteristics of graph-structured data. To address the aforementioned issues, we propose GraphAC (Graph Adversarial Collaboration), a conceptually novel, principled, task-agnostic, and stable framework for evaluating GNNs through contrastive self-supervision. GraphAC succeeds in distinguishing GNNs of different expressiveness across various aspects, and has been proven to be a principled and reliable GNN evaluation method, eliminating the need for handcrafted augmentations for stable SSL.  ( 2 min )
    Learning and generalization of one-hidden-layer neural networks, going beyond standard Gaussian data. (arXiv:2207.03615v2 [cs.LG] UPDATED)
    This paper analyzes the convergence and generalization of training a one-hidden-layer neural network when the input features follow the Gaussian mixture model consisting of a finite number of Gaussian distributions. Assuming the labels are generated from a teacher model with an unknown ground truth weight, the learning problem is to estimate the underlying teacher model by minimizing a non-convex risk function over a student neural network. With a finite number of training samples, referred to as the sample complexity, the iterations are proved to converge linearly to a critical point with guaranteed generalization error. In addition, for the first time, this paper characterizes the impact of the input distributions on the sample complexity and the learning rate.  ( 2 min )
    GATE: Gated Additive Tree Ensemble for Tabular Classification and Regression. (arXiv:2207.08548v4 [cs.LG] UPDATED)
    We propose a novel high-performance, parameter- and computationally efficient deep learning architecture for tabular data, Gated Additive Tree Ensemble (GATE). GATE uses a gating mechanism, inspired by GRU, as a feature representation learning unit with an in-built feature selection mechanism. We combine it with an ensemble of differentiable, non-linear decision trees, re-weighted with simple self-attention to predict our desired output. We demonstrate that GATE is a competitive alternative to SOTA approaches like GBDTs, NODE, FT Transformers, etc. by experiments on several public datasets (both classification and regression). We have made available the code at https://github.com/manujosephv/GATE under MIT License.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series in an uncertain environment by considering a statistical hypothesis test on the distance obtained from the Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stakes decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
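    For readers unfamiliar with the quantity being tested, the DTW distance itself can be computed with a short dynamic program. The sketch below is a textbook formulation, not the paper's inference procedure; it assumes an absolute-difference local cost (other choices, such as squared difference, are common), and the function name `dtw_distance` is introduced here for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment cost of a[:i] with b[:j].
    Each step may advance in a, in b, or in both, so one sample can be
    matched to several consecutive samples of the other series.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

    Because the optimal alignment path is itself chosen from the data, the distribution of the resulting distance is hard to characterize directly, which is the difficulty the abstract describes.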
    Constrained Parameter Inference as a Principle for Learning. (arXiv:2203.13203v5 [cs.NE] UPDATED)
    Learning in neural networks is often framed as a problem in which targeted error signals are directly propagated to parameters and used to produce updates that induce more optimal network behaviour. Backpropagation of error (BP) is an example of such an approach and has proven to be a highly successful application of stochastic gradient descent to deep neural networks. We propose constrained parameter inference (COPI) as a new principle for learning. The COPI approach assumes that learning can be set up in a manner where parameters infer their own values based upon observations of their local neuron activities. We find that this estimation of network parameters is possible under the constraints of decorrelated neural inputs and top-down perturbations of neural states for credit assignment. We show that the decorrelation required for COPI allows learning at extremely high learning rates, competitive with that of adaptive optimizers, as used by BP. We further demonstrate that COPI affords a new approach to feature analysis and network compression. Finally, we argue that COPI may shed new light on learning in biological networks given the evidence for decorrelation in the brain.
    Demystifying Reinforcement Learning in Time-Varying Systems. (arXiv:2201.05560v2 [cs.LG] UPDATED)
    Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity, shed light on the range of approaches to them and develop a robust framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (i) identifies different environments encountered by the live system, (ii) triggers exploration when necessary, (iii) takes precautions to retain knowledge from prior environments, and (iv) employs safeguards to protect the system's performance when the RL agent makes mistakes. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that all components of the framework are necessary to cope with non-stationarity and provide guidance on alternative design choices for each component.
    AdaBoost is not an Optimal Weak to Strong Learner. (arXiv:2301.11571v1 [cs.LG])
    AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and other classic variations thereof, is sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner.  ( 2 min )
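    To make the "combining weak classifiers" idea concrete, here is a minimal sketch of standard AdaBoost on 1-D data with threshold stumps as the weak learner. The stump learner, the helper names, and the error clamp are assumptions made for this illustration; nothing here reproduces the paper's analysis or the Larsen-Ritzert algorithm.

```python
import math


def stump_predict(x, threshold, sign):
    """Weak classifier: predicts `sign` above the threshold, `-sign` otherwise."""
    return sign if x > threshold else -sign


def best_stump(xs, ys, weights):
    """Exhaustively pick the stump with the smallest weighted training error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best


def adaboost_fit(xs, ys, rounds=5):
    """Return a list of (alpha, threshold, sign) triples; labels are +1 / -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, sign))
        # Up-weight misclassified points, down-weight correct ones.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score >= 0 else -1


xs, ys = [1, 2, 3, 4], [-1, -1, 1, 1]
model = adaboost_fit(xs, ys, rounds=3)
predictions = [adaboost_predict(model, x) for x in xs]
```

    Each round re-weights the training points so that the next weak classifier concentrates on the examples the current ensemble gets wrong.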
    A critical look at deep neural network for dynamic system modeling. (arXiv:2301.11604v1 [cs.LG])
    Neural network models have become increasingly popular as dynamic modeling tools in the control community. They have many appealing features, including nonlinear structures and the ability to approximate any function. While most researchers hold optimistic attitudes towards such models, this paper questions the capability of (deep) neural networks for the modeling of dynamic systems using input-output data. For the identification of linear time-invariant (LTI) dynamic systems, two representative neural network models, Long Short-Term Memory (LSTM) and Cascade Forward Neural Network (CFNN), are compared to the standard Prediction Error Method (PEM) of system identification. In the comparison, four essential aspects of system identification are considered, then several possible defects and neglected issues of neural network based modeling are pointed out. Detailed simulation studies are performed to verify these defects: for the LTI system, both LSTM and CFNN fail to deliver consistent models even in noise-free cases; and they give worse results than PEM in noisy cases.  ( 2 min )
    Neural Episodic Control with State Abstraction. (arXiv:2301.11490v1 [cs.LG])
    Existing Deep Reinforcement Learning (DRL) algorithms suffer from sample inefficiency. Generally, episodic control-based approaches are solutions that leverage highly-rewarded past experiences to improve sample efficiency of DRL algorithms. However, previous episodic control-based approaches fail to utilize the latent information from the historical behaviors (e.g., state transitions, topological similarities, etc.) and lack scalability during DRL training. This work introduces Neural Episodic Control with State Abstraction (NECSA), a simple but effective state abstraction-based episodic control containing a more comprehensive episodic memory, a novel state evaluation, and a multi-step state analysis. We evaluate our approach on the MuJoCo and Atari tasks in OpenAI gym domains. The experimental results indicate that NECSA achieves higher sample efficiency than the state-of-the-art episodic control-based approaches. Our data and code are available at the project website: https://sites.google.com/view/drl-necsa.  ( 2 min )
    Cellular Network Capacity and Coverage Enhancement with MDT Data and Deep Reinforcement Learning. (arXiv:2202.10968v1 [cs.NI] CROSS LISTED)
    Recent years have witnessed a remarkable increase in the availability of data and computing resources in communication networks. This contributed to the rise of data-driven over model-driven algorithms for network automation. This paper investigates a Minimization of Drive Tests (MDT)-driven Deep Reinforcement Learning (DRL) algorithm to optimize coverage and capacity by tuning antenna tilts on a cluster of cells from TIM's cellular network. We jointly utilize MDT data, electromagnetic simulations, and network Key Performance Indicators (KPIs) to define a simulated network environment for the training of a Deep Q-Network (DQN) agent. Some tweaks have been introduced to the classical DQN formulation to improve the agent's sample efficiency, stability, and performance. In particular, a custom exploration policy is designed to introduce soft constraints at training time. Results show that the proposed algorithm outperforms baseline approaches like DQN and best-first search in terms of long-term reward and sample efficiency. Our results indicate that MDT-driven approaches constitute a valuable tool for autonomous coverage and capacity optimization of mobile radio networks.
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We propose a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, generalizes better to unseen data, and has a similar number of parameters as NOs. On the one hand, we study the generalization numerically by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss on a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a sequential structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth, which allow us to generate deep networks. In each case, we observe a decrease in the test loss across a wide variety of initializations, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds on the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower bound of that of the NOs, so the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than that of NOs, which further strengthens our numerical results.
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms, one for each of the two settings, that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they apply only to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light on the empirical observations, thanks to the generality of the loss functions.
    ExplainableFold: Understanding AlphaFold Prediction with Explainable AI. (arXiv:2301.11765v1 [cs.AI])
    This paper presents ExplainableFold, an explainable AI framework for protein structure prediction. Despite the success of AI-based methods such as AlphaFold in this field, the underlying reasons for their predictions remain unclear due to the black-box nature of deep learning models. To address this, we propose a counterfactual learning framework inspired by biological principles to generate counterfactual explanations for protein structure prediction, enabling a dry-lab experimentation approach. Our experimental results demonstrate the ability of ExplainableFold to generate high-quality explanations for AlphaFold's predictions, providing near-experimental understanding of the effects of amino acids on 3D protein structure. This framework has the potential to facilitate a deeper understanding of protein structures.
    Gene Teams are on the Field: Evaluation of Variants in Gene-Networks Using High Dimensional Modelling. (arXiv:2301.11763v1 [cs.LG])
    In medical genetics, each genetic variant is evaluated as an independent entity regarding its clinical importance. However, in most complex diseases, variant combinations in specific gene networks, rather than the presence of a particular single variant, predominate. In the case of complex diseases, disease status can be evaluated by considering the success level of a team of specific variants. We propose a high dimensional modelling based method to analyse all the variants in a gene network together. To evaluate our method, we selected two gene networks, mTOR and TGF-Beta. For each pathway, we generated 400 control and 400 patient group samples. mTOR and TGF-Beta pathways contain 31 and 93 genes of varying sizes, respectively. We produced Chaos Game Representation images for each gene sequence to obtain 2-D binary patterns. These patterns were arranged in succession, and a 3-D tensor structure was achieved for each gene network. Features for each data sample were acquired by exploiting Enhanced Multivariance Products Representation to 3-D data. Features were split into training and testing vectors. Training vectors were employed to train a Support Vector Machine classification model. We achieved more than 96% and 99% classification accuracies for mTOR and TGF-Beta networks, respectively, using a limited amount of training samples.
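    The Chaos Game Representation step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the corner assignment for A, C, G, T, the grid size, and the function name `cgr_binary_image` are all assumptions made for the example.

```python
# Assumed corner layout on the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_binary_image(sequence, size=8):
    """Return a size x size grid of 0/1 marking cells visited by the chaos game."""
    grid = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # start at the centre of the unit square
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] = 1
    return grid


if __name__ == "__main__":
    for row in cgr_binary_image("ACGTTGCAAC", size=4):
        print("".join(str(cell) for cell in row))
```

    Stacking one such grid per gene would give the kind of 3-D tensor the abstract describes, although the actual image resolution and ordering used by the authors are not stated here.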
    Invariant Meta Learning for Out-of-Distribution Generalization. (arXiv:2301.11779v1 [cs.LG])
    Modern deep learning techniques have demonstrated excellent capabilities in many areas, but rely on large training data. Optimization-based meta-learning trains a model on a variety of tasks, such that it can solve new learning tasks using only a small number of training samples. However, these methods assume that training and test data are identically and independently distributed. To overcome this limitation, in this paper, we propose invariant meta learning for out-of-distribution tasks. Specifically, invariant meta learning finds an invariant optimal meta-initialization, and adapts quickly to out-of-distribution tasks with a regularization penalty. Extensive experiments demonstrate the effectiveness of our proposed invariant meta learning on out-of-distribution few-shot tasks.
    Detecting Pump&Dump Stock Market Manipulation from Online Forums. (arXiv:2301.11403v1 [cs.SI])
    The intersection of social media, low-cost trading platforms, and naive investors has created an ideal situation for information-based market manipulations, especially pump&dumps. Manipulators accumulate small-cap stocks, disseminate false information on social media to inflate their price, and sell at the peak. We collect a dataset of stocks whose price and volume profiles have the characteristic shape of a pump&dump, and social media posts for those same stocks that match the timing of the initial price rises. From these we build predictive models for pump&dump events based on the language used in the social media posts. There are multiple difficulties: not every post will cause the intended market reaction, some pump&dump events may be triggered by posts in other forums, and there may be accidental confluences of post timing and market movements. Nevertheless, our best model achieves a prediction accuracy of 85% and an F1-score of 62%. Such a tool can provide early warning to investors and regulators that a pump&dump may be underway.
    Deep Residual Compensation Convolutional Network without Backpropagation. (arXiv:2301.11663v1 [cs.CV])
    PCANet and its variants provided good accuracy results for classification tasks. However, despite the importance of network depth in achieving good classification accuracy, these networks were trained with a maximum of nine layers. In this paper, we introduce a residual compensation convolutional network, which is the first PCANet-like network trained with hundreds of layers while improving classification accuracy. The design of the proposed network consists of several convolutional layers, each followed by post-processing steps and a classifier. To correct the classification errors and significantly increase the network's depth, we train each layer with new labels derived from the residual information of all its preceding layers. This learning mechanism is accomplished by traversing the network's layers in a single forward pass without backpropagation or gradient computations. Our experiments on four distinct classification benchmarks (MNIST, CIFAR-10, CIFAR-100, and TinyImageNet) show that our deep network outperforms all existing PCANet-like networks and is competitive with several traditional gradient-based models.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
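    For orientation, the augmented inverse probability weighting (AIPW) estimator that combines the fitted nuisance models has the standard form sketched below. The function and argument names are hypothetical, and the nuisance predictions are plain lists here; in the paper they would come from the fitted CNNs.

```python
def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Standard AIPW estimate of the average causal effect.

    y      : observed outcomes
    t      : binary treatment indicators (0 or 1)
    e_hat  : estimated propensity scores P(T=1 | X), strictly between 0 and 1
    m1_hat : estimated outcome under treatment, E[Y | T=1, X]
    m0_hat : estimated outcome under control,   E[Y | T=0, X]
    """
    n = len(y)
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e_hat, m1_hat, m0_hat):
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1 - ei))
    return total / n


if __name__ == "__main__":
    print(aipw_ate(y=[3.0, 1.0], t=[1, 0], e_hat=[0.5, 0.5],
                   m1_hat=[2.0, 2.0], m0_hat=[1.0, 1.0]))
```

    The residual terms correct the outcome-model contrast wherever the outcome model is off, which is why the estimator remains consistent if either nuisance model is well estimated.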
    Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers. (arXiv:2301.11578v1 [cs.LG])
    Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand for deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining the predictive performance across remaining data. To this end, we define instance-wise unlearning, of which the goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation-level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of remaining data while unlearning given instances in both single-task and continual unlearning scenarios.  ( 2 min )
    CAPoW: Context-Aware AI-Assisted Proof of Work based DDoS Defense. (arXiv:2301.11767v1 [cs.CR])
    Critical servers can be secured against distributed denial of service (DDoS) attacks using proof of work (PoW) systems assisted by an Artificial Intelligence (AI) that learns contextual network request patterns. In this work, we introduce CAPoW, a context-aware anti-DDoS framework that injects latency adaptively during communication by utilizing context-aware PoW puzzles. In CAPoW, a security professional can define relevant request context attributes which can be learned by the AI system. These contextual attributes can include information about the user request, such as IP address, time, flow-level information, etc., and are utilized to generate a contextual score for incoming requests that influences the hardness of a PoW puzzle. These puzzles need to be solved by a user before the server begins to process their request. Solving puzzles slows down the volume of incoming adversarial requests. Additionally, the framework compels the adversary to incur a cost per request, hence making it expensive for an adversary to prolong a DDoS attack. We include the theoretical foundations of the CAPoW framework along with a description of its implementation and evaluation.
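    The idea of a contextual score setting puzzle hardness can be illustrated with a generic hash-based puzzle. This is a sketch under assumptions, not the CAPoW implementation: the linear score-to-difficulty mapping, the SHA-256 leading-zero-bits puzzle, and all function names are hypothetical, and the score is assumed to lie in [0, 1] with higher values meaning a more suspicious request.

```python
import hashlib


def difficulty_from_score(score, min_bits=4, max_bits=20):
    """Map an assumed contextual score in [0, 1] to required leading zero bits."""
    score = max(0.0, min(1.0, score))
    return min_bits + round(score * (max_bits - min_bits))


def leading_zero_bits(digest):
    """Count the leading zero bits of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge, nonce, bits):
    """Server side: one hash checks a claimed solution."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= bits


def solve(challenge, bits):
    """Client side: brute-force search, about 2**bits hashes on average."""
    nonce = 0
    while not verify(challenge, nonce, bits):
        nonce += 1
    return nonce


if __name__ == "__main__":
    bits = difficulty_from_score(0.25)  # low-suspicion request, easy puzzle
    nonce = solve(b"request-1234", bits)
    print(bits, nonce, verify(b"request-1234", nonce, bits))
```

    Each extra required bit doubles the expected client work while verification stays a single hash, which is the asymmetry that makes sustained adversarial request volume costly.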
    A denoising diffusion model for fluid flow prediction. (arXiv:2301.11661v1 [cs.LG])
    We propose a novel denoising diffusion generative model for predicting nonlinear fluid fields named FluidDiff. By performing a diffusion process, the model is able to learn a complex representation of the high-dimensional dynamic system, and then Langevin sampling is used to generate predictions for the flow state under specified initial conditions. The model is trained with finite, discrete fluid simulation data. We demonstrate that our model has the capacity to model the distribution of simulated training data and that it gives accurate predictions on the test data. Without encoded prior knowledge of the underlying physical system, it shares competitive performance with other deep learning models for fluid prediction, which is promising for investigation on new computational fluid dynamics methods.  ( 2 min )
    TransNet: Transferable Neural Networks for Partial Differential Equations. (arXiv:2301.11701v1 [math.NA])
    Transfer learning for partial differential equations (PDEs) aims to develop a pre-trained neural network that can be used to solve a wide class of PDEs. Existing transfer learning approaches require substantial information about the target PDEs, such as their formulation and/or solution data, for pre-training. In this work, we propose to construct transferable neural feature spaces from purely function approximation perspectives without using PDE information. The construction of the feature space involves re-parameterization of the hidden neurons and uses auxiliary functions to tune the resulting feature space. Theoretical analysis shows the high quality of the produced feature space, i.e., uniformly distributed neurons. Extensive numerical experiments verify the outstanding performance of our method, including significantly improved transferability, e.g., using the same feature space for various PDEs with different domains and boundary conditions, and the superior accuracy, e.g., several orders of magnitude smaller mean squared error than the state of the art methods.  ( 2 min )
    Mixed Attention Network for Hyperspectral Image Denoising. (arXiv:2301.11525v1 [cs.CV])
    Hyperspectral image denoising is unique for the highly similar and correlated spectral information that should be properly considered. However, existing methods show limitations in exploring the spectral correlations across different bands and feature interactions within each band. Besides, the low- and high-level features usually exhibit different importance for different spatial-spectral regions, which is not fully explored for current algorithms as well. In this paper, we present a Mixed Attention Network (MAN) that simultaneously considers the inter- and intra-spectral correlations as well as the interactions between low- and high-level spatial-spectral meaningful features. Specifically, we introduce a multi-head recurrent spectral attention that efficiently integrates the inter-spectral features across all the spectral bands. These features are further enhanced with a progressive spectral channel attention by exploring the intra-spectral relationships. Moreover, we propose an attentive skip-connection that adaptively controls the proportion of the low- and high-level spatial-spectral features from the encoder and decoder to better enhance the aggregated features. Extensive experiments show that our MAN outperforms existing state-of-the-art methods on simulated and real noise settings while maintaining a low cost of parameters and running time.
    FedHP: Heterogeneous Federated Learning with Privacy-preserving. (arXiv:2301.11705v1 [cs.LG])
    Federated Learning is a distributed machine learning environment, which ensures that clients complete collaborative training without sharing private data, only by exchanging parameters. However, the data does not satisfy the same distribution and the computing resources of clients are different, which brings challenges to the related research. To better solve the above heterogeneous problems, we designed a novel federated learning method. The local model consists of the pre-trained model as the backbone and fully connected layers as the head. The backbone can extract features for the head, and the embedding vector of classes is shared between clients to optimize the head so that the local model can perform better. By sharing the embedding vector of classes, instead of parameters based on gradient space, clients can better adapt to private data, and it is more efficient in the communication between the server and clients. To better protect privacy, we proposed a privacy-preserving hybrid method to add noise to the embedding vector of classes, which has less impact on the local model performance under the premise of satisfying differential privacy. We conduct a comprehensive evaluation with other federated learning methods on the self-built vehicle dataset under non-independent identically distributed(Non-IID)  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
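    One way to quantify the (dis)agreement described above is the probability that two distinct models, trained on different training subsets, assign the same label to a test example. The sketch below uses that pairwise-agreement quantity as an assumed stand-in for the paper's self-consistency definition; the function name is hypothetical.

```python
def self_consistency(predictions):
    """Fraction of ordered pairs of distinct models that agree on one example.

    predictions: labels assigned to a single test example by B >= 2 models.
    Returns 1.0 when all models agree; lower values mean a less stable decision.
    """
    b = len(predictions)
    if b < 2:
        raise ValueError("need predictions from at least two models")
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    # Ordered pairs of distinct models that predict the same label.
    agreeing_pairs = sum(c * (c - 1) for c in counts.values())
    return agreeing_pairs / (b * (b - 1))


if __name__ == "__main__":
    print(self_consistency([1, 1, 1, 1]))  # unanimous
    print(self_consistency([0, 0, 1, 1]))  # evenly split
```

    An evenly split binary example scores well below 1, which matches the abstract's point that such decisions are effectively arbitrary even when aggregate fairness metrics look acceptable.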
    Incorporating Knowledge into Document Summarization: an Application of Prefix-Tuning on GPT-2. (arXiv:2301.11719v1 [cs.CL])
    Despite the great development of document summarization techniques nowadays, factual inconsistencies between the generated summaries and the original text still occur from time to time. This paper proposes a prefix-tuning-based approach that uses a set of trainable continuous prefix prompts together with discrete prompts to aid model generation, which makes a significant impact on both CNN/Daily Mail and XSum summaries generated using GPT-2. The improvements in fact preservation in the generated summaries indicate the effectiveness of adopting this prefix-tuning-based method in knowledge-enhanced document summarization, and also show great potential for other natural language processing tasks.  ( 2 min )
    Behaviour Discriminator: A Simple Data Filtering Method to Improve Offline Policy Learning. (arXiv:2301.11734v1 [cs.LG])
    This paper studies the problem of learning a control policy without the need for interactions with the environment; instead, learning purely from an existing dataset. Prior work has demonstrated that offline learning algorithms (e.g., behavioural cloning and offline reinforcement learning) are more likely to discover a satisfactory policy when trained using high-quality expert data. However, many real-world/practical datasets can contain significant proportions of examples generated using low-skilled agents. Therefore, we propose a behaviour discriminator (BD) concept, a novel and simple data filtering approach based on semi-supervised learning, which can accurately discern expert data from a mixed-quality dataset. Our BD approach was used to pre-process the mixed-skill-level datasets from the Real Robot Challenge (RRC) III, an open competition requiring participants to solve several dexterous robotic manipulation tasks using offline learning methods; the new BD method allowed a standard behavioural cloning algorithm to outperform other more sophisticated offline learning algorithms. Moreover, we demonstrate that the new BD pre-processing method can be applied to a number of D4RL benchmark problems, improving the performance of multiple state-of-the-art offline reinforcement learning algorithms.  ( 2 min )
    Large-Scale Traffic Data Imputation with Spatiotemporal Semantic Understanding. (arXiv:2301.11691v1 [cs.LG])
    Large-scale data missing is a challenging problem in Intelligent Transportation Systems (ITS). Many studies have been carried out to impute large-scale traffic data by considering their spatiotemporal correlations at a network level. In existing traffic data imputations, however, rich semantic information of a road network has been largely ignored when capturing network-wide spatiotemporal correlations. This study proposes a Graph Transformer for Traffic Data Imputation (GT-TDI) model to impute large-scale traffic data with spatiotemporal semantic understanding of a road network. Specifically, the proposed model introduces semantic descriptions consisting of network-wide spatial and temporal information of traffic data to help the GT-TDI model capture spatiotemporal correlations at a network level. The proposed model takes incomplete data, the social connectivity of sensors, and semantic descriptions as input to perform imputation tasks with the help of Graph Neural Networks (GNN) and Transformer. On the PeMS freeway dataset, extensive experiments are conducted to compare the proposed GT-TDI model with conventional methods, tensor factorization methods, and deep learning-based methods. The results show that the proposed GT-TDI outperforms existing methods in complex missing patterns and diverse missing rates. The code of the GT-TDI model will be available at https://github.com/KP-Zhang/GT-TDI.  ( 2 min )
    A Green(er) World for A.I. (arXiv:2301.11581v1 [cs.AI])
    As research and practice in artificial intelligence (A.I.) grow in leaps and bounds, the resources necessary to sustain and support their operations also grow at an increasing pace. While innovations and applications from A.I. have brought significant advances, from applications to vision and natural language to improvements to fields like medical imaging and materials engineering, their costs should not be neglected. As we embrace a world with ever-increasing amounts of data as well as research and development of A.I. applications, we are sure to face an ever-mounting energy footprint to sustain these computational budgets, data storage needs, and more. But, is this sustainable and, more importantly, what kind of setting is best positioned to nurture such sustainable A.I. in both research and practice? In this paper, we outline our outlook for Green A.I. -- a more sustainable, energy-efficient and energy-aware ecosystem for developing A.I. across the research, computing, and practitioner communities alike -- and the steps required to arrive there. We present a bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenters/HPCs to the current incentive structures in the world of A.I. research and practice, and more. We hope these points will spur further discussion, and action, on some of these issues and their potential solutions.  ( 2 min )
    Solving Constrained Reinforcement Learning through Augmented State and Reward Penalties. (arXiv:2301.11592v1 [cs.LG])
    Constrained Reinforcement Learning has been employed to enforce safety constraints on policy through the use of expected cost constraints. The key challenge is in handling expected cost accumulated using the policy and not just in a single step. Existing methods have developed innovative ways of converting this cost constraint over entire policy to constraints over local decisions (at each time step). While such approaches have provided good solutions with regards to objective, they can either be overly aggressive or conservative with respect to costs. This is owing to use of estimates for "future" or "backward" costs in local cost constraints. To that end, we provide an equivalent unconstrained formulation to constrained RL that has an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, this provides a new paradigm for solving constrained RL problems effectively. As we show in our experimental results, we are able to outperform leading approaches on multiple benchmark problems from literature.
    Learning to Generate All Feasible Actions. (arXiv:2301.11461v1 [cs.LG])
    Several machine learning (ML) applications are characterized by searching for an optimal solution to a complex task. The search space for this optimal solution is often very large, so large in fact that this optimal solution is often not computable. Part of the problem is that many candidate solutions found via ML are actually infeasible and have to be discarded. Restricting the search space to only the feasible solution candidates simplifies finding an optimal solution for the tasks. Further, the set of feasible solutions could be re-used in multiple problems characterized by different tasks. In particular, we observe that complex tasks can be decomposed into subtasks and corresponding skills. We propose to learn a reusable and transferable skill by training an actor to generate all feasible actions. The trained actor can then propose feasible actions, among which an optimal one can be chosen according to a specific task. The actor is trained by interpreting the feasibility of each action as a target distribution. The training procedure minimizes a divergence of the actor's output distribution to this target. We derive the general optimization target for arbitrary f-divergences using a combination of kernel density estimates, resampling, and importance sampling. We further utilize an auxiliary critic to reduce the interactions with the environment. A preliminary comparison to related strategies shows that our approach learns to visit all the modes in the feasible action space, demonstrating the framework's potential for learning skills that can be used in various downstream tasks.
    Down the Rabbit Hole: Detecting Online Extremism, Radicalisation, and Politicised Hate Speech. (arXiv:2301.11579v1 [cs.SI])
    Social media is a modern person's digital voice to project and engage with new ideas and mobilise communities – a power shared with extremists. Given the societal risks of unvetted content-moderating algorithms for Extremism, Radicalisation, and Hate speech (ERH) detection, responsible software engineering must understand the who, what, when, where, and why such models are necessary to protect user safety and free expression. Hence, we propose and examine the unique research field of ERH context mining to unify disjoint studies. Specifically, we evaluate the start-to-finish design process from socio-technical definition-building and dataset collection strategies to technical algorithm design and performance. Our 2015-2021 51-study Systematic Literature Review (SLR) provides the first cross-examination of textual, network, and visual approaches to detecting extremist affiliation, hateful content, and radicalisation towards groups and movements. We identify consensus-driven ERH definitions and propose solutions to existing ideological and geographic biases, particularly due to the lack of research in Oceania/Australasia. Our hybridised investigation on Natural Language Processing, Community Detection, and visual-text models demonstrates the dominating performance of textual transformer-based algorithms. We conclude with vital recommendations for ERH context mining researchers and propose an uptake roadmap with guidelines for researchers, industries, and governments to enable a safer cyberspace.  ( 2 min )
    Adversarial Learning for Implicit Semantic-Aware Communications. (arXiv:2301.11589v1 [cs.LG])
    Semantic communication is a novel communication paradigm that focuses on recognizing and delivering the desired meaning of messages to the destination users. Most existing works in this area focus on delivering explicit semantics, labels or signal features that can be directly identified from the source signals. In this paper, we consider the implicit semantic communication problem in which hidden relations and closely related semantic terms that cannot be recognized from the source signals need to also be delivered to the destination user. We develop a novel adversarial learning-based implicit semantic-aware communication (iSAC) architecture in which the source user, instead of maximizing the total amount of information transmitted to the channel, aims to help the recipient learn an inference rule that can automatically generate implicit semantics based on limited clue information. We prove that by applying iSAC, the destination user can always learn an inference rule that matches the true inference rule of the source messages. Experimental results show that the proposed iSAC can offer up to a 19.69 dB improvement over existing non-inferential communication solutions, in terms of symbol error rate at the destination user.  ( 2 min )
    Can We Use Probing to Better Understand Fine-tuning and Knowledge Distillation of the BERT NLU?. (arXiv:2301.11688v1 [cs.CL])
    In this article, we use probing to investigate phenomena that occur during fine-tuning and knowledge distillation of a BERT-based natural language understanding (NLU) model. Our ultimate purpose was to use probing to better understand practical production problems and consequently to build better NLU models. We designed experiments to see how fine-tuning changes the linguistic capabilities of BERT, what the optimal size of the fine-tuning dataset is, and what amount of information is contained in a distilled NLU based on a tiny Transformer. The results of the experiments show that the probing paradigm in its current form is not well suited to answer such questions. Structural, Edge and Conditional probes do not take into account how easy it is to decode probed information. Consequently, we conclude that quantification of information decodability is critical for many practical applications of the probing paradigm.  ( 2 min )
    Online Learning in Stackelberg Games with an Omniscient Follower. (arXiv:2301.11518v1 [cs.LG])
    We study the problem of online learning in a two-player decentralized cooperative Stackelberg game. In each round, the leader first takes an action, followed by the follower who takes their action after observing the leader's move. The goal of the leader is to learn to minimize the cumulative regret based on the history of interactions. Differing from the traditional formulation of repeated Stackelberg games, we assume the follower is omniscient, with full knowledge of the true reward, and that they always best-respond to the leader's actions. We analyze the sample complexity of regret minimization in this repeated Stackelberg game. We show that depending on the reward structure, the existence of the omniscient follower may change the sample complexity drastically, from constant to exponential, even for linear cooperative Stackelberg games. This poses unique challenges for the learning process of the leader and the subsequent regret analysis.
    Neural Wasserstein Gradient Flows for Maximum Mean Discrepancies with Riesz Kernels. (arXiv:2301.11624v1 [cs.LG])
    Wasserstein gradient flows of maximum mean discrepancy (MMD) functionals with non-smooth Riesz kernels show a rich structure as singular measures can become absolutely continuous ones and conversely. In this paper we contribute to the understanding of such flows. We propose to approximate the backward scheme of Jordan, Kinderlehrer and Otto for computing such Wasserstein gradient flows as well as a forward scheme for so-called Wasserstein steepest descent flows by neural networks (NNs). Since we cannot restrict ourselves to absolutely continuous measures, we have to deal with transport plans and velocity plans instead of usual transport maps and velocity fields. Indeed, we approximate the disintegration of both plans by generative NNs which are learned with respect to appropriate loss functions. In order to evaluate the quality of both neural schemes, we benchmark them on the interaction energy. Here we provide analytic formulas for Wasserstein schemes starting at a Dirac measure and show their convergence as the time step size tends to zero. Finally, we illustrate our neural MMD flows by numerical examples.
    SNeRL: Semantic-aware Neural Radiance Fields for Reinforcement Learning. (arXiv:2301.11520v1 [cs.LG])
    As previous representations for reinforcement learning cannot effectively incorporate a human-intuitive understanding of the 3D environment, they usually suffer from sub-optimal performances. In this paper, we present Semantic-aware Neural Radiance Fields for Reinforcement Learning (SNeRL), which jointly optimizes semantic-aware neural radiance fields (NeRF) with a convolutional encoder to learn 3D-aware neural implicit representation from multi-view images. We introduce 3D semantic and distilled feature fields in parallel to the RGB radiance fields in NeRF to learn semantic and object-centric representation for reinforcement learning. SNeRL outperforms not only previous pixel-based representations but also recent 3D-aware representations both in model-free and model-based reinforcement learning.  ( 2 min )
    PLay: Parametrically Conditioned Layout Generation using Latent Diffusion. (arXiv:2301.11529v1 [cs.LG])
    Layout design is an important task in various design fields, including user interfaces, document, and graphic design. As this task requires tedious manual effort by designers, prior works have attempted to automate this process using generative models, but commonly fell short of providing intuitive user controls and achieving design objectives. In this paper, we build a conditional latent diffusion model, PLay, that generates parametrically conditioned layouts in vector graphic space from user-specified guidelines, which are commonly used by designers for representing their design intents in current practices. Our method outperforms prior works across three datasets on metrics including FID and FD-VG, and in user test. Moreover, it brings a novel and interactive experience to professional layout design processes.  ( 2 min )
    Improving deep learning precipitation nowcasting by using prior knowledge. (arXiv:2301.11707v1 [cs.LG])
    Deep learning methods dominate short-term high-resolution precipitation nowcasting in terms of prediction error. However, their operational usability is limited by difficulties explaining dynamics behind the predictions, which are smoothed out and missing the high-frequency features due to optimizing for mean error loss functions. We experiment with hand-engineering of the advection-diffusion differential equation into a PhyCell to introduce more accurate physical prior to a PhyDNet model that disentangles physical and residual dynamics. Results indicate that while PhyCell can learn the intended dynamics, training of PhyDNet remains driven by loss optimization, resulting in a model with the same prediction capabilities.
    Class-Incremental Learning with Repetition. (arXiv:2301.11396v1 [cs.LG])
    Real-world data streams naturally include the repetition of previous concepts. From a Continual Learning (CL) perspective, repetition is a property of the environment and, unlike replay, cannot be controlled by the user. Nowadays, Class-Incremental scenarios represent the leading test-bed for assessing and comparing CL strategies. This family of scenarios is very easy to use, but it never allows revisiting previously seen classes, thus completely disregarding the role of repetition. We focus on the family of Class-Incremental with Repetition (CIR) scenarios, where repetition is embedded in the definition of the stream. We propose two stochastic scenario generators that produce a wide range of CIR scenarios starting from a single dataset and a few control parameters. We conduct the first comprehensive evaluation of repetition in CL by studying the behavior of existing CL strategies under different CIR scenarios. We then present a novel replay strategy that exploits repetition and counteracts the natural imbalance present in the stream. On both CIFAR100 and TinyImageNet, our strategy outperforms other replay approaches, which are not designed for environments with repetition.  ( 2 min )
    PhysGraph: Physics-Based Integration Using Graph Neural Networks. (arXiv:2301.11841v1 [cs.GR])
    Physics-based simulation of mesh-based domains remains a challenging task. State-of-the-art techniques can produce realistic results but require expert knowledge. A major bottleneck in many approaches is the step of integrating a potential energy in order to compute velocities or displacements. Recently, learning-based methods for physics-based simulation have sparked interest, with graph-based approaches being a promising research direction. One of the challenges for these methods is to generate models that are mesh independent and generalize to different material properties. Moreover, the model should also be able to react to unforeseen external forces like ubiquitous collisions. Our contribution is based on a simple observation: evaluating forces is computationally relatively cheap for traditional simulation methods and can be computed in parallel, in contrast to their integration. If we learn how a system reacts to forces in general, irrespective of their origin, we can learn an integrator that can predict state changes due to the total forces with high generalization power. We effectively factor out the physical model behind resulting forces by relying on an opaque force module. We demonstrate that this idea leads to a learnable module that can be trained on basic internal forces of small mesh patches and generalizes to different mesh topologies, resolutions, material parameters and unseen forces like collisions at inference time. Our proposed paradigm is general and can be used to model a variety of physical phenomena. We focus our exposition on the detail enhancement of coarse clothing geometry, which has many applications including computer games, virtual reality and virtual try-on.  ( 2 min )
    BOMP-NAS: Bayesian Optimization Mixed Precision NAS. (arXiv:2301.11810v1 [cs.LG])
    Bayesian Optimization Mixed-Precision Neural Architecture Search (BOMP-NAS) is an approach to quantization-aware neural architecture search (QA-NAS) that leverages both Bayesian optimization (BO) and mixed-precision quantization (MP) to efficiently search for compact, high performance deep neural networks. The results show that integrating quantization-aware fine-tuning (QAFT) into the NAS loop is a necessary step to find networks that perform well under low-precision quantization: integrating it allows a model size reduction of nearly 50% on the CIFAR-10 dataset. BOMP-NAS is able to find neural networks that achieve state of the art performance at much lower design costs. This study shows that BOMP-NAS can find these neural networks at a 6x shorter search time compared to the closest related work.  ( 2 min )
    Feature space exploration as an alternative for design space exploration beyond the parametric space. (arXiv:2301.11416v1 [cs.LG])
    This paper compares the parametric design space with a feature space generated by the extraction of design features using deep learning (DL) as an alternative way for design space exploration. In this comparison, the parametric design space is constructed by creating a synthetic dataset of 15,000 elements using a parametric algorithm and reducing its dimensions for visualization. The feature space - a reduced-dimensionality vector space of embedded data features - is constructed by training a DL model on the same dataset. We analyze and compare the extracted design features by reducing their dimension and visualizing the results. We demonstrate that the parametric design space is narrow in how it describes the design solutions because it is based on the combination of individual parameters. In comparison, we observed that the feature design space can intuitively represent design solutions according to complex parameter relationships. Based on our results, we discuss the potential of translating the features learned by DL models to provide a mechanism for intuitive design space exploration and visualization of possible design solutions.
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic resonance imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate that DBGDGM outperforms baselines in graph generation and dynamic link prediction, and is comparable in graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in the neuroscience literature.  ( 2 min )
    Nik Defense: An Artificial Intelligence Based Defense Mechanism against Selfish Mining in Bitcoin. (arXiv:2301.11463v1 [cs.CR])
    The Bitcoin cryptocurrency has received much attention recently. In the Bitcoin network, transactions are recorded in a ledger. In this network, the process of recording transactions depends on nodes called miners that execute a protocol known as the mining protocol. One of the significant aspects of the mining protocol is incentive compatibility. However, the literature has shown that Bitcoin's mining protocol is not incentive-compatible. Some nodes with high computational power can obtain more revenue than their fair share by adopting a type of attack called the selfish mining attack. In this paper, we propose an artificial intelligence-based defense against selfish mining attacks by applying the theory of learning automata. The proposed defense mechanism ignores private blocks by assigning weight based on block discovery time and changes Bitcoin's current fork-resolving policy by evaluating branches' height difference in a self-adaptive manner utilizing learning automata. To the best of our knowledge, the proposed protocol is the first learning-based defense mechanism in the literature. Simulation results show the superiority of the proposed mechanism over the tie-breaking mechanism, which is a well-known defense. The simulation results also show that the suggested defense mechanism increases the profit threshold up to 40% and decreases the revenue of selfish attackers.  ( 2 min )
    Limitless stability for Graph Convolutional Networks. (arXiv:2301.11443v1 [cs.LG])
    This work establishes rigorous, novel and widely applicable stability guarantees and transferability bounds for graph convolutional networks -- without reference to any underlying limit object or statistical distribution. Crucially, utilized graph-shift operators (GSOs) are not necessarily assumed to be normal, allowing for the treatment of networks on both undirected and, for the first time, also directed graphs. Stability to node-level perturbations is related to an 'adequate (spectral) covering' property of the filters in each layer. Stability to edge-level perturbations is related to Lipschitz constants and newly introduced semi-norms of filters. Results on stability to topological perturbations are obtained through recently developed mathematical-physics based tools. As an important and novel example, it is showcased that graph convolutional networks are stable under graph-coarse-graining procedures (replacing strongly-connected sub-graphs by single nodes) precisely if the GSO is the graph Laplacian and filters are regular at infinity. These new theoretical results are supported by corresponding numerical investigations.  ( 2 min )
    Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective. (arXiv:2301.11535v1 [cs.LG])
    Multivariate time series (MTS) forecasting has penetrated and benefited our daily life. However, unfair forecasting of MTSs not only degrades their practical benefit but also brings about serious potential risk. Such unfair MTS forecasting may be attributed to variable disparity leading to advantaged and disadvantaged variables. This issue has rarely been studied in existing MTS forecasting models. To address this significant gap, we formulate the MTS fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-irrelevant and -relevant representations for the downstream forecasting. FairFor first adopts recurrent graph convolution to capture spatio-temporal variable correlations and to group variables by leveraging a spectral relaxation of the K-means objective. Then, it utilizes a novel filtering & fusion module to filter the group-relevant information and generate group-irrelevant representations by orthogonality regularization. The group-irrelevant and -relevant representations form highly informative representations, facilitating knowledge sharing from advantaged variables to disadvantaged variables and guaranteeing fairness. Extensive experiments on four public datasets demonstrate FairFor's effectiveness for fair forecasting and its significant performance improvement.  ( 2 min )
    Policy Optimization with Robustness Certificates. (arXiv:2301.11374v1 [cs.LG])
    We present a policy optimization framework in which the learned policy comes with a machine-checkable certificate of adversarial robustness. Our approach, called CAROL, learns a model of the environment. In each learning iteration, it uses the current version of this model and an external abstract interpreter to construct a differentiable signal for provable robustness. This signal is used to guide policy learning, and the abstract interpretation used to construct it directly leads to the robustness certificate returned at convergence. We give a theoretical analysis that bounds the worst-case accumulative reward of CAROL. We also experimentally evaluate CAROL on four MuJoCo environments. On these tasks, which involve continuous state and action spaces, CAROL learns certified policies that have performance comparable to the (non-certified) policies learned using state-of-the-art robust RL methods.  ( 2 min )
    Diffusion Denoising for Low-Dose-CT Model. (arXiv:2301.11482v1 [eess.IV])
    Low-dose Computed Tomography (LDCT) reconstruction is an important task in medical image analysis. Recent years have seen many deep-learning-based methods that have proved effective in this area. However, these methods mostly follow a supervised architecture, which needs paired full-dose and quarter-dose CT images, and the solution is highly dependent on specific measurements. In this work, we introduce the Denoising Diffusion LDCT Model (DDLM), which generates noise-free CT images using conditioned sampling. DDLM uses a pretrained model and needs no training or tuning, so our proposal is unsupervised. Experiments on LDCT images show that DDLM achieves comparable performance while using less inference time than other state-of-the-art methods, proving both accurate and efficient. The implementation code will be made public soon.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
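    The double cover mentioned in this abstract can be checked numerically: a unit quaternion q and its negation -q describe the same rotation. The sketch below uses the standard quaternion-to-matrix formula; the `rotation_density` helper, which sums a quaternion density over both preimages, is a hypothetical illustration of one way to obtain a well-defined density on rotations and is not taken from the paper.

```python
import math


def normalize(q):
    """Scale a 4-tuple (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rotation_density(q, quat_density):
    """Hypothetical helper: density of the rotation represented by q,
    obtained by summing a quaternion density over both preimages q and -q."""
    neg_q = tuple(-c for c in q)
    return quat_density(q) + quat_density(neg_q)


q = normalize((1.0, 2.0, 3.0, 4.0))
neg_q = tuple(-c for c in q)
# Every matrix entry is quadratic in q, so both print the same rotation.
print(quat_to_matrix(q))
print(quat_to_matrix(neg_q))
```

    Because the summed density takes the same value at q and -q, it depends only on the rotation, which is the property a likelihood on the rotation group needs.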
    Alien Coding. (arXiv:2301.11479v1 [cs.AI])
    We introduce a self-learning algorithm for synthesizing programs for OEIS sequences. The algorithm starts from scratch, initially generating programs at random. Then it runs many iterations of a self-learning loop that interleaves (i) training neural machine translation to learn the correspondence between sequences and the programs discovered so far, and (ii) proposing many new programs for each OEIS sequence by the trained neural machine translator. The algorithm discovers on its own programs for more than 78000 OEIS sequences, sometimes developing unusual programming methods. We analyze its behavior and the invented programs in several experiments.  ( 2 min )
    Exploring Deep Reinforcement Learning for Holistic Smart Building Control. (arXiv:2301.11510v1 [cs.LG])
    In this paper, we take a holistic approach to deal with the tradeoffs between energy use and comfort in commercial buildings. We developed a system called OCTOPUS, which employs a novel deep reinforcement learning (DRL) framework that uses a data-driven approach to find the optimal control sequences of all of a building's subsystems, including HVAC, lighting, blind and window systems. The DRL architecture includes a novel reward function that allows the framework to explore the tradeoffs between energy use and users' comfort, while at the same time enabling the solution of the high-dimensional control problem due to the interactions of four different building subsystems. In order to cope with OCTOPUS's data training requirements, we argue that calibrated simulations that match the target building operational points are the vehicle to generate enough data to be able to train our DRL framework to find the control solution for the target building. In our work, we trained OCTOPUS with 10-year weather data and a building model that is implemented in the EnergyPlus building simulator, which was calibrated using data from a real production building. Through extensive simulations, we demonstrate that OCTOPUS can achieve 14.26% and 8.1% energy savings compared with the state-of-the-art rule-based method in a LEED Gold Certified building and the latest DRL-based method available in the literature, respectively, while maintaining human comfort within a desired range.  ( 2 min )
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool in studying climate modeling. Tree ring growth has been used in various applications, for example, studying the history of buildings and the environment. As a tree grows over time, it adds a new layer of wood beneath its bark. After years of growth, the sequence of tree ring widths forms a time series. The purpose of this paper is to use ML algorithms and Extreme Value Theory to analyse a set of tree ring width data from nine trees growing in Nottinghamshire. Initially, we explore the data through a variety of descriptive statistical approaches. Transforming the data is important at this stage to reveal any problems for the modelling algorithm. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. The method developed in this study is compared with other methods. The extreme values of the dataset are also investigated further. The results show that the Random Forest method gives accurate results in the analysis of the tree ring width data from the nine trees growing in Nottinghamshire, with the lowest Root Mean Square Error value. We also notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks. (arXiv:2301.11560v1 [cs.LG])
    As a few large-scale pre-trained models become the major choices of various applications, new challenges arise for model pruning, e.g., can we avoid pruning the same model from scratch for every downstream task? How can we reuse the pruning results of previous tasks to accelerate the pruning for a new task? To address these challenges, we create a small model for a new task from the pruned models of similar tasks. We show that a few fine-tuning steps on this model suffice to produce a promising pruned model for the new task. We study this "meta-pruning" from nearest tasks on two major classes of pre-trained models, convolutional neural network (CNN) and vision transformer (ViT), under a limited budget of pruning iterations. Our study begins by investigating the overlap of pruned models for similar tasks and how the overlap changes over different layers and blocks. Inspired by these discoveries, we develop a simple but effective "Meta-Vote Pruning (MVP)" method that significantly reduces the pruning iterations for a new task by initializing a sub-network from the pruned models of its nearest tasks. In experiments, we demonstrate MVP's advantages in accuracy, efficiency, and generalization through extensive empirical studies and comparisons with popular pruning methods over several datasets.  ( 2 min )
    Model-based Offline Reinforcement Learning with Local Misspecification. (arXiv:2301.11426v1 [cs.LG])
    We present a model-based offline reinforcement learning policy performance lower bound that explicitly captures dynamics model misspecification and distribution mismatch and we propose an empirical algorithm for optimal offline policy selection. Theoretically, we prove a novel safe policy improvement theorem by establishing pessimism approximations to the value function. Our key insight is to jointly consider selecting over dynamics models and policies: as long as a dynamics model can accurately represent the dynamics of the state-action pairs visited by a given policy, it is possible to approximate the value of that particular policy. We analyze our lower bound in the LQR setting and also show competitive performance to previous lower bounds on policy selection across a set of D4RL tasks.  ( 2 min )
    Direct Parameterization of Lipschitz-Bounded Deep Networks. (arXiv:2301.11526v1 [cs.LG])
    This paper introduces a new parameterization of deep neural networks (both fully-connected and convolutional) with guaranteed Lipschitz bounds, i.e. limited sensitivity to perturbations. The Lipschitz guarantees are equivalent to the tightest-known bounds based on certification via a semidefinite program (SDP), which does not scale to large models. In contrast to the SDP approach, we provide a "direct" parameterization, i.e. a smooth mapping from $\mathbb{R}^N$ onto the set of weights of Lipschitz-bounded networks. This enables training via standard gradient methods, without any computationally intensive projections or barrier terms. The new parameterization can equivalently be thought of as either a new layer type (the sandwich layer), or a novel parameterization of standard feedforward networks with parameter sharing between neighbouring layers. We illustrate the method with some applications in image classification (MNIST and CIFAR-10).  ( 2 min )
    Graph Scattering beyond Wavelet Shackles. (arXiv:2301.11456v1 [cs.LG])
    This work develops a flexible and mathematically sound framework for the design and analysis of graph scattering networks with variable branching ratios and generic functional calculus filters. Spectrally-agnostic stability guarantees for node- and graph-level perturbations are derived; the vertex-set non-preserving case is treated by utilizing recently developed mathematical-physics based tools. Energy propagation through the network layers is investigated and related to truncation stability. New methods of graph-level feature aggregation are introduced and stability of the resulting composite scattering architectures is established. Finally, scattering transforms are extended to edge- and higher order tensorial input. Theoretical results are complemented by numerical investigations: suitably chosen scattering networks conforming to the developed theory perform better than traditional graph-wavelet based scattering approaches in social network graph classification tasks and significantly outperform other graph-based learning approaches to regression of quantum-chemical energies on QM7.  ( 2 min )
    MG-GNN: Multigrid Graph Neural Networks for Learning Multilevel Domain Decomposition Methods. (arXiv:2301.11378v1 [cs.LG])
    Domain decomposition methods (DDMs) are popular solvers for discretized systems of partial differential equations (PDEs), with one-level and multilevel variants. These solvers rely on several algorithmic and mathematical parameters, prescribing overlap, subdomain boundary conditions, and other properties of the DDM. While some work has been done on optimizing these parameters, it has mostly focused on the one-level setting or special cases such as structured-grid discretizations with regular subdomain construction. In this paper, we propose multigrid graph neural networks (MG-GNN), a novel GNN architecture for learning optimized parameters in two-level DDMs. We train MG-GNN using a new unsupervised loss function, enabling effective training on small problems that yields robust performance on unstructured grids that are orders of magnitude larger than those in the training set. We show that MG-GNN outperforms popular hierarchical graph network architectures for this optimization and that our proposed loss function is critical to achieving this improved performance.  ( 2 min )
    Rethinking 1x1 Convolutions: Can we train CNNs with Frozen Random Filters?. (arXiv:2301.11360v1 [cs.CV])
    Modern CNNs are learning the weights of vast numbers of convolutional operators. In this paper, we raise the fundamental question if this is actually necessary. We show that even in the extreme case of only randomly initializing and never updating spatial filters, certain CNN architectures can be trained to surpass the accuracy of standard training. By reinterpreting the notion of pointwise ($1\times 1$) convolutions as an operator to learn linear combinations (LC) of frozen (random) spatial filters, we are able to analyze these effects and propose a generic LC convolution block that allows tuning of the linear combination rate. Empirically, we show that this approach not only allows us to reach high test accuracies on CIFAR and ImageNet but also has favorable properties regarding model robustness, generalization, sparsity, and the total number of necessary weights. Additionally, we propose a novel weight sharing mechanism, which allows sharing of a single weight tensor between all spatial convolution layers to massively reduce the number of weights.  ( 2 min )
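    The reinterpretation in this abstract rests on linearity: a pointwise (1x1) convolution applied to the outputs of frozen spatial filters equals a single convolution with the linear combination of those filters. The sketch below checks this for one input and one output channel in plain Python; the names `lc_block` and `combined_filter` are hypothetical and this is not the authors' LC block implementation.

```python
import random


def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def lc_block(image, frozen_filters, mix_weights):
    """Frozen random spatial filters followed by a 1x1 convolution.
    Only mix_weights would be trained; frozen_filters are never updated."""
    maps = [conv2d_valid(image, f) for f in frozen_filters]
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [sum(wt * m[i][j] for wt, m in zip(mix_weights, maps)) for j in range(w)]
        for i in range(h)
    ]


def combined_filter(frozen_filters, mix_weights):
    """The single spatial filter that the block above is equivalent to."""
    kh, kw = len(frozen_filters[0]), len(frozen_filters[0][0])
    return [
        [sum(wt * f[a][b] for wt, f in zip(mix_weights, frozen_filters)) for b in range(kw)]
        for a in range(kh)
    ]


rng = random.Random(0)
image = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(6)]
filters = [[[rng.gauss(0, 1) for _ in range(3)] for _ in range(3)] for _ in range(4)]
mix = [0.5, -1.0, 2.0, 0.25]
two_step = lc_block(image, filters, mix)
one_step = conv2d_valid(image, combined_filter(filters, mix))
# Largest absolute difference between the two computations (floating-point noise only).
print(max(abs(a - b) for ra, rb in zip(two_step, one_step) for a, b in zip(ra, rb)))
```

    With enough frozen filters to span the space of 3x3 kernels, the learned mixing weights can in principle reproduce any spatial filter, which gives some intuition for why training only the pointwise layer can be sufficient.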
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Learning Vortex Dynamics for Fluid Inference and Prediction. (arXiv:2301.11494v1 [cs.LG])
    We propose a novel machine learning method based on differentiable vortex particles to infer and predict fluid dynamics from a single video. The key design of our system is a particle-based latent space to encapsulate the hidden, Lagrangian vortical evolution underpinning the observable, Eulerian flow phenomena. We devise a novel differentiable vortex particle system in conjunction with their learnable, vortex-to-velocity dynamics mapping to effectively capture and represent the complex flow features in a reduced space. We further design an end-to-end training pipeline to directly learn and synthesize simulators from data, that can reliably deliver future video rollouts based on limited observation. The value of our method is twofold: first, our learned simulator enables the inference of hidden physics quantities (e.g. velocity field) purely from visual observation, to be used for motion analysis; secondly, it also supports future prediction, constructing the input video's sequel along with its future dynamics evolution. We demonstrate our method's efficacy by comparing quantitatively and qualitatively with a range of existing methods on both synthetic and real-world videos, displaying improved data correspondence, visual plausibility, and physical integrity.  ( 2 min )
    Revisiting Discriminative Entropy Clustering and its relation to K-means. (arXiv:2301.11405v1 [cs.LG])
    Maximization of mutual information between the model's input and output is formally related to "decisiveness" and "fairness" of the softmax predictions, motivating such unsupervised entropy-based losses for discriminative neural networks. Recent self-labeling methods based on such losses represent the state of the art in deep clustering. However, some important properties of entropy clustering are not well-known, or even misunderstood. For example, we provide a counterexample to prior claims about equivalence to variance clustering (K-means) and point out technical mistakes in such theories. We discuss the fundamental differences between these discriminative and generative clustering approaches. Moreover, we show the susceptibility of standard entropy clustering to narrow margins and motivate an explicit margin maximization term. We also propose an improved self-labeling loss; it is robust to pseudo-labeling errors and enforces stronger fairness. We develop an EM algorithm for our loss that is significantly faster than the standard alternatives. Our results improve the state-of-the-art on standard benchmarks.  ( 2 min )
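    The relation between mutual information, "decisiveness", and "fairness" stated in this abstract follows from the standard decomposition I(X; Y) = H(mean prediction) - mean H(prediction). A minimal sketch under that decomposition, with hypothetical function names, treating each softmax output as a list of class probabilities:

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector; 0 * log 0 is taken as 0."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def mutual_information(predictions):
    """Empirical mutual information between inputs and predicted labels.

    predictions: list of softmax vectors, one per input.
    The first term rewards balanced cluster sizes ("fairness");
    subtracting the second rewards confident predictions ("decisiveness").
    """
    n = len(predictions)
    k = len(predictions[0])
    mean_pred = [sum(p[c] for p in predictions) / n for c in range(k)]
    fairness = entropy(mean_pred)
    mean_pred_entropy = sum(entropy(p) for p in predictions) / n
    return fairness - mean_pred_entropy


# Confident and balanced predictions reach the maximum, log(K) for K classes.
print(mutual_information([[1.0, 0.0], [0.0, 1.0]]))
# Undecided predictions carry no information about the input.
print(mutual_information([[0.5, 0.5], [0.5, 0.5]]))
```

    A clustering loss based on this quantity would minimize its negative; assigning every input confidently to one class also scores zero, because the fairness term vanishes.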
    Multi-limb Split Learning for Tumor Classification on Vertically Distributed Data. (arXiv:2301.11468v1 [eess.IV])
    Brain tumors are one of the life-threatening forms of cancer. Previous studies have classified brain tumors using deep neural networks. In this paper, we perform the latter task using a collaborative deep learning technique, more specifically split learning. Split learning allows collaborative learning by splitting a neural network into two (or more) parts, a client-side network and a server-side network. The client-side network is trained up to a certain layer called the cut layer. Then, the rest of the training is resumed on the server-side network. Vertical distribution, a method for distributing data among organizations, was implemented where several hospitals hold different attributes of information for the same set of patients. To the best of our knowledge, this is the first paper to implement both split learning and vertical distribution for brain tumor classification. Using both techniques, we were able to achieve train and test accuracy greater than 90% and 70%, respectively.  ( 2 min )
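    The protocol described in this abstract can be sketched with a toy model: each client holds one vertical slice of the features and the layers up to the cut layer, and the server holds the rest. Everything below is a hypothetical illustration (a single tanh unit per client, a logistic output on the server, synthetic data), not the paper's architecture; only cut-layer activations and their gradients cross the client/server boundary.

```python
import math
import random


class Client:
    """Holds one feature slice and the network part up to the cut layer."""

    def __init__(self, n_features, rng):
        self.w = [rng.uniform(-0.5, 0.5) for _ in range(n_features)]

    def forward(self, x):
        self.x = x
        self.h = math.tanh(sum(w * v for w, v in zip(self.w, x)))
        return self.h  # cut-layer activation sent to the server

    def backward(self, grad_h, lr):
        # Gradient received from the server, pushed through tanh.
        grad_z = grad_h * (1.0 - self.h ** 2)
        self.w = [w - lr * grad_z * v for w, v in zip(self.w, self.x)]


class Server:
    """Combines the clients' activations and holds the labels."""

    def __init__(self, n_clients, rng):
        self.v = [rng.uniform(-0.5, 0.5) for _ in range(n_clients)]
        self.b = 0.0

    def train_step(self, hs, y, lr):
        z = sum(v * h for v, h in zip(self.v, hs)) + self.b
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep the logarithms finite
        loss = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
        grad_z = p - y
        grad_hs = [grad_z * v for v in self.v]  # sent back to the clients
        self.v = [v - lr * grad_z * h for v, h in zip(self.v, hs)]
        self.b -= lr * grad_z
        return loss, grad_hs


def make_toy_data(n, rng):
    """Synthetic samples: two feature slices per sample and a binary label."""
    data = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in range(2)]
        b = [rng.uniform(-1, 1) for _ in range(2)]
        data.append(((a, b), 1.0 if a[0] + b[0] > 0 else 0.0))
    return data


def train(data, clients, server, epochs=100, lr=0.2):
    """Per-sample training; returns the mean loss of each epoch."""
    losses = []
    for _ in range(epochs):
        total = 0.0
        for slices, y in data:
            hs = [c.forward(x) for c, x in zip(clients, slices)]
            loss, grad_hs = server.train_step(hs, y, lr)
            for c, g in zip(clients, grad_hs):
                c.backward(g, lr)
            total += loss
        losses.append(total / len(data))
    return losses


rng = random.Random(0)
losses = train(make_toy_data(40, rng), [Client(2, rng), Client(2, rng)], Server(2, rng))
print(losses[0], losses[-1])
```

    In this sketch the raw features never leave a client and the labels never leave the server, which is the privacy argument usually made for split learning on vertically distributed data.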
    Are Equivariant Equilibrium Approximators Beneficial?. (arXiv:2301.11481v1 [cs.GT])
    Recently, remarkable progress has been made by approximating Nash equilibrium (NE), correlated equilibrium (CE), and coarse correlated equilibrium (CCE) through function approximation that trains a neural network to predict equilibria from game representations. Furthermore, equivariant architectures are widely adopted in designing such equilibrium approximators in normal-form games. In this paper, we theoretically characterize benefits and limitations of equivariant equilibrium approximators. For the benefits, we show that they enjoy better generalizability than general ones and can achieve better approximations when the payoff distribution is permutation-invariant. For the limitations, we discuss their drawbacks in terms of equilibrium selection and social welfare. Together, our results help to understand the role of equivariance in equilibrium approximators.  ( 2 min )
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    Learning Modulo Theories. (arXiv:2301.11435v1 [cs.LG])
    Recent techniques that integrate solver layers into Deep Neural Networks (DNNs) have shown promise in bridging a long-standing gap between inductive learning and symbolic reasoning techniques. In this paper we present a set of techniques for integrating Satisfiability Modulo Theories (SMT) solvers into the forward and backward passes of a deep network layer, called SMTLayer. Using this approach, one can encode rich domain knowledge into the network in the form of mathematical formulas. In the forward pass, the solver uses symbols produced by prior layers, along with these formulas, to construct inferences; in the backward pass, the solver informs updates to the network, driving it towards representations that are compatible with the solver's theory. Notably, the solver need not be differentiable. We implement SMTLayer as a PyTorch module, and our empirical results show that it leads to models that 1) require fewer training samples than conventional models, 2) are robust to certain types of covariate shift, and 3) ultimately learn representations that are consistent with symbolic knowledge, and thus naturally interpretable.  ( 2 min )
    Causal Structural Learning from Time Series: A Convex Optimization Approach. (arXiv:2301.11336v1 [cs.LG])
    Structural learning, which aims to learn directed acyclic graphs (DAGs) from observational data, is foundational to causal reasoning and scientific discovery. Recent advancements formulate structural learning as a continuous optimization problem; however, DAG learning remains a highly non-convex problem, and there has not been much work on leveraging well-developed convex optimization techniques for causal structural learning. We fill this gap by proposing a data-adaptive linear approach for causal structural learning from time series data, which can be conveniently cast as a convex optimization problem using a recently developed monotone operator variational inequality (VI) formulation. Furthermore, we establish a non-asymptotic recovery guarantee for the VI-based approach and show the superior performance of our proposed method on structure recovery over existing methods via extensive numerical experiments.  ( 2 min )
    A Simple Algorithm For Scaling Up Kernel Methods. (arXiv:2301.11414v1 [cs.LG])
    The recent discovery of the equivalence between infinitely wide neural networks (NNs) in the lazy training regime and Neural Tangent Kernels (NTKs) (Jacot et al., 2018) has revived interest in kernel methods. However, conventional wisdom suggests kernel methods are unsuitable for large samples due to their computational complexity and memory requirements. We introduce a novel random feature regression algorithm that allows us (when necessary) to scale to virtually infinite numbers of random features. We illustrate the performance of our method on the CIFAR-10 dataset.  ( 2 min )
    Coincident Learning for Unsupervised Anomaly Detection. (arXiv:2301.11368v1 [cs.LG])
    Anomaly detection is an important task for complex systems (e.g., industrial facilities, manufacturing, large-scale science experiments), where failures in a sub-system can lead to low yield, faulty products, or even damage to components. While complex systems often have a wealth of data, labeled anomalies are typically rare (or even nonexistent) and expensive to acquire. In this paper, we introduce a new method, called CoAD, for training anomaly detection models on unlabeled data, based on the expectation that anomalous behavior in one sub-system will produce coincident anomalies in downstream sub-systems and products. Given data split into two streams $s$ and $q$ (i.e., subsystem diagnostics and final product quality), we define an unsupervised metric, $\hat{F}_\beta$, by analogy to the supervised classification $F_\beta$ statistic, which quantifies the performance of the independent anomaly detection algorithms on $s$ and $q$ based on their coincidence rate. We demonstrate our method in four cases: a synthetic time-series data set, a synthetic imaging data set generated from MNIST, a metal milling data set, and a data set taken from a particle accelerator.  ( 2 min )
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    A Hybrid Deep Neural Operator/Finite Element Method for Ice-Sheet Modeling. (arXiv:2301.11402v1 [physics.comp-ph])
    One of the most challenging and consequential problems in climate modeling is to provide probabilistic projections of sea level rise. A large part of the uncertainty of sea level projections is due to uncertainty in ice sheet dynamics. At the moment, accurate quantification of the uncertainty is hindered by the cost of ice sheet computational models. In this work, we develop a hybrid approach to approximate existing ice sheet computational models at a fraction of their cost. Our approach consists of replacing the finite element model for the momentum equations for the ice velocity, the most expensive part of an ice sheet model, with a Deep Operator Network, while retaining a classic finite element discretization for the evolution of the ice thickness. We show that the resulting hybrid model is very accurate and it is an order of magnitude faster than the traditional finite element model. Further, a distinctive feature of the proposed model compared to other neural network approaches, is that it can handle high-dimensional parameter spaces (parameter fields) such as the basal friction at the bed of the glacier, and can therefore be used for generating samples for uncertainty quantification. We study the impact of hyper-parameters, number of unknowns and correlation length of the parameter distribution on the training and accuracy of the Deep Operator Network on a synthetic ice sheet model. We then target the evolution of the Humboldt glacier in Greenland and show that our hybrid model can provide accurate statistics of the glacier mass loss and can be effectively used to accelerate the quantification of uncertainty.  ( 2 min )
  • Open

    Differential Privacy has Bounded Impact on Fairness in Classification. (arXiv:2210.16242v2 [cs.LG] UPDATED)
    We theoretically study the impact of differential privacy on fairness in classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on accuracy conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.
    Achieving Risk Control in Online Learning Settings. (arXiv:2205.09095v7 [cs.LG] UPDATED)
    To provide rigorous uncertainty quantification for online learning models, we develop a framework for constructing uncertainty sets that provably control risk -- such as coverage of confidence intervals, false negative rate, or F1 score -- in the online setting. This extends conformal prediction to apply to a larger class of online learning problems. Our method guarantees risk control at any user-specified level even when the underlying data distribution shifts drastically, even adversarially, over time in an unknown fashion. The technique we propose is highly flexible as it can be applied with any base online learning algorithm (e.g., a deep neural network trained online), requiring minimal implementation effort and essentially zero additional computational cost. We further extend our approach to control multiple risks simultaneously, so the prediction sets we generate are valid for all given risks. To demonstrate the utility of our method, we conduct experiments on real-world tabular time-series data sets showing that the proposed method rigorously controls various natural risks. Furthermore, we show how to construct valid intervals for an online image-depth estimation problem that previous sequential calibration schemes cannot handle.  ( 2 min )
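    The idea of steering a long-run risk toward a user-specified level under distribution shift can be illustrated with a generic online threshold update. The sketch below is not the paper's algorithm: it assumes a scalar interval half-width `theta`, a 0/1 miscoverage loss, and a fixed step size, and the names `online_risk_control`, `target_risk`, `step`, and `theta0` are hypothetical. After each observation the half-width grows when the interval missed and shrinks slightly when it covered.

```python
import random


def online_risk_control(residuals, target_risk=0.1, step=0.05, theta0=1.0):
    """Adapt an interval half-width so the running miscoverage tracks target_risk.

    residuals: stream of (y_t - prediction_t) values from any base model.
    Returns the final half-width and the average 0/1 miscoverage loss.
    """
    theta = theta0
    total_loss = 0.0
    for r in residuals:
        # Loss is 1 when y_t falls outside [prediction_t - theta, prediction_t + theta].
        loss = 1.0 if abs(r) > theta else 0.0
        total_loss += loss
        # Widen after a miss, narrow slightly after a hit.
        theta += step * (loss - target_risk)
    return theta, total_loss / len(residuals)


# Synthetic stream whose noise scale triples halfway through (a distribution shift).
random.seed(0)
residuals = [random.gauss(0.0, 1.0) for _ in range(5000)]
residuals += [random.gauss(0.0, 3.0) for _ in range(5000)]
theta, avg_loss = online_risk_control(residuals)
```

    Summing the update over time shows why this works: the gap between the average loss and `target_risk` equals `(theta - theta0) / (step * T)`, so it shrinks as long as `theta` stays bounded, whatever the data distribution does.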
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v3 [stat.ML] UPDATED)
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rate regimes. We validate our analysis via a thorough empirical study, also showing that, unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.
    Big portfolio selection by graph-based conditional moments method. (arXiv:2301.11697v1 [stat.ML])
    How to do big portfolio selection is very important but challenging for both researchers and practitioners. In this paper, we propose a new graph-based conditional moments (GRACE) method to do portfolio selection based on thousands of stocks or more. The GRACE method first learns the conditional quantiles and mean of stock returns via a factor-augmented temporal graph convolutional network, which guides the learning procedure through a factor-hypergraph built by the set of stock-to-stock relations from the domain knowledge as well as the set of factor-to-stock relations from the asset pricing knowledge. Next, the GRACE method learns the conditional variance, skewness, and kurtosis of stock returns from the learned conditional quantiles by using the quantiled conditional moment (QCM) method. The QCM method is a supervised learning procedure to learn these conditional higher-order moments, so it largely overcomes the computational difficulty of the classical high-dimensional GARCH-type methods. Moreover, the QCM method tolerates some mis-specification in modeling conditional quantiles, due to its regression-based nature. Finally, the GRACE method uses the learned conditional mean, variance, skewness, and kurtosis to construct several performance measures, which are criteria to sort the stocks to carry out portfolio selection in the well-known 10-decile framework. An application to NASDAQ and NYSE stock markets shows that the GRACE method performs much better than its competitors, particularly when the performance measures are comprised of conditional variance, skewness, and kurtosis.  ( 2 min )
    Personalised Federated Learning On Heterogeneous Feature Spaces. (arXiv:2301.11447v1 [cs.LG])
    Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace, i.e., all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignment mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.  ( 2 min )
    Fine-tuning Neural-Operator architectures for training and generalization. (arXiv:2301.11509v1 [cs.LG])
    In this work, we present an analysis of the generalization of Neural Operators (NOs) and derived architectures. We propose a family of networks, which we name (${\textit{s}}{\text{NO}}+\varepsilon$), where we modify the layout of NOs towards an architecture resembling a Transformer; mainly, we substitute the Attention module with the Integral Operator part of NOs. The resulting network preserves universality, has a better generalization to unseen data, and similar number of parameters as NOs. On the one hand, we study numerically the generalization by gradually transforming NOs into ${\textit{s}}{\text{NO}}+\varepsilon$ and verifying a reduction of the test loss considering a time-harmonic wave dataset with different frequencies. We perform the following changes in NOs: (a) we split the Integral Operator (non-local) and the (local) feed-forward network (MLP) into different layers, generating a sequential structure which we call sequential Neural Operator (${\textit{s}}{\text{NO}}$), (b) we add the skip connection and layer normalization in ${\textit{s}}{\text{NO}}$, and (c) we incorporate dropout and stochastic depth that allows us to generate deep networks. In each case, we observe a decrease in the test loss in a wide variety of initializations, indicating that our changes outperform the NO. On the other hand, building on infinite-dimensional Statistics, and in particular the Dudley Theorem, we provide bounds of the Rademacher complexity of NOs and ${\textit{s}}{\text{NO}}$, and we find the following relationship: the upper bound of the Rademacher complexity of the ${\textit{s}}{\text{NO}}$ is a lower bound of the NOs, and therefore the generalization error bound of ${\textit{s}}{\text{NO}}$ is smaller than NO, which further strengthens our numerical results.  ( 2 min )
    On the Relationship Between Explanation and Prediction: A Causal View. (arXiv:2212.06925v3 [cs.LG] UPDATED)
    Explainability has become a central requirement for the development, deployment, and adoption of machine learning (ML) models and we are yet to understand what explanation methods can and cannot do. Several factors such as data, model prediction, hyperparameters used in training the model, and random initialization can all influence downstream explanations. While previous work empirically hinted that explanations (E) may have little relationship with the prediction (Y), there is a lack of conclusive study to quantify this relationship. Our work borrows tools from causal inference to systematically assay this relationship. More specifically, we measure the relationship between E and Y by measuring the treatment effect when intervening on their causal ancestors (hyperparameters) (inputs to generate saliency-based Es or Ys). We discover that Y's relative direct influence on E follows an odd pattern; the influence is higher in the lowest-performing models than in mid-performing models, and it then decreases in the top-performing models. We believe our work is a promising first step towards providing better guidance for practitioners who can make more informed decisions in utilizing these explanations by knowing what factors are at play and how they relate to their end task.  ( 2 min )
    SOBER: Scalable Batch Bayesian Optimization and Quadrature using Recombination Constraints. (arXiv:2301.11832v1 [cs.LG])
    Batch Bayesian optimisation (BO) has been shown to be a sample-efficient method of performing optimisation where expensive-to-evaluate objective functions can be queried in parallel. However, current methods do not scale to large batch sizes -- a frequent desideratum in practice (e.g. drug discovery or simulation-based inference). We present a novel algorithm, SOBER, which permits scalable and diversified batch BO with arbitrary acquisition functions, arbitrary input spaces (e.g. graph), and arbitrary kernels. The key to our approach is to reformulate batch selection for BO as a Bayesian quadrature (BQ) problem, which offers computational advantages. This reformulation is beneficial in solving BQ tasks reciprocally, which introduces the exploitative functionality of BO to BQ. We show that SOBER offers substantive performance gains in synthetic and real-world tasks, including drug discovery and simulation-based inference.  ( 2 min )
    Integrating Random Effects in Deep Neural Networks. (arXiv:2206.03314v3 [stat.ML] UPDATED)
    Modern approaches to supervised learning like deep neural networks (DNNs) typically implicitly assume that observed responses are statistically independent. In contrast, correlated data are prevalent in real-life large-scale applications, with typical sources of correlation including spatial, temporal and clustering structures. These correlations are either ignored by DNNs, or ad-hoc solutions are developed for specific use cases. We propose to use the mixed models framework to handle correlated data in DNNs. By treating the effects underlying the correlation structure as random effects, mixed models are able to avoid overfitted parameter estimates and ultimately yield better predictive performance. The key to combining mixed models and DNNs is using the Gaussian negative log-likelihood (NLL) as a natural loss function that is minimized with DNN machinery including stochastic gradient descent (SGD). Since NLL does not decompose like standard DNN loss functions, the use of SGD with NLL presents some theoretical and implementation challenges, which we address. Our approach which we call LMMNN is demonstrated to improve performance over natural competitors in various correlation scenarios on diverse simulated and real datasets. Our focus is on a regression setting and tabular datasets, but we also show some results for classification. Our code is available at https://github.com/gsimchoni/lmmnn.  ( 2 min )
    Rethinking Assumptions in Deep Anomaly Detection. (arXiv:2006.00339v3 [cs.LG] UPDATED)
    Though anomaly detection (AD) can be viewed as a classification problem (nominal vs. anomalous) it is usually treated in an unsupervised manner since one typically does not have access to, or it is infeasible to utilize, a dataset that sufficiently characterizes what it means to be "anomalous." In this paper we present results demonstrating that this intuition surprisingly seems not to extend to deep AD on images. For a recent AD benchmark on ImageNet, classifiers trained to discern between normal samples and just a few (64) random natural images are able to outperform the current state of the art in deep AD. Experimentally we discover that the multiscale structure of image data makes example anomalies exceptionally informative.  ( 2 min )
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v2 [cs.LG] UPDATED)
    Actor-critic methods have achieved significant success in many challenging applications. However, their finite-time convergence is still poorly understood in the most practical form. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. Existing analyses cannot establish convergence in such a challenging case. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. We develop a novel framework that evaluates and controls the error propagation between actor and critic systematically. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Our results compare favorably to the existing literature in terms of considering the most practical yet challenging settings and requiring weaker assumptions.  ( 2 min )
    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions. (arXiv:2301.11885v1 [stat.ML])
    Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error to heavy tails. Very recently, new generalization bounds have been proven, indicating a non-monotonic relationship between the generalization error and heavy tails, which is more pertinent to the reported empirical observations. While these bounds do not require additional topological assumptions given that SGD can be modeled using a heavy-tailed stochastic differential equation (SDE), they can only apply to simple quadratic problems. In this paper, we build on this line of research and develop generalization bounds for a more general class of objective functions, which includes non-convex functions as well. Our approach is based on developing Wasserstein stability bounds for heavy-tailed SDEs and their discretizations, which we then convert to generalization bounds. Our results do not require any nontrivial assumptions; yet, they shed more light on the empirical observations, thanks to the generality of the loss functions.  ( 2 min )
    Conformal inference is (almost) free for neural networks trained with early stopping. (arXiv:2301.11556v1 [stat.ML])
    Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks. Models trained with early stopping often provide relatively accurate predictions, but they generally still lack precise statistical guarantees unless they are further calibrated using independent hold-out data. This paper addresses the above limitation with conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data. This leads to models that are both accurate and able to provide exact predictive inferences without multiple data splits or overly conservative adjustments. Practical implementations are developed for different learning tasks -- outlier detection, multi-class classification, regression -- and their competitive performance is demonstrated on real data.  ( 2 min )
    Robust Multi-Agent Bandits Over Undirected Graphs. (arXiv:2203.00076v2 [cs.LG] UPDATED)
    We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O( (m + K/n) \log (T) / \Delta )$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K$, this improves over the single-agent baseline regret of $O(K\log(T)/\Delta)$. In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O( ( d_{\text{mal}}(i) + K/n) \log(T)/\Delta)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the $d_{\text{mal}}(i)$ malicious agents directly connected to $i$ affect its long-term regret).  ( 2 min )
    Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation. (arXiv:2209.06620v3 [cs.LG] UPDATED)
    Among the reasons hindering reinforcement learning (RL) applications to real-world problems, two factors are critical: limited data and the mismatch between the testing environment (real environment in which the policy is deployed) and the training environment (e.g., a simulator). This paper attempts to address these issues simultaneously with distributionally robust offline RL, where we learn a distributionally robust policy using historical data obtained from the source environment by optimizing against a worst-case perturbation thereof. In particular, we move beyond tabular settings and consider linear function approximation. More specifically, we consider two settings, one where the dataset is well-explored and the other where the dataset has sufficient coverage of the optimal policy. We propose two algorithms, one for each of the two settings, that achieve error bounds $\tilde{O}(d^{1/2}/N^{1/2})$ and $\tilde{O}(d^{3/2}/N^{1/2})$ respectively, where $d$ is the dimension in the linear function approximation and $N$ is the number of trajectories in the dataset. To the best of our knowledge, they provide the first non-asymptotic results of the sample complexity in this setting. Diverse experiments are conducted to demonstrate our theoretical findings, showing the superiority of our algorithm against the non-robust one.  ( 2 min )
    Statistical Inference for the Dynamic Time Warping Distance, with Application to Abnormal Time-Series Detection. (arXiv:2202.06593v2 [stat.ML] UPDATED)
    We study statistical inference on the similarity/distance between two time-series in an uncertain environment by considering a statistical hypothesis test on the distance obtained from the Dynamic Time Warping (DTW) algorithm. The sampling distribution of the DTW distance is too difficult to derive because it is obtained based on the solution of the DTW algorithm, which is complicated. To circumvent this difficulty, we propose to employ the conditional selective inference framework, which enables us to derive a valid inference method on the DTW distance. To our knowledge, this is the first method that can provide a valid p-value to quantify the statistical significance of the DTW distance, which is helpful for high-stakes decision making such as abnormal time-series detection problems. We evaluate the performance of the proposed inference method on both synthetic and real-world datasets.  ( 2 min )
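    For readers unfamiliar with the quantity being tested, the DTW distance itself is computed by a standard dynamic program. The sketch below shows only that textbook recursion, assuming an absolute-difference local cost and no warping-window constraint; it does not implement the selective inference procedure, and the function name `dtw_distance` is a hypothetical choice.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Uses |a_i - b_j| as the local cost and the classic three-way recursion
    (insertion, deletion, match) over an (n+1) x (m+1) table.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # table[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            table[i][j] = cost + min(
                table[i - 1][j],      # a[i-1] aligned to an already-used b[j-1]
                table[i][j - 1],      # b[j-1] aligned to an already-used a[i-1]
                table[i - 1][j - 1],  # both sequences advance together
            )
    return table[n][m]


# A repeated sample can be absorbed by warping, so these two series are at distance 0.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset of 1 over two points cannot be warped away.
offset = dtw_distance([0, 0], [1, 1])
```

    Because the optimal alignment path depends on the data, the distribution of the returned value is awkward to characterize directly, which is the difficulty the abstract refers to.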
    Learning the Dynamics of Sparsely Observed Interacting Systems. (arXiv:2301.11647v1 [stat.ML])
    We address the problem of learning the dynamics of an unknown non-parametric system linking a target and a feature time series. The feature time series is measured on a sparse and irregular grid, while we have access to only a few points of the target time series. Once learned, we can use these dynamics to predict values of the target from the previous values of the feature time series. We frame this task as learning the solution map of a controlled differential equation (CDE). By leveraging the rich theory of signatures, we are able to cast this non-linear problem as a high-dimensional linear regression. We provide an oracle bound on the prediction error which exhibits explicit dependencies on the individual-specific sampling schemes. Our theoretical results are illustrated by simulations which show that our method outperforms existing algorithms for recovering the full time series while being computationally cheap. We conclude by demonstrating its potential on real-world epidemiological data.  ( 2 min )
    Feasibility and Transferability of Transfer Learning: A Mathematical Framework. (arXiv:2301.11542v1 [cs.LG])
    Transfer learning is an emerging and popular paradigm for utilizing existing knowledge from previous learning tasks to improve the performance of new ones. Despite its numerous empirical successes, theoretical analysis for transfer learning is limited. In this paper we build for the first time, to the best of our knowledge, a mathematical framework for the general procedure of transfer learning. Our unique reformulation of transfer learning as an optimization problem allows for the first time, analysis of its feasibility. Additionally, we propose a novel concept of transfer risk to evaluate transferability of transfer learning. Our numerical studies using the Office-31 dataset demonstrate the potential and benefits of incorporating transfer risk in the evaluation of transfer learning performance.  ( 2 min )
    Single-Trajectory Distributionally Robust Reinforcement Learning. (arXiv:2301.11721v1 [stat.ML])
    As a framework for sequential decision-making, Reinforcement Learning (RL) has been regarded as an essential component leading to Artificial General Intelligence (AGI). However, RL is often criticized for having the same training environment as the test one, which also hinders its application in the real world. To mitigate this problem, Distributionally Robust RL (DRRL) has been proposed to improve the worst-case performance in a set of environments that may contain the unknown test environment. Due to the nonlinearity of the robustness goal, most previous work resorts to model-based approaches, learning with either an empirical distribution learned from the data or a simulator that can be sampled infinitely, which limits its application to environments with simple dynamics. In contrast, we attempt to design a DRRL algorithm that can be trained along a single trajectory, i.e., without repeated sampling from a state. Based on standard Q-learning, we propose distributionally robust Q-learning with a single trajectory (DRQ) and its average-reward variant, named differential DRQ. We provide asymptotic convergence guarantees and experiments for both settings, demonstrating their superiority in perturbed environments over their non-robust counterparts.  ( 2 min )
    I Prefer not to Say: Are Users Penalized for Protecting Personal Data?. (arXiv:2210.13954v3 [cs.LG] UPDATED)
    We examine the problem of obtaining fair outcomes for individuals who choose to share optional information with machine-learned models and those who do not consent and keep their data undisclosed. We find that these non-consenting users receive significantly lower prediction outcomes than justified by their provided information alone. This observation gives rise to the overlooked problem of how to ensure that users, who protect their personal data, are not penalized. While statistical fairness notions focus on fair outcomes between advantaged and disadvantaged groups, these fairness notions fail to protect the non-consenting users. To address this problem, we formalize protection requirements for models which (i) allow users to benefit from sharing optional information and (ii) do not penalize them if they keep their data undisclosed. We offer the first solution to this problem by proposing the notion of Optional Feature Fairness (OFF), which we prove to be loss-optimal under our protection requirements (i) and (ii). To learn OFF-compliant models, we devise a model-agnostic data augmentation strategy with finite sample convergence guarantees. Finally, we extensively analyze OFF on a variety of challenging real-world tasks, models, and data sets with multiple optional features.  ( 2 min )
    Bi-stochastically normalized graph Laplacian: convergence to manifold Laplacian and robustness to outlier noise. (arXiv:2206.11386v2 [math.ST] UPDATED)
    Bi-stochastic normalization provides an alternative normalization of graph Laplacians in graph-based data analysis and can be computed efficiently by Sinkhorn-Knopp (SK) iterations. This paper proves the convergence of the bi-stochastically normalized graph Laplacian to the manifold (weighted-)Laplacian with rates, when $n$ data points are i.i.d. sampled from a general $d$-dimensional manifold embedded in a possibly high-dimensional space. Under a certain joint limit of $n \to \infty$ and kernel bandwidth $\epsilon \to 0$, the point-wise convergence rate of the graph Laplacian operator (under 2-norm) is proved to be $O(n^{-1/(d/2+3)})$ at finite large $n$ up to log factors, achieved at the scaling of $\epsilon \sim n^{-1/(d/2+3)}$. When the manifold data are corrupted by outlier noise, we theoretically prove the graph Laplacian point-wise consistency, which matches the rate for clean manifold data plus an additional term proportional to the boundedness of the inner-products of the noise vectors among themselves and with data vectors. Motivated by our analysis, which suggests that not exact bi-stochastic normalization but an approximate one will achieve the same consistency rate, we propose an approximate and constrained matrix scaling problem that can be solved by SK iterations with early termination. Numerical experiments support our theoretical results and show the robustness of the bi-stochastically normalized graph Laplacian to high-dimensional outlier noise.  ( 2 min )
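To make the Sinkhorn-Knopp step concrete, here is a minimal sketch of scaling a kernel matrix until its rows and columns sum to one, with early termination on a tolerance. It is not the paper's algorithm: the generic two-vector scaling, the Gaussian kernel convention `exp(-d^2 / (4*eps))`, and the names `gaussian_kernel`, `sinkhorn_knopp`, `max_iter`, and `tol` are all assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(points, eps):
    """Dense Gaussian affinity matrix; the 1/(4*eps) scaling is an assumed convention."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (4.0 * eps))
             for q in points] for p in points]


def sinkhorn_knopp(K, max_iter=1000, tol=1e-10):
    """Scale K to W = diag(u) K diag(v) with (approximately) unit row and column sums.

    Stops early once the largest row-sum deviation from 1 falls below `tol`.
    """
    n = len(K)
    u = [1.0] * n
    v = [1.0] * n
    for _ in range(max_iter):
        # Row update: make every row of diag(u) K diag(v) sum to 1.
        u = [1.0 / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Column update: make every column sum to 1 (this perturbs the rows again).
        v = [1.0 / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
        row_err = max(abs(u[i] * sum(K[i][j] * v[j] for j in range(n)) - 1.0)
                      for i in range(n))
        if row_err < tol:
            break
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]


random.seed(0)
points = [(random.random(), random.random()) for _ in range(12)]
W = sinkhorn_knopp(gaussian_kernel(points, eps=0.25))
row_sums = [sum(row) for row in W]
col_sums = [sum(W[i][j] for i in range(len(W))) for j in range(len(W))]
```

Loosening `tol` or lowering `max_iter` gives an approximately bi-stochastic matrix, which is the kind of early-terminated scaling the abstract alludes to, although the constrained problem the authors actually solve may differ.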
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v3 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as self-consistency. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recover the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.  ( 2 min )
    Convergence of Batch Updating Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results. (arXiv:2209.05372v2 [math.OC] UPDATED)
    In this paper, we present a unified and general framework for analyzing the batch updating approach to nonlinear, high-dimensional optimization. The framework encompasses all the currently used batch updating approaches, and is applicable to nonconvex as well as convex functions. Moreover, the framework permits the use of noise-corrupted gradients, as well as first-order approximations to the gradient (sometimes referred to as "gradient-free" approaches). By viewing the analysis of the iterations as a problem in the convergence of stochastic processes, we are able to establish a very general theorem, which includes most known convergence results for zeroth-order and first-order methods. The analysis of "second-order" or momentum-based methods is not a part of this paper, and will be studied elsewhere. However, numerical experiments indicate that momentum-based methods can fail if the true gradient is replaced by its first-order approximation. This requires further theoretical analysis.  ( 2 min )
    Explaining Patterns in Data with Language Models via Interpretable Autoprompting. (arXiv:2210.01848v2 [cs.LG] UPDATED)
    Large language models (LLMs) have displayed an impressive ability to harness natural language to perform complex tasks. In this work, we explore whether we can leverage this learned ability to find and explain patterns in data. Specifically, given a pre-trained LLM and data examples, we introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt iteratively alternates between generating explanations with an LLM and reranking them based on their performance when used as a prompt. Experiments on a wide range of datasets, from synthetic mathematics to natural-language understanding, show that iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions. Moreover, the prompts produced by iPrompt are simultaneously human-interpretable and highly effective for generalization: on real-world sentiment classification datasets, iPrompt produces prompts that match or even improve upon human-written prompts for GPT-3. Finally, experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery. All code for using the methods and data here is made available on Github.  ( 2 min )
    A Deep Learning Method for Comparing Bayesian Hierarchical Models. (arXiv:2301.11873v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled approach for assessing the relative merits of competing computational models and propagating uncertainty into model selection decisions. However, BMC is often intractable for the popular class of hierarchical models due to their high-dimensional nested parameter structure. To address this intractability, we propose a deep learning method for performing BMC on any set of hierarchical models which can be instantiated as probabilistic programs. Since our method enables amortized inference, it allows efficient re-estimation of posterior model probabilities and fast performance validation prior to any real-data application. In a series of extensive validation studies, we benchmark the performance of our method against the state-of-the-art bridge sampling method and demonstrate excellent amortized inference across all BMC settings. We then use our method to compare four hierarchical evidence accumulation models that have previously been deemed intractable for BMC due to partly implicit likelihoods. In this application, we corroborate evidence for the recently proposed Lévy flight model of decision-making and show how transfer learning can be leveraged to enhance training efficiency. Reproducible code for all analyses is provided.  ( 2 min )
    CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators. (arXiv:2210.06812v2 [cs.LG] UPDATED)
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to utilize any trained classifier to estimate: (1) A consensus label for each example that aggregates the available annotations; (2) A confidence score for how likely each consensus label is correct; (3) A rating for each annotator quantifying the overall correctness of their labels. Existing algorithms to estimate related quantities in crowdsourcing often rely on sophisticated generative models with iterative inference. CROWDLAB instead uses a straightforward weighted ensemble. Existing algorithms often rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB utilizes any classifier model trained on these features, and can thus better generalize between examples with similar features. On real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than existing algorithms like Dawid-Skene/GLAD.  ( 2 min )
    Synthetic A/B Testing using Synthetic Interventions. (arXiv:2006.07691v5 [econ.EM] UPDATED)
    Suppose there are $N$ units and $D$ interventions. We aim to learn the average potential outcome associated with every unit-intervention pair, i.e., $N \times D$ causal parameters. While running $N \times D$ experiments is conceivable, it can be expensive or infeasible. This work introduces an experiment design, synthetic A/B testing, and the synthetic interventions (SI) estimator to recover all $N \times D$ causal parameters while observing each unit under at most two interventions, independent of $D$. Under a novel tensor factor model for potential outcomes across units, measurements, and interventions, we establish the identification of each parameter. Further, we show the SI estimator is finite-sample consistent and asymptotically normal. Collectively, these also lead to novel results for panel data settings, particularly for synthetic controls. We empirically validate our experiment design using real e-commerce data from a large-scale A/B test.  ( 2 min )
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v2 [stat.ML] UPDATED)
    We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.  ( 2 min )
    FedPop: A Bayesian Approach for Personalised Federated Learning. (arXiv:2206.03611v2 [cs.LG] UPDATED)
    Personalised federated learning (FL) aims at collaboratively learning a machine learning model tailored for each client. Although promising advances have been made in this direction, most existing approaches do not allow for uncertainty quantification, which is crucial in many applications. In addition, personalisation in the cross-device setting still involves important issues, especially for new clients or those having a small number of observations. This paper aims at filling these gaps. To this end, we propose a novel methodology, coined FedPop, by recasting personalised FL into the population modeling paradigm, where clients' models involve fixed common population parameters and random effects, aiming at explaining data heterogeneity. To derive convergence guarantees for our scheme, we introduce a new class of federated stochastic optimisation algorithms which relies on Markov chain Monte Carlo methods. Compared to existing personalised FL methods, the proposed methodology has important benefits: it is robust to client drift, practical for inference on new clients, and above all, enables uncertainty quantification under mild computational and memory overheads. We provide non-asymptotic convergence guarantees for the proposed algorithms and illustrate their performance on various personalised federated learning tasks.  ( 2 min )
    Constrained Clustering: General Pairwise and Cardinality Constraints. (arXiv:1907.10410v2 [cs.LG] UPDATED)
    We study constrained clustering, where constraints guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. Most existing constrained clustering models can only utilize one category of constraints at a time. We incorporate both categories into a unified clustering model, starting with the integer program formulation of standard K-means. As the two categories provide different useful information, utilizing both allows for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To solve this, we utilize two techniques: the first equivalently replaces the binary constraints by the intersection of two continuous constraints; the second transforms the quadratic constraints into bi-linear constraints by introducing extra variables. We derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by the Alternating Direction Method of Multipliers. Extensive experiments on both synthetic and real data demonstrate that: (1) when utilizing a single category of constraint, the proposed model is superior to or competitive with SOTA constrained clustering models, and (2) when utilizing both categories of constraints jointly, the proposed model performs better than with a single category. The experiments show that the proposed method exploits the constraints to improve clustering performance by 2%-5% in classical clustering metrics (e.g., the Adjusted Rand, Mirkin's, and Huber's indices), outperforming other methods.  ( 2 min )
    Fast Bayesian Inference with Batch Bayesian Quadrature via Kernel Recombination. (arXiv:2206.04734v4 [cs.LG] UPDATED)
    Calculation of Bayesian posteriors and model evidences typically requires numerical integration. Bayesian quadrature (BQ), a surrogate-model-based approach to numerical integration, is capable of superb sample efficiency, but its lack of parallelisation has hindered its practical applications. In this work, we propose a parallelised (batch) BQ method, employing techniques from kernel quadrature, that possesses an empirically exponential convergence rate. Additionally, just as with Nested Sampling, our method permits simultaneous inference of both posteriors and model evidence. Samples from our BQ surrogate model are re-selected to give a sparse set of samples, via a kernel recombination algorithm, requiring negligible additional time to increase the batch size. Empirically, we find that our approach significantly outperforms the sampling efficiency of both state-of-the-art BQ techniques and Nested Sampling in various real-world datasets, including lithium-ion battery analytics.  ( 2 min )
    Lifelong Reinforcement Learning with Modulating Masks. (arXiv:2212.11110v2 [cs.LG] UPDATED)
    Lifelong learning aims to create AI systems that continuously and incrementally learn during a lifetime, similar to biological learning. Attempts so far have met problems, including catastrophic forgetting, interference among tasks, and the inability to exploit previous knowledge. While considerable research has focused on learning multiple input distributions, typically in classification, lifelong reinforcement learning (LRL) must also deal with variations in the state and transition distributions, and in the reward functions. Modulating masks, recently developed for classification, are particularly suitable to deal with such a large spectrum of task variations. In this paper, we adapted modulating masks to work with deep LRL, specifically PPO and IMPALA agents. The comparison with LRL baselines in both discrete and continuous RL tasks shows superior performance. We further investigated the use of a linear combination of previously learned masks to exploit previous knowledge when learning new tasks: not only is learning faster, the algorithm solves tasks that we could not otherwise solve from scratch due to extremely sparse rewards. The results suggest that RL with modulating masks is a promising approach to lifelong learning, to the composition of knowledge to learn increasingly complex tasks, and to knowledge reuse for efficient and faster learning.  ( 2 min )
    DAG Learning on the Permutahedron. (arXiv:2301.11898v1 [cs.LG])
    We propose a continuous optimization framework for discovering a latent directed acyclic graph (DAG) from observational data. Our approach optimizes over the polytope of permutation vectors, the so-called Permutahedron, to learn a topological ordering. Edges can be optimized jointly, or learned conditional on the ordering via a non-differentiable subroutine. Compared to existing continuous optimization approaches our formulation has a number of advantages including: 1. validity: optimizes over exact DAGs as opposed to other relaxations optimizing approximate DAGs; 2. modularity: accommodates any edge-optimization procedure, edge structural parameterization, and optimization loss; 3. end-to-end: either alternately iterates between node-ordering and edge-optimization, or optimizes them jointly. We demonstrate, on real-world data problems in protein-signaling and transcriptional network discovery, that our approach lies on the Pareto frontier of two key metrics, the SID and SHD.  ( 2 min )
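The "validity" point above rests on a simple fact: once a node ordering is fixed, keeping only edges that point from earlier to later nodes always yields an exact DAG. The sketch below illustrates that fact only; it does not reproduce the paper's optimization over the Permutahedron, and the names `order_from_scores`, `mask_to_dag`, and `is_acyclic`, as well as the example numbers, are hypothetical.

```python
def order_from_scores(scores):
    """Return node indices sorted by score: a candidate topological order."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


def mask_to_dag(weights, order):
    """Zero every edge i -> j unless node i precedes node j in `order`."""
    rank = {node: pos for pos, node in enumerate(order)}
    n = len(weights)
    return [[weights[i][j] if rank[i] < rank[j] else 0.0 for j in range(n)]
            for i in range(n)]


def is_acyclic(adj):
    """Kahn's algorithm: the graph is a DAG iff every node can be peeled off."""
    n = len(adj)
    indeg = [sum(1 for i in range(n) if adj[i][j] != 0) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    seen = 0
    while queue:
        i = queue.pop()
        seen += 1
        for j in range(n):
            if adj[i][j] != 0:
                indeg[j] -= 1
                if indeg[j] == 0:
                    queue.append(j)
    return seen == n


# Hypothetical per-node scores and a dense (cyclic) weighted adjacency matrix.
scores = [0.7, -1.2, 0.1, 2.4]
dense = [[0.0, 0.5, -0.3, 0.8],
         [0.2, 0.0, 0.9, -0.4],
         [0.6, -0.7, 0.0, 0.1],
         [-0.5, 0.3, 0.4, 0.0]]

order = order_from_scores(scores)
dag = mask_to_dag(dense, order)
```

Because the mask depends only on the ordering, any edge-weight procedure can be plugged in before masking, which is one way to read the modularity claim.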
    Aleatoric and Epistemic Discrimination in Classification. (arXiv:2301.11781v1 [cs.LG])
    Machine learning (ML) models can underperform on certain population groups due to choices made during model development and bias inherent in the data. We categorize sources of discrimination in the ML pipeline into two classes: aleatoric discrimination, which is inherent in the data distribution, and epistemic discrimination, which is due to decisions during model development. We quantify aleatoric discrimination by determining the performance limits of a model under fairness constraints, assuming perfect knowledge of the data distribution. We demonstrate how to characterize aleatoric discrimination by applying Blackwell's results on comparing statistical experiments. We then quantify epistemic discrimination as the gap between a model's accuracy given fairness constraints and the limit posed by aleatoric discrimination. We apply this approach to benchmark existing interventions and investigate fairness risks in data with missing values. Our results indicate that state-of-the-art fairness interventions are effective at removing epistemic discrimination. However, when data has missing values, there is still significant room for improvement in handling aleatoric discrimination.  ( 2 min )
    From Classification Accuracy to Proper Scoring Rules: Elicitability of Probabilistic Top List Predictions. (arXiv:2301.11797v1 [stat.ME])
    In the face of uncertainty, the need for probabilistic assessments has long been recognized in the literature on forecasting. In classification, however, comparative evaluation of classifiers often focuses on predictions specifying a single class through the use of simple accuracy measures, which disregard any probabilistic uncertainty quantification. I propose probabilistic top lists as a novel type of prediction in classification, which bridges the gap between single-class predictions and predictive distributions. The probabilistic top list functional is elicitable through the use of strictly consistent evaluation metrics. The proposed evaluation metrics are based on symmetric proper scoring rules and admit comparison of various types of predictions ranging from single-class point predictions to fully specified predictive distributions. The Brier score yields a metric that is particularly well suited for this kind of comparison.  ( 2 min )
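As a small illustration of why the Brier score can compare different prediction types, the sketch below scores both a one-hot (single-class) prediction and a full predictive distribution with the same multiclass Brier formula. It does not implement the paper's top-list metric, whose exact construction is not given in the abstract; the function name `brier_score` and the example values are assumptions.

```python
def brier_score(probs, labels):
    """Mean over examples of sum_k (p_k - 1[y == k])^2; lower is better."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(labels)


labels = [0]

# A single-class point prediction can be scored as a one-hot distribution.
correct_point = brier_score([[1.0, 0.0]], labels)
wrong_point = brier_score([[0.0, 1.0]], labels)

# A full predictive distribution that hedges between the two classes.
hedged = brier_score([[0.5, 0.5]], labels)
```

A confident wrong answer is penalized most, a hedged distribution sits in between, and a confident correct answer scores zero, so all three prediction types land on one scale.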
    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing. (arXiv:2301.11500v1 [cs.LG])
    It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the greedy low-rank learning heuristics (Li et al., 2020) and follows an incremental learning procedure (Gissin et al., 2019): GD sequentially learns solutions with increasing ranks until it recovers the ground truth matrix. Compared to existing works which only analyze the first learning phase for rank-1 solutions, our result provides characterizations for the whole learning process. Moreover, besides the over-parameterized regime that many prior works focused on, our analysis of the incremental learning procedure also applies to the under-parameterized regime. Finally, we conduct numerical experiments to confirm our theoretical findings.  ( 2 min )
    DBGSL: Dynamic Brain Graph Structure Learning. (arXiv:2209.13513v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have shown success at learning representations of brain graphs derived from functional magnetic resonance imaging (fMRI) data. The majority of existing GNN methods, however, assume brain graphs are static over time and the graph adjacency matrix is known prior to model training. These assumptions are at odds with neuroscientific evidence that brain graphs are time-varying with a connectivity structure that depends on the choice of functional connectivity measure. Noisy brain graphs that do not truly represent the underlying fMRI data can have a detrimental impact on the performance of GNNs. As a solution, we propose Dynamic Brain Graph Structure Learning (DBGSL), a novel method for learning the optimal time-varying dependency structure of fMRI data induced by a downstream prediction task. Experiments demonstrate DBGSL achieves state-of-the-art performance for sex classification using real-world resting-state and task fMRI data. Moreover, analysis of the learnt dynamic graphs highlights prediction-related brain regions which align with existing neuroscience literature.  ( 2 min )
    Neural Additive Models for Location Scale and Shape: A Framework for Interpretable Neural Regression Beyond the Mean. (arXiv:2301.11862v1 [stat.ML])
    Deep neural networks (DNNs) have proven to be highly effective in a variety of tasks, making them the go-to method for problems requiring high-level predictive power. Despite this success, the inner workings of DNNs are often not transparent, making them difficult to interpret or understand. This lack of interpretability has led to increased research on inherently interpretable neural networks in recent years. Models such as Neural Additive Models (NAMs) achieve visual interpretability through the combination of classical statistical methods with DNNs. However, these approaches only concentrate on mean response predictions, leaving out other properties of the response distribution of the underlying data. We propose Neural Additive Models for Location Scale and Shape (NAMLSS), a modelling framework that combines the predictive power of classical deep learning models with the inherent advantages of distributional regression while maintaining the interpretability of additive models.  ( 2 min )
    Myriad: a real-world testbed to bridge trajectory optimization and deep learning. (arXiv:2202.10600v2 [cs.LG] UPDATED)
    We present Myriad, a testbed written in JAX for learning and planning in real-world continuous environments. The primary contributions of Myriad are threefold. First, Myriad provides machine learning practitioners access to trajectory optimization techniques for application within a typical automatic differentiation workflow. Second, Myriad presents many real-world optimal control problems, ranging from biology to medicine to engineering, for use by the machine learning community. Formulated in continuous space and time, these environments retain some of the complexity of real-world systems often abstracted away by standard benchmarks. As such, Myriad strives to serve as a stepping stone towards application of modern machine learning techniques for impactful real-world tasks. Finally, we use the Myriad repository to showcase a novel approach for learning and control tasks. Trained in a fully end-to-end fashion, our model leverages an implicit planning module over neural ordinary differential equations, enabling simultaneous learning and planning with complex environment dynamics.  ( 2 min )
    Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits. (arXiv:2210.00025v2 [cs.LG] UPDATED)
    How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to computational and storage issues that are particularly salient in continuous action spaces. We propose Artificial Replay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. Artificial Replay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on $K$-armed and continuous combinatorial bandit algorithms, including a green security domain using real poaching data. We show the practical benefits of Artificial Replay, including for base algorithms that do not satisfy IIData.  ( 2 min )
    Overparameterized Linear Regression under Adversarial Attacks. (arXiv:2204.06274v2 [stat.ML] UPDATED)
    We study the error of linear regression in the face of adversarial attacks. In this framework, an adversary changes the input to the regression model in order to maximize the prediction error. We provide bounds on the prediction error in the presence of an adversary as a function of the parameter norm and the error in the absence of such an adversary. We show how these bounds make it possible to study the adversarial error using analysis from non-adversarial setups. The obtained results shed light on the robustness of overparameterized linear models to adversarial attacks. Adding features might be either a source of additional robustness or brittleness. On the one hand, we use asymptotic results to illustrate how double-descent curves can be obtained for the adversarial error. On the other hand, we derive conditions under which the adversarial error can grow to infinity as more features are added, while at the same time, the test error goes to zero. We show this behavior is caused by the fact that the norm of the parameter vector grows with the number of features. It is also established that $\ell_\infty$ and $\ell_2$-adversarial attacks might behave fundamentally differently due to how the $\ell_1$ and $\ell_2$-norms of random projections concentrate. We also show how our reformulation allows for solving adversarial training as a convex optimization problem. This fact is then exploited to establish similarities between adversarial training and parameter-shrinking methods and to study how the training might affect the robustness of the estimated models.  ( 2 min )
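The link between the attack norm and the parameter norm can be seen from a standard closed form: for a linear model, the worst-case squared error under an $\ell_p$-bounded input perturbation of radius `delta` is `(|y - x.beta| + delta * ||beta||_q)^2`, where `q` is the dual of `p` (so $\ell_\infty$ attacks pair with the $\ell_1$ norm and $\ell_2$ attacks with the $\ell_2$ norm). The sketch below checks the $\ell_\infty$ case against a brute-force search over the corners of the perturbation box. Function names and example numbers are hypothetical, and this is an illustration of the identity rather than the paper's analysis.

```python
import math
from itertools import product


def adversarial_sq_error_linf(x, y, beta, delta):
    """Closed form for l_inf attacks: (|y - x.beta| + delta * ||beta||_1)^2."""
    residual = y - sum(xi * bi for xi, bi in zip(x, beta))
    return (abs(residual) + delta * sum(abs(b) for b in beta)) ** 2


def adversarial_sq_error_l2(x, y, beta, delta):
    """Closed form for l_2 attacks: (|y - x.beta| + delta * ||beta||_2)^2."""
    residual = y - sum(xi * bi for xi, bi in zip(x, beta))
    return (abs(residual) + delta * math.sqrt(sum(b * b for b in beta))) ** 2


def brute_force_linf(x, y, beta, delta):
    """Maximize the squared error over the corners of the l_inf ball.

    The squared error is convex in the perturbation, so its maximum over
    the box is attained at one of the 2^d corners.
    """
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=len(x)):
        pred = sum((xi + delta * s) * bi for xi, s, bi in zip(x, signs, beta))
        best = max(best, (y - pred) ** 2)
    return best


x = [1.0, -2.0, 0.5]
beta = [0.3, 0.1, -0.4]
y = 0.2
delta = 0.1

closed_linf = adversarial_sq_error_linf(x, y, beta, delta)
brute_linf = brute_force_linf(x, y, beta, delta)
closed_l2 = adversarial_sq_error_l2(x, y, beta, delta)
```

Since `||beta||_1 >= ||beta||_2`, the $\ell_\infty$ worst case is never smaller than the $\ell_2$ one at the same radius, and the gap widens as the parameter vector spreads over more features.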
    ActiveLab: Active Learning with Re-Labeling by Multiple Annotators. (arXiv:2301.11856v1 [cs.LG])
    In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.  ( 2 min )
    Exponential tail bounds and Large Deviation Principle for Heavy-Tailed U-Statistics. (arXiv:2301.11563v1 [math.PR])
    We study deviations of U-statistics when samples have a heavy-tailed distribution, so that the kernel of the U-statistic does not have bounded exponential moments at any positive point. We obtain an exponential upper bound for the tail of the U-statistics which clearly exhibits two regions of tail decay: the first is a Gaussian decay and the second behaves like the tail of the kernel. For several common U-statistics, we also show that the upper bound has the right rate of decay as well as sharp constants by obtaining rough logarithmic limits, which in turn can be used to develop a large deviation principle (LDP) for U-statistics. In contrast to the usual LDP results in the literature, the processes we consider in this work have an LDP speed slower than their sample size $n$.  ( 2 min )
    Leveraging the Third Dimension in Contrastive Learning. (arXiv:2301.11790v1 [cs.CV])
    Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the \emph{Depth Prediction Transformer}, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods -- BYOL, SimSiam, and SwAV -- using ImageNette (10 class subset of ImageNet), ImageNet-100 and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3\% to 88.0\% on ImageNette and 84.1\% to 87.0\% on ImageNet-C.  ( 2 min )
    Variance, Self-Consistency, and Arbitrariness in Fair Classification. (arXiv:2301.11562v1 [cs.LG])
    In fair classification, it is common to train a model, and to compare and correct subgroup-specific error rates for disparities. However, even if a model's classification decisions satisfy a fairness metric, it is not necessarily the case that these decisions are equally confident. This becomes clear if we measure variance: We can fix everything in the learning process except the subset of training data, train multiple models, measure (dis)agreement in predictions for each test example, and interpret disagreement to mean that the learning process is more unstable with respect to its classification decision. Empirically, some decisions can in fact be so unstable that they are effectively arbitrary. To reduce this arbitrariness, we formalize a notion of self-consistency of a learning process, develop an ensembling algorithm that provably increases self-consistency, and empirically demonstrate its utility to often improve both fairness and accuracy. Further, our evaluation reveals a startling observation: Applying ensembling to common fair classification benchmarks can significantly reduce subgroup error rate disparities, without employing common pre-, in-, or post-processing fairness interventions. Taken together, our results indicate that variance, particularly on small datasets, can muddle the reliability of conclusions about fairness. One solution is to develop larger benchmark tasks. To this end, we release a toolkit that makes the Home Mortgage Disclosure Act datasets easily usable for future research.  ( 2 min )
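    The variance measurement described above can be sketched as follows, assuming binary predictions from several models trained on different subsets of the training data. Here self-consistency is computed as the fraction of model pairs that agree on a test example, and a plain majority vote stands in for the ensembling step. Both are illustrative assumptions and may differ from the paper's exact definitions and algorithm.

```python
import numpy as np

def self_consistency(preds):
    """preds: (B, n) array of 0/1 predictions from B models on n test examples.
    Returns, per example, the fraction of model pairs that agree."""
    preds = np.asarray(preds)
    n_models = preds.shape[0]
    ones = preds.sum(axis=0)
    zeros = n_models - ones
    agreeing_pairs = ones * (ones - 1) / 2 + zeros * (zeros - 1) / 2
    return agreeing_pairs / (n_models * (n_models - 1) / 2)

def majority_vote(preds):
    """Simple ensemble prediction (ties go to class 1)."""
    preds = np.asarray(preds)
    return (preds.mean(axis=0) >= 0.5).astype(int)

# Hypothetical predictions: 4 models, 3 test examples.
preds = [[1, 0, 1],
         [1, 0, 0],
         [1, 1, 1],
         [1, 0, 0]]
print(self_consistency(preds))
print(majority_vote(preds))
```

    Examples whose self-consistency is close to the minimum are the ones where the individual decision is effectively arbitrary.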
    Causal Bandits without Graph Learning. (arXiv:2301.11401v1 [stat.ML])
    We study the causal bandit problem when the causal graph is unknown and develop an efficient algorithm for finding the parent node of the reward node using atomic interventions. We derive the exact equation for the expected number of interventions performed by the algorithm and show that under certain graphical conditions it could perform either logarithmically fast or, under more general assumptions, slower but still sublinearly in the number of variables. We formally show that our algorithm is optimal as it meets the universal lower bound we establish for any algorithm that performs atomic interventions. Finally, we extend our algorithm to the case when the reward node has multiple parents. Using this algorithm together with a standard algorithm from bandit literature leads to improved regret bounds.  ( 2 min )
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v5 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
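    For readers unfamiliar with the two methods, here is a minimal NumPy sketch of one SAM update and of SWA averaging, assuming a generic `grad_fn` that returns the loss gradient. The function names and hyperparameter values are illustrative assumptions, not the benchmark code from the paper.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: take the gradient at a point
    perturbed towards higher loss, then apply it to the original weights."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction of length rho
    return w - lr * grad_fn(w + eps)

def swa_average(checkpoints):
    """Stochastic Weight Averaging: average weights collected along training."""
    return np.mean(np.stack(checkpoints), axis=0)

# Toy quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
grad_fn = lambda w: w
w = np.array([3.0, 4.0])
checkpoints = []
for _ in range(5):
    w = sam_step(w, grad_fn)
    checkpoints.append(w.copy())
print(w, swa_average(checkpoints))
```

    SAM changes the update direction at every step, whereas SWA leaves the optimizer untouched and only averages its iterates, so the two can in principle be combined.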
    Multi-dimensional concept discovery (MCD): A unifying framework with completeness guarantees. (arXiv:2301.11911v1 [cs.LG])
    The completeness axiom renders the explanation of a post-hoc XAI method only locally faithful to the model, i.e. for a single decision. For the trustworthy application of XAI, in particular for high-stakes decisions, a more global model understanding is required. Recently, concept-based methods have been proposed, which are however not guaranteed to be bound to the actual model reasoning. To circumvent this problem, we propose Multi-dimensional Concept Discovery (MCD) as an extension of previous approaches that fulfills a completeness relation on the level of concepts. Our method starts from general linear subspaces as concepts and requires neither reinforcing concept interpretability nor re-training of model parts. We propose sparse subspace clustering to discover improved concepts and fully leverage the potential of multi-dimensional subspaces. MCD offers two complementary analysis tools for concepts in input space: (1) concept activation maps, that show where a concept is expressed within a sample, allowing for concept characterization through prototypical samples, and (2) concept relevance heatmaps, that decompose the model decision into concept contributions. Both tools together enable a detailed understanding of the model reasoning, which is guaranteed to relate to the model via a completeness relation. This paves the way towards more trustworthy concept-based XAI. We empirically demonstrate the superiority of MCD against more constrained concept definitions.  ( 2 min )
    Convolutional neural networks for valid and efficient causal inference. (arXiv:2301.11732v1 [stat.ML])
    Convolutional neural networks (CNN) have been successful in machine learning applications. Their success relies on their ability to consider space invariant local features. We consider the use of CNN to fit nuisance models in semiparametric estimation of the average causal effect of a treatment. In this setting, nuisance models are functions of pre-treatment covariates that need to be controlled for. In an application where we want to estimate the effect of early retirement on a health outcome, we propose to use CNN to control for time-structured covariates. Thus, CNN is used when fitting nuisance models explaining the treatment and the outcome. These fits are then combined into an augmented inverse probability weighting estimator yielding efficient and uniformly valid inference. Theoretically, we contribute by providing rates of convergence for CNN equipped with the rectified linear unit activation function and compare it to an existing result for feedforward neural networks. We also show when those rates guarantee uniformly valid inference. A Monte Carlo study is provided where the performance of the proposed estimator is evaluated and compared with other strategies. Finally, we give results on a study of the effect of early retirement on hospitalization using data covering the whole Swedish population.  ( 2 min )
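    The augmented inverse probability weighting (AIPW) step can be sketched in its textbook form, assuming the nuisance fits are already available as arrays: `e_hat` for the propensity score and `m1_hat`, `m0_hat` for the outcome regressions under treatment and control. In the paper these fits come from CNNs; the argument names and toy data here are illustrative assumptions.

```python
import numpy as np

def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Textbook AIPW estimate of the average causal effect with a
    plug-in standard error based on the estimated influence scores."""
    y, t, e_hat, m1_hat, m0_hat = (np.asarray(a, dtype=float)
                                   for a in (y, t, e_hat, m1_hat, m0_hat))
    scores = (m1_hat - m0_hat
              + t * (y - m1_hat) / e_hat
              - (1 - t) * (y - m0_hat) / (1 - e_hat))
    estimate = scores.mean()
    std_error = scores.std(ddof=1) / np.sqrt(len(scores))
    return estimate, std_error

# Hypothetical toy data in which the outcome models fit perfectly.
y = [3.0, 1.0, 4.0, 2.0]
t = [1, 0, 1, 0]
e_hat = [0.5, 0.5, 0.5, 0.5]
m1_hat = [3.0, 3.0, 4.0, 4.0]
m0_hat = [1.0, 1.0, 2.0, 2.0]
print(aipw_ate(y, t, e_hat, m1_hat, m0_hat))
```

    The residual correction terms are what make the estimator robust to moderate errors in either nuisance fit, which is why convergence rates for the CNN fits matter for valid inference.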
    MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. (arXiv:2206.07697v2 [stat.ML] UPDATED)
    Creating fast and accurate force fields is a long-standing challenge in computational chemistry and materials science. Recently, several equivariant message passing neural networks (MPNNs) have been shown to outperform models built using other approaches in terms of accuracy. However, most MPNNs suffer from high computational cost and poor scalability. We propose that these limitations arise because MPNNs only pass two-body messages leading to a direct relationship between the number of layers and the expressivity of the network. In this work, we introduce MACE, a new equivariant MPNN model that uses higher body order messages. In particular, we show that using four-body messages reduces the required number of message passing iterations to just two, resulting in a fast and highly parallelizable model, reaching or exceeding state-of-the-art accuracy on the rMD17, 3BPA, and AcAc benchmark tasks. We also demonstrate that using higher order messages leads to an improved steepness of the learning curves.  ( 2 min )
    Distributionally Robust Multi-objective Bayesian Optimization under Uncertain Environments. (arXiv:2301.11588v1 [stat.ML])
    In this study, we address the problem of optimizing multi-output black-box functions under uncertain environments. We formulate this problem as the estimation of the uncertain Pareto-frontier (PF) of a multi-output Bayesian surrogate model with two types of variables: design variables and environmental variables. We consider this problem within the context of Bayesian optimization (BO) under uncertain environments, where the design variables are controllable, whereas the environmental variables are assumed to be random and not controllable. The challenge of this problem is to robustly estimate the PF when the distribution of the environmental variables is unknown, that is, to estimate the PF when the environmental variables are generated from the worst possible distribution. We propose a method for solving the BO problem by appropriately incorporating the uncertainties of the environmental variables and their probability distribution. We demonstrate that the proposed method can find an arbitrarily accurate PF with high probability in a finite number of iterations. We also evaluate the performance of the proposed method through numerical experiments.  ( 2 min )
    Distilling Importance Sampling for Likelihood Free Inference. (arXiv:1910.03632v6 [stat.CO] UPDATED)
    Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code which takes parameters, performs stochastic calculations, and outputs simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws. We attempt to infer all these inputs conditional on the observations. This is challenging as the resulting posterior can be high dimensional and involve strong dependence. We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data is generated by likelihood-free importance sampling with a large bandwidth value epsilon, which makes the target similar to the prior. The training data is "distilled" by using it to train an updated normalizing flow. The process is iterated, using the updated flow as the importance sampling proposal, and slowly reducing epsilon so the target becomes closer to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce data to low dimensional summary statistics, and hence can achieve more accurate results. We illustrate our method in two challenging examples, on queuing and epidemiology.  ( 2 min )
    Multilayer hypergraph clustering using the aggregate similarity matrix. (arXiv:2301.11657v1 [math.ST])
    We consider the community recovery problem on a multilayer variant of the hypergraph stochastic block model (HSBM). Each layer is associated with an independent realization of a d-uniform HSBM on N vertices. Given the aggregated number of hyperedges incident to each pair of vertices, represented using a similarity matrix, the goal is to obtain a partition of the N vertices into disjoint communities. In this work, we investigate a semidefinite programming (SDP) approach and obtain information-theoretic conditions on the model parameters that guarantee exact recovery both in the assortative and the disassortative cases.  ( 2 min )
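    The aggregate similarity matrix can be built as in the following sketch: for every layer and every hyperedge, each pair of vertices inside the hyperedge has its count incremented by one. The input format (a list of layers, each a list of vertex tuples) is an assumption made for illustration.

```python
import numpy as np
from itertools import combinations

def aggregate_similarity(layers, n_vertices):
    """Entry (i, j) counts the hyperedges, summed over all layers,
    that contain both vertex i and vertex j."""
    A = np.zeros((n_vertices, n_vertices), dtype=int)
    for hyperedges in layers:
        for edge in hyperedges:
            for i, j in combinations(sorted(set(edge)), 2):
                A[i, j] += 1
                A[j, i] += 1
    return A

# Hypothetical example: two layers of a 3-uniform hypergraph on 4 vertices.
layers = [[(0, 1, 2)], [(0, 1, 3), (1, 2, 3)]]
print(aggregate_similarity(layers, n_vertices=4))
```

    A matrix of this kind is the sole input to the recovery procedure, so the identity of the individual hyperedges and layers is discarded after aggregation.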
    Machine Learning Approach and Extreme Value Theory to Correlated Stochastic Time Series with Application to Tree Ring Data. (arXiv:2301.11488v1 [stat.ML])
    The main goal of machine learning (ML) is to study and improve mathematical models which can be trained with data provided by the environment to infer the future and to make decisions without necessarily having complete knowledge of all influencing elements. In this work, we describe how ML can be a powerful tool for climate modeling. Tree ring growth has been used in a range of applications, for example, studying the history of buildings and the environment. As a tree grows over time, it adds a new layer of wood beneath its bark. After years of growth, a time series can be formed from the sequence of tree ring widths. The purpose of this paper is to use ML algorithms and Extreme Value Theory to analyse a set of tree ring width data from nine trees growing in Nottinghamshire. Initially, we explore the data through a variety of descriptive statistical approaches. Transforming the data is important at this stage to uncover any problems for the modelling algorithms. We then use algorithm tuning and ensemble methods to improve the k-nearest neighbors (KNN) algorithm. We compare the method developed in this study with other methods. The extreme values of the dataset are also investigated further. The results show that the Random Forest method gives accurate results in the analysis of the tree ring width data, with the lowest Root Mean Square Error value. Also, we notice that as the assumed ARMA model parameters increased, the probability of selecting the true model also increased. In terms of the Extreme Value Theory, the Weibull distribution would be a good choice to model tree ring data.  ( 2 min )
    LegendreTron: Uprising Proper Multiclass Loss Learning. (arXiv:2301.11695v1 [stat.ML])
    Loss functions serve as the foundation of supervised learning and are often chosen prior to model development. To avoid potentially ad hoc choices of losses, statistical decision theory describes a desirable property for losses known as \emph{properness}, which asserts that Bayes' rule is optimal. Recent works have sought to \emph{learn losses} and models jointly. Existing methods do this by fitting an inverse canonical link function which monotonically maps $\mathbb{R}$ to $[0,1]$ to estimate probabilities for binary problems. In this paper, we extend monotonicity to maps between $\mathbb{R}^{C-1}$ and the projected probability simplex $\tilde{\Delta}^{C-1}$ by using monotonicity of gradients of convex functions. We present {\sc LegendreTron} as a novel and practical method that jointly learns \emph{proper canonical losses} and probabilities for multiclass problems. Tested on a benchmark of domains with up to 1,000 classes, our experimental results show that our method consistently outperforms the natural multiclass baseline under a $t$-test at 99% significance on all datasets with greater than 10 classes.  ( 2 min )
    Rigid body flows for sampling molecular crystal structures. (arXiv:2301.11355v1 [cs.LG])
    Normalizing flows (NF) are a class of powerful generative models that have gained popularity in recent years due to their ability to model complex distributions with high flexibility and expressiveness. In this work, we introduce a new type of normalizing flow that is tailored for modeling positions and orientations of multiple objects in three-dimensional space, such as molecules in a crystal. Our approach is based on two key ideas: first, we define smooth and expressive flows on the group of unit quaternions, which allows us to capture the continuous rotational motion of rigid bodies; second, we use the double cover property of unit quaternions to define a proper density on the rotation group. This ensures that our model can be trained using standard likelihood-based methods or variational inference with respect to a thermodynamic target density. We evaluate the method by training Boltzmann generators for two molecular examples, namely the multi-modal density of a tetrahedral system in an external field and the ice XI phase in the TIP4P-Ew water model. Our flows can be combined with flows operating on the internal degrees of freedom of molecules, and constitute an important step towards the modeling of distributions of many interacting molecules.  ( 2 min )
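    The double cover property mentioned above can be checked numerically: a unit quaternion `q` and its negation `-q` map to the same rotation matrix. The sketch below uses the standard quaternion-to-matrix formula with the `(w, x, y, z)` ordering. The symmetrised density helper is one simple way to obtain a function that is well defined on rotations (up to normalisation); it is an assumption for illustration and not necessarily the paper's construction.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix of a quaternion given as (w, x, y, z); q is normalised first."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def symmetrised_density(q, quat_density):
    """Sum the quaternion density over both preimages of a rotation,
    so that q and -q receive the same value (unnormalised)."""
    q = np.asarray(q, dtype=float)
    return quat_density(q) + quat_density(-q)

q = np.array([0.5, 0.5, 0.5, 0.5])  # 120 degree rotation about the (1, 1, 1) axis
print(quat_to_rotmat(q))
print(np.allclose(quat_to_rotmat(q), quat_to_rotmat(-q)))
```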
    Optimally-Weighted Estimators of the Maximum Mean Discrepancy for Likelihood-Free Inference. (arXiv:2301.11674v1 [stat.ME])
    Likelihood-free inference methods typically make use of a distance between simulated and real data. A common example is the maximum mean discrepancy (MMD), which has previously been used for approximate Bayesian computation, minimum distance estimation, generalised Bayesian inference, and within the nonparametric learning framework. The MMD is commonly estimated at a root-$m$ rate, where $m$ is the number of simulated samples. This can lead to significant computational challenges since a large $m$ is required to obtain an accurate estimate, which is crucial for parameter estimation. In this paper, we propose a novel estimator for the MMD with significantly improved sample complexity. The estimator is particularly well suited for computationally expensive smooth simulators with low- to mid-dimensional inputs. This claim is supported through both theoretical results and an extensive simulation study on benchmark simulators.  ( 2 min )
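    For reference, the commonly used baseline estimator of the squared MMD (the V-statistic, which converges at the root-$m$ rate mentioned above) can be sketched as follows. This is the standard estimator, not the optimally weighted one proposed in the paper; the Gaussian kernel and its `lengthscale` are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a (m, d) and b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * lengthscale ** 2))

def mmd2_vstat(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (gaussian_kernel(x, x, lengthscale).mean()
            + gaussian_kernel(y, y, lengthscale).mean()
            - 2 * gaussian_kernel(x, y, lengthscale).mean())

# Hypothetical samples: "simulated" data and two shifted copies as "observed" data.
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(mmd2_vstat(x, x))
print(mmd2_vstat(x, x + 0.1), mmd2_vstat(x, x + 3.0))
```

    Every simulated sample enters with equal weight here, which is the part a weighted estimator would change.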
    Neural networks learn to magnify areas near decision boundaries. (arXiv:2301.11375v1 [cs.LG])
    We study how training molds the Riemannian geometry induced by neural network feature maps. At infinite width, neural networks with random parameters induce highly symmetric metrics on input space. Feature learning in networks trained to perform classification tasks magnifies local areas along decision boundaries. These changes are consistent with previously proposed geometric approaches for hand-tuning of kernel methods to improve generalization.  ( 2 min )
    DBGDGM: Dynamic Brain Graph Deep Generative Model. (arXiv:2301.11408v1 [cs.LG])
    Graphs are a natural representation of brain activity derived from functional magnetic resonance imaging (fMRI) data. It is well known that clusters of anatomical brain regions, known as functional connectivity networks (FCNs), encode temporal relationships which can serve as useful biomarkers for understanding brain function and dysfunction. Previous works, however, ignore the temporal dynamics of the brain and focus on static graphs. In this paper, we propose a dynamic brain graph deep generative model (DBGDGM) which simultaneously clusters brain regions into temporally evolving communities and learns dynamic unsupervised node embeddings. Specifically, DBGDGM represents brain graph nodes as embeddings sampled from a distribution over communities that evolve over time. We parameterise this community distribution using neural networks that learn from subject and node embeddings as well as past community assignments. Experiments demonstrate DBGDGM outperforms baselines in graph generation, dynamic link prediction, and is comparable for graph classification. Finally, an analysis of the learnt community distributions reveals overlap with known FCNs reported in neuroscience literature.  ( 2 min )
    Estimating Causal Effects using a Multi-task Deep Ensemble. (arXiv:2301.11351v1 [cs.LG])
    Over the past few decades, a number of methods have been proposed for causal effect estimation, yet few have been demonstrated to be effective in handling data with complex structures, such as images. To fill this gap, we propose a Causal Multi-task Deep Ensemble (CMDE) framework to learn both shared and group-specific information from the study population and prove its equivalence to a multi-task Gaussian process (GP) with coregionalization kernel a priori. Compared to multi-task GP, CMDE efficiently handles high-dimensional and multi-modal covariates and provides pointwise uncertainty estimates of causal effects. We evaluate our method across various types of datasets and tasks and find that CMDE outperforms state-of-the-art methods on a majority of these tasks.  ( 2 min )
    Robust variance-regularized risk minimization with concomitant scaling. (arXiv:2301.11584v1 [stat.ML])
    Under losses which are potentially heavy-tailed, we consider the task of minimizing sums of the loss mean and standard deviation, without trying to accurately estimate the variance. By modifying a technique for variance-free robust mean estimation to fit our problem setting, we derive a simple learning procedure which can be easily combined with standard gradient-based solvers to be used in traditional machine learning workflows. Empirically, we verify that our proposed approach, despite its simplicity, performs as well as or better than even the best-performing candidates derived from alternative criteria such as CVaR or DRO risks on a variety of datasets.  ( 2 min )

  • Open

    College classes you guys recommend, to help me build and use neural networks.
    For some context, I plan on eventually majoring in neuroscience. I am enrolling in a local state college soon. I want to get my associate's out of the way and then transfer to a college with a good neuroscience program. It seems like neural networks are an exciting avenue for research. There's a lot more neuroscience-related research being published using neural networks. What courses would you guys recommend I enroll in? I feel like the obvious one is software development, but maybe there are some others as well. I'd appreciate any insight, thanks! submitted by /u/daddydilly694-20 [link] [comments]  ( 41 min )
  • Open

    Weekly China AI News: China's Master Plan for Robots; Robots Transform into Liquid to Escape Jail; Li Auto's Goal to Become AI Leader by 2030
    submitted by /u/trcytony [link] [comments]  ( 40 min )
    A Video Made using CHATGPT + MidJourney and Text to Speech
    Hey All! I wrote this story and would appreciate your support on my channel and reviews. You can also give me stories for the channel. Thank You for your time :) https://youtu.be/-aM7cSbFFFY The sky was a deep shade of black as the group of friends made their way to the cabin, the only light coming from the beams of their headlights piercing through the heavy rain. The cabin, located deep in the woods, was said to be haunted, but the friends brushed off the rumors as nothing more than tales meant to scare. As they settled in for the night, they laughed and joked, telling ghost stories and watching horror movies. They passed jokes about how they would run if they saw a ghost, unaware of the danger lurking outside. But their laughter was cut short when Dave stumbled across a letter in the…  ( 46 min )
    AI Dream 150 - MINDBLOW MONDAY - AI Video - FINAL MASTERPIECE
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    The Year of AI Breakthroughs 2022
    submitted by /u/pmz [link] [comments]  ( 40 min )
    What media format will ChatGPT and AI bring back that was previously obsolete?
    I've been thinking a lot about Marshall McLuhan and his 4 laws of media. Specifically, the one that states that all new forms of media cause something to be retrieved from the past. What will ChatGPT and AI revive and retrieve? I put some more thoughts in my blog. Would love to hear your thoughts on it. https://bobhutchins.substack.com/p/what-media-format-will-chatgpt-and submitted by /u/Interesting_Status64 [link] [comments]  ( 41 min )
    Easy-To-Use Voice Models for Fun Round of Trivia
    I'm putting together a trivia night, and one idea I have for a round is "The Pen*s Game" where I replace a word from a famous movie quote with the word "pen*s". I could simply replace it with my own voice, but I don't think it would sound as good/be as funny, so I think it would be amazing to instead use a tool (presumably AI-powered) to achieve this. Is there a reasonably user-friendly way of doing this? Ideally it would be some magical tool that has a nice GUI and celebrity voices already trained, though I doubt this exists... I am well-versed in Python, and am always looking to learn new tools, but I don't have the time to scrape voice data for each actor and train a model (not that I even know where I'd start on that front). As an example, imagine this clip from Shrek except "parfait" is replaced with "pen*s". https://youtu.be/-FtCTW2rVFM?t=86 I don't know this thread's rules, so I censor "pen*s" just in case. Any suggestions would be appreciated! Thank you. submitted by /u/wendeborn8 [link] [comments]  ( 41 min )
    AI APIs Implementation - Stack and recommendations
    Hi everyone! I had this weird idea lately, with the rise of AI to the mainstream, to implement some kind of "Simulator", where one could input a webcam picture and some basic info through a form, and get a short alternate narrated video of one's future. I'm looking for suggestions for open source AI APIs I could use for it. My idea, very simplistically put, is to combine them through inputs, outputs and some basic logic manipulation. I'm also looking for recommendations for the stack I could use. I've worked with NodeJS and Java w/ Spring Boot on backend, & Angular for frontend. Any recommendations based on the APIs? submitted by /u/WhereIsBryan [link] [comments]  ( 41 min )
    RecolorNeRF is like a basic Photoshop for NeRFs
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Google's AI Tool "MusicLM" creates Music based on text descriptions
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    Which AI chat apps can have phone conversations with me?
    Hi. I tried to voice chat with Kuki on Telegram but she still doesn't have this ability. Replika has voice calls. Do you know other AI apps that can already speak using voice? submitted by /u/Trainer_Red99 [link] [comments]  ( 40 min )
    You against the machine: Can you spot which art was created by A.I.?
    submitted by /u/robbinpetertopaypaul [link] [comments]  ( 41 min )
    📌[Searchcolab] "8K Nature" Links in comment
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    With all of this heated arguing over AI, It's time for a more realistic, balanced analysis. As a filmmaker/writer and tech entrepreneur, here's a perspective that hardly anyone is considering and perhaps one that could help bridge the gap between the haters and lovers of AI.
    submitted by /u/CyborgWriter [link] [comments]  ( 41 min )
    📌[Searchcolab] Generative AI is climbing the *Dimensional Ladder*. I made a figure to show the milestones!
    1D: MusicLM, VALL-E 2D: Stable Diffusion, DALL-E, MidJourney 3D (or 2+1D): Imagen-video, Phenaki 3D: Magic3D, DreamFusion, Point-E 4D (or 3+1D): Make-A-Video-3D [Searchcolab] What’s next? 🤔 https://reddit.com/link/10p6vw9/video/gqbnrsaxh7fa1/player submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Is there an AI that you can feed books into, so it learns, then communicates using that info?
    Similar to chatgpt, etc. but it doesn't have all the knowledge, but just from a very special field. E.g. if you would like to create an AI for kids, you feed 1000 kids books into it, and it would respond exclusively in a child-friendly or goofy way. It's simply not possible that it responds in any other way, because it doesn't know about the rest of the world. submitted by /u/EndlessSenseless [link] [comments]  ( 41 min )
    AI Has Successfully Imitated Human Evolution—and Might Do It Even Better
    submitted by /u/Itchy0101 [link] [comments]  ( 40 min )
    Google AI music samples/copyright problems.
    https://ainewsbase.com/google-musiclm-copyright-issues-not-releasing/ The samples they do show might just sound weird because of the stored file or whatever but the sound definitely sounds kinda weird. submitted by /u/SPEEDYFISHY2000 [link] [comments]  ( 40 min )
    3 Books About Artificial Intelligence Everyone Should Read
    I spend about 5 hours a day reading about AI and the news is a lot to keep up with - however there's a lot of big picture AI information that is a good foundation for understanding the future implications of AI. I think these three books offer a good variety of perspective from the philosophical, business, and strategy side of AI. Philosophy: Superintelligence by Nick Bostrom - This S-tier book explores the potential risks and benefits of creating artificial intelligence beyond human levels. The book argues that if the development of superintelligence proceeds rapidly, it could pose an existential threat to humanity. Bostrom paints an incredibly articulate perspective on why we must consider the possibility of creating AI in a way that aligns with human values, and that we must prepare …  ( 44 min )
    Report Says AI Could Potentially Replace 85 Million Jobs Worldwide By 2025 — Are Interns On The List?
    submitted by /u/Mental_Character7367 [link] [comments]  ( 41 min )
    New Artificial Intelligence Subreddit?
    I suggest the need for a new subreddit, where people can announce their creations. You know: "I've created an AI to make s'mores..." "I've created an AI to sort my comic books..." "I've created an AI to find me a girlfriend..." ...ad nauseam. submitted by /u/PredictorX1 [link] [comments]  ( 40 min )
    AI and construction industry
    Hello guys, Have you found any applications of AI in the construction industry (aside from design/modeling)? I've been following the industry for ages, and I believe there is so much to be done, particularly with AI, as it can manage dependencies. I have quite a few ideas too; get in touch if you are interested in discussing this :) submitted by /u/MexsEU [link] [comments]  ( 41 min )
    “Nothing Forever”, — AI-generated, always streaming parody of ‘90s sitcoms
    submitted by /u/tinylobsta [link] [comments]  ( 41 min )
    ChatGPT Surpasses Instagram With 10 Million Daily Users In Just 40 Days
    submitted by /u/liquidocelotYT [link] [comments]  ( 42 min )
    Vector animals bundle
    submitted by /u/annal201 [link] [comments]  ( 40 min )
    AI is becoming a commodity
    submitted by /u/_utisz_ [link] [comments]  ( 40 min )
  • Open

    [D] Is the YoloR paper worth looking into?
    Doing a survey of object detection papers with plausible application to pose-estimation tasks. Came across the paper "You Only Learn One Representation" and, while the theory seems interesting, I want to hear people's opinions before doing a deep dive into the theory. submitted by /u/answersareallyouneed [link] [comments]  ( 42 min )
    [D] Towards A Token-Free Future In NLP
    https://peltarion.com/blog/data-science/towards-a-token-free-future-in-nlp submitted by /u/EducationalCicada [link] [comments]  ( 42 min )
    [P] I launched “CatchGPT”, a supervised model trained with millions of text examples, to detect GPT created content
    I’m an ML Engineer at Hive AI and I’ve been working on a ChatGPT Detector. Here is a free demo we have up: https://hivemoderation.com/ai-generated-content-detection From our benchmarks it’s significantly better than similar solutions like GPTZero and OpenAI’s GPT2 Output Detector. On our internal datasets, we’re seeing balanced accuracies of >99% for our own model compared to around 60% for GPTZero and 84% for OpenAI’s GPT2 Detector. Feel free to try it out and let us know if you have any feedback! submitted by /u/qthai912 [link] [comments]  ( 56 min )
    [D] Are there neural net plugins to assist audio editing of Youtube screencasts?
    In order to improve my talking skills, I am doing a little series on how to set up Stable Diffusion on Paperspace, and I am astounded by how much time the audio editing takes. Part of the reason is that I've only been doing this for 3 days and my process is very inefficient, but it feels like neural nets should by now be able to do things like remove uhms, lip smacking and breath intakes. I've looked around, and this post from 9 years ago says the only choice is to edit it by hand. Is that still true? submitted by /u/abstractcontrol [link] [comments]  ( 43 min )
    [D] What's stopping you from working on speech and voice?
    I've been working in the speech and voice space for a while now and am now building out some tooling in the space to make it easier for researchers/engineers/developers to build speech processing systems and features; I'd love to hear what people in ML struggle with when you're trying to build or work with speech processing for your projects/products (beyond speech-to-text APIs) submitted by /u/jiamengial [link] [comments]  ( 47 min )
    [D] DL university research PC suggestions?
    I am a researcher at a US university and have a budget of 25k to build a PC for training various ML algorithms (e.g. DRL, neuromorphic computing, VAE, etc). I'm trying to decide between going for prebuilds (like https://lambdalabs.com/gpu-workstations/vector) or building with consumer cards like 4090s. Any advice on which gives the most bang for the buck? I'm not sure how much I'm giving up by going for consumer 24 GB cards vs. A6000 or 6000 Ada, but prebuilt prices go up quickly. Warranty vs. building it myself isn't an issue submitted by /u/seanrescs [link] [comments]  ( 44 min )
    [R] Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models - Stanford University Eric Zelikman et al - Beats prior code generation sota by over 75%!
    Paper: https://arxiv.org/abs/2212.10561 Github: https://github.com/ezelikman/parsel Twitter: https://twitter.com/ericzelikman/status/1618426056163356675?s=20 Website: https://zelikman.me/parselpaper/ Code Generation on APPS Leaderboard: https://paperswithcode.com/sota/code-generation-on-apps Abstract: Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be …  ( 45 min )
    [P] Keras model production deployment
    Hi guys. It's been some time since I started developing my Keras models, but this is the first time I am trying to push one to production. My Keras model looks like this: model = Sequential() model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(Bidirectional(LSTM(256, return_sequences=True))) model.add(TimeDistributed(Dense(1, activation='sigmoid'))) model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) My problem is I need to run through about 25 of these for every written sentence. There is going to be an online editor, where users can paste text for my analysis. That means up to about 300 words or about 20 sentences at once. With the current time to run each network (about 0.2s), that means 25 * 0.2 * 20 or about 100s per user input. I am going for 30 seconds at most with potentially dozens of users at once. Ideally on a Raspberry Pi 4. The internet is surely gonna back me up, I thought to myself, and started googling. If only I had known what kind of rabbit hole I was about to fall into. First I converted my Keras model into a TensorFlow frozen graph model. 10x time improvement on CPU, but still at 0.2s on average. Another thing I think may boost the performance is retraining the models for variable input shape (currently I always feed in 50 values). With the average sentence size of 16 words this may, from what I understand, lead to a 3 times boost? My question is: now what? What can I do to make it faster? Is it even possible to run it on a Raspberry Pi 4 and get reasonable response times? If not, what is my best option on a tight budget? submitted by /u/ProfessionalOverall8 [link] [comments]  ( 45 min )
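    The latency arithmetic in the post above can be checked with a short sketch. The function names are hypothetical, and the linear scaling of run time with sequence length is the poster's guess taken as an assumption, not a measured result.

```python
def total_latency(models_per_sentence, seconds_per_run, sentences):
    """Sequential latency when every model runs once per sentence."""
    return models_per_sentence * seconds_per_run * sentences


def speedup_from_shorter_input(padded_length, average_length):
    """Assumption: run time scales linearly with the number of time steps."""
    return padded_length / average_length


baseline = total_latency(25, 0.2, 20)         # figures quoted in the post
speedup = speedup_from_shorter_input(50, 16)  # 50 padded steps vs. 16 words on average
estimate = baseline / speedup

print(f"baseline: {baseline:.0f} s")
print(f"estimated speedup: {speedup:.1f}x")
print(f"estimated latency: {estimate:.0f} s")
```

    Under these assumptions the variable-length change alone would bring the estimate close to, but not below, the 30-second target, so it would likely need to be combined with batching or a smaller model.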
    [D] I want to understand the broad steps for building something like Adept.AI
    From the given link, I gather that it is a large-scale Transformer trained to use digital tools like a web browser. Right now, it’s hooked up to a Chrome extension which allows it to observe what’s happening in the browser and take certain actions, like clicking, typing, and scrolling, etc. I am interested in knowing the broad steps involved in building something like this. submitted by /u/smred123 [link] [comments]  ( 43 min )
    [Discussion] ChatGPT and language understanding benchmarks
    The general consensus seems to be that large language models, and ChatGPT in particular, have a problem with accuracy and hallucination. As compared to what, is often unclear, but let's say as compared to other NLP methods of question answering, language understanding or as compared to Google Search. I haven't really been able to find any reliable sources documenting this accuracy problem, though. The SuperGLUE benchmark has GPT-3 ranked #24, not terrible, but outperformed by old models like T5, which seems odd. GLUE nothing. SQUAD nothing. So, I'm curious: Is there any benchmark or metric reflecting the seeming step-function made by ChatGPT that's got everyone so excited? I definitely feel like there's a difference between gpt-3 and chatGPT, but is it measurable or is it just vibes? Is there any metric showing ChatGPT's problem with fact hallucination and accuracy? Am I off the mark here looking at question-answering benchmarks as an assessment of LLMs? Thanks submitted by /u/mettle [link] [comments]  ( 46 min )
    [D]Are There Studies on text-davinci-003's Zero/Few-shot Performance on Various Academic Benchmarks?
    Has anyone come across studies on GPT3 text-davinci-003's zero/few-shot performance over various NLP benchmarks and how they compare to current SoTA? E.g GLUE, SuperGLUE and over more classic ones like CoNLL 2003 NER. I thought it would be pretty interesting to see how far zero/few-shot learning with LLM has progressed with RLHF and instruction tuning. Am surprised that nobody has done such a benchmark yet. submitted by /u/gamerx88 [link] [comments]  ( 42 min )
    [D] Sparse Ridge Regression
    Hi all! Given $X \in \mathbb{R}^{N_x}$, $Y \in \mathbb{R}^{N_y}$ and $\beta \in \mathbb{R}^{+}$, the ridge regression solution is $W = Y X^\top (X X^\top + \beta I)^{-1}$ (with the Moore–Penrose pseudoinverse), where $A = Y X^\top$ and $B = X X^\top + \beta I$, so $W = A B^{-1}$. Suppose we keep an arbitrary subset of fewer than $N_x$ indices/units, i.e. only some columns of $A$ and the matching rows and columns (crosses) of $B$, and set the rest of $A$ and $B$ to zero. Will sparsifying $A$ and $B$ in this way break the ridge regression solution $W = A B^{-1}$? If so, are there ways to avoid it? Many thanks! submitted by /u/antodima [link] [comments]  ( 43 min )
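    One way to look at the question, as a sketch under stated assumptions: X is taken to be a data matrix with one row per unit, S is a hypothetical index set of kept units (notation introduced here, not in the post), rows and columns are permuted so that S comes first, and the inverse of the masked B is replaced by its Moore–Penrose pseudoinverse.

```latex
% Assumption: X_S denotes the rows of X indexed by the kept set S.
\[
\tilde{A} = \begin{bmatrix} Y X_S^\top & 0 \end{bmatrix},
\qquad
\tilde{B} = \begin{bmatrix} X_S X_S^\top + \beta I & 0 \\ 0 & 0 \end{bmatrix}
\]
% The masked matrix is singular, so its ordinary inverse does not exist.
% Its pseudoinverse inverts the nonzero block, which is positive definite for beta > 0:
\[
\tilde{B}^{+} = \begin{bmatrix} (X_S X_S^\top + \beta I)^{-1} & 0 \\ 0 & 0 \end{bmatrix}
\]
% Multiplying out gives the ridge solution fitted on the kept units only:
\[
\tilde{W} = \tilde{A}\,\tilde{B}^{+}
= \begin{bmatrix} Y X_S^\top (X_S X_S^\top + \beta I)^{-1} & 0 \end{bmatrix}
\]
```

    Read this way, masking A and B and using the pseudoinverse yields ridge regression restricted to the kept units. That is generally not the same as zeroing columns of the full W, and a plain inverse of the masked B is not available because the matrix is singular.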
    [R] A Robust Hypothesis Test for Tree Ensemble Pruning
    I'm looking for help/feedback with this paper. Please let me know if the method is interesting and if there are ways to improve it! https://arxiv.org/abs/2301.10115 Abstract: Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. One of the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out of sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms. submitted by /u/asi_dm [link] [comments]  ( 43 min )
    [R] Train CIFAR10 in under 10 seconds on an A100 (new world record!)
    https://github.com/tysam-code/hlb-CIFAR10 submitted by /u/tysam_and_co [link] [comments]  ( 53 min )
  • Open

    Top Android App Development Trends in 2023
    The global mobile app development revenue is $526 billion, making it one of the most flourishing industries worldwide. Android controls 73% of the market share. So, if you have plans to build an Android app, there’s no better time. However, to ensure your Android app stands out, you must keep an eye on the latest… The post Top Android App Development Trends in 2023 appeared first on Data Science Central.  ( 22 min )
    Enabling contextual computing in today’s enterprise information fabrics
    During the 1970s, Ethernet pioneer and 3Com Internet equipment company founder Bob Metcalfe was working on something called the “Data Reconfiguration Service” for the early Internet. “It was an effort to write a special purpose programming language to convert data formats,” Metcalfe said during a 2021 OriginTrail.io panel session. “And the goal was so that… The post Enabling contextual computing in today’s enterprise information fabrics appeared first on Data Science Central.  ( 21 min )
  • Open

    Amazon SageMaker built-in LightGBM now offers distributed training using Dask
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 12 min )
    Build a water consumption forecasting solution for a water utility agency using Amazon Forecast
    Amazon Forecast is a fully managed service that uses machine learning (ML) to generate highly accurate forecasts, without requiring any prior ML experience. Forecast is applicable in a wide variety of use cases, including estimating supply and demand for inventory management, travel demand forecasting, workforce planning, and computing cloud infrastructure usage. You can use Forecast […]  ( 10 min )
  • Open

    Good autocomplete
    I’m not sure whether automatic text completion on a mobile device is a net good. It sometimes saves a few taps, but it seems like it’s at least as likely to cause extra work. Although I’m ambivalent about autocomplete on my phone, I like it in my text editor. The difference is that in my […] Good autocomplete first appeared on John D. Cook.  ( 6 min )
  • Open

    Best recurrent RL library?
    Does anyone know which library has the best support for recurrent RL algorithms? It seems like many implement recurrent PPO and maybe one other recurrent algorithm and that's it. Most implementations treat recurrence as an afterthought, making it buggy and hard to extend. I'd like a library with first-class recurrent support for: DQN/Rainbow SAC PPO etc. Is anyone familiar with such a library? submitted by /u/smorad [link] [comments]  ( 42 min )
    Generative Meta-Learning for Robust Quality-Diversity Ensemble under Stochastic Rewards
    Gen-Meta is a learning-to-learn method for evolutionary illumination that is competitive against SotA methods in Nevergrad, with a much superior scalability for large-scale optimization. The key to out-of-sample robustness in portfolio optimization is quality-diversity optimization, where one aims to obtain multiple diverse solutions of high quality, rather than one. Generative meta-learning is the only portfolio optimization method that performs QD optimization to obtain a robust ensemble portfolio consisting of several de-correlated sub-portfolios. In the below image, the red line is the index to be tracked, and the blue line is the sparse portfolio ensembled from a thousand behaviorally-diverse sub-portfolios co-optimized (other lines). ​ Red Line: Tracked Index, Blue Line: Sparse Ensemble, Others: Diverse Subportfolios In Gen-Meta portfolio optimization, a Monte-Carlo optimization is performed over those portfolio candidates to reward each individual separately in randomly selected historical periods. To further optimize the portfolio robustness, the portfolio weights of the candidates are heavily corrupted first by adding noise and then dropping out the vast majority of their weights. I previously open-sourced the application of Gen-Meta in sparse index-tracking. Hence, I invite you to do your ablation study to see how each technique affects the out-of-sample robustness. The following repository includes comments on those critical techniques performed to obtain a robust ensemble from behaviorally-diverse high-quality portfolios co-optimized with Gen-Meta. The codes for Gen-Meta in sparse index-tracking The comparison in-between Gen-Meta & Nevergrad submitted by /u/k_yuksel [link] [comments]  ( 42 min )
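    The corruption step described above (add noise, then drop most of the weights) can be illustrated with a small sketch. This is not the Gen-Meta implementation: the function name, the Gaussian noise, the clipping to non-negative weights, the keep fraction and the final renormalization are all assumptions made for illustration.

```python
import random


def corrupt_weights(weights, noise_std=0.5, keep_fraction=0.1, rng=None):
    """Add noise to portfolio weights, drop most of them, and renormalize.

    Assumptions (not from the post): Gaussian noise, long-only weights
    (negative values are clipped to zero), and weights rescaled to sum to 1.
    """
    rng = rng or random.Random()
    noisy = [max(0.0, w + rng.gauss(0.0, noise_std)) for w in weights]

    # Keep only a small random subset of positions; zero out the rest.
    n_keep = max(1, int(len(noisy) * keep_fraction))
    kept = set(rng.sample(range(len(noisy)), n_keep))
    sparse = [w if i in kept else 0.0 for i, w in enumerate(noisy)]

    total = sum(sparse)
    if total == 0.0:
        # Every kept weight was clipped to zero: fall back to equal weights.
        return [1.0 / n_keep if i in kept else 0.0 for i in range(len(weights))]
    return [w / total for w in sparse]


candidate = [1.0 / 20] * 20  # hypothetical equal-weight portfolio of 20 assets
corrupted = corrupt_weights(candidate, rng=random.Random(0))
print(sum(1 for w in corrupted if w > 0.0), "non-zero weights, sum =", round(sum(corrupted), 6))
```

    Scoring each candidate on its corrupted weights rather than its clean ones is one plausible reading of how the noise and dropout are meant to encourage out-of-sample robustness.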

  • Open

    We built a browser extension that unlocks browser mode capabilities using ChatGPT: MULTI·ON: AI Web Co-Pilot powered by ChatGPT
    submitted by /u/DragonLord9 [link] [comments]  ( 41 min )
    AI Dream 150 - MY HEAVENLY DREAM BY AI - Part6 TEASER - AI Video vqgan ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Just published my new edu-tainment video on A.I. Tried to make it funny but also drop some mind-blasting facts!
    Things the video covers: What is intelligence? What is A.I.? What is the best currently available and what are the benefits? How does it work? What are the downsides? The increasing speed of human technological advancement Why A.I. actually terrifies me! (Some scenarios) I hope you enjoy it! submitted by /u/casualbob_uk [link] [comments]  ( 41 min )
    AI will drastically reduce developer jobs.
    Farmers still exist today, but in drastically smaller numbers than two centuries ago. Modern farming machinery and techniques did not replace farmers but made the industry much less labor intensive. Nowadays programming is a labor-intensive activity with relatively high salaries. AI is introducing the possibility of doing this activity, which costs companies worldwide billions of dollars in programmers' salaries, much more efficiently. In my opinion, this is the goal of companies like OpenAI. They know that they can’t take humans out of the loop because current AI is not able to substitute for all the human cognitive capabilities that intervene in a software developer's daily job, like talking with the client, figuring out what he wants and translating it into functional requirements. But nonetheless, they think they have a clear shot at making programming a non-labor-intensive activity, like farming is today. Of course, this is a compelling multibillion-dollar business opportunity that is attracting increasing capital from the tech and financial sectors. submitted by /u/masterile [link] [comments]  ( 43 min )
    Found this list of AI tools. It was nice discovering some ai video editing tools i have not heard before.
    submitted by /u/lshic [link] [comments]  ( 40 min )
    Elon Musk Says AI Will Be Able to Simulate Consciousness; for me that's very difficult to happen, but idk
    https://youtu.be/Y6gXZ61NnOE submitted by /u/sigmabruuh [link] [comments]  ( 42 min )
    Best image generators for graphic Novel?
    I'm trying to see if I can use an image generator to illustrate a story I am writing. The story itself is action-packed and dramatic, so I thought I could illustrate it as a graphic novel generated by AI, but I am worried about the consistency and quality of the images. Also, I can't really use passages from the story as prompts because the generators don't seem to illustrate the scenes well. Any suggestions? submitted by /u/swagonflyyyy [link] [comments]  ( 41 min )
    📌[Searchcolab] Site to share and find Colab Notebooks, CKPT and SafeTensors
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    What ChatGPT Could Mean for the Metaverse
    submitted by /u/Zurevu [link] [comments]  ( 40 min )
    AI (GPT) where you can ask data questions in English and automatically generate the answer - as if you have your own personal automated data analyst
    submitted by /u/lfogliantis [link] [comments]  ( 46 min )
    Big Tech was moving cautiously on AI. Then came ChatGPT.
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    AI Dream 150 - COSMIC STRUCTURES Part5 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    📌[Searchcolab] Disco Diffusion v5.6. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Using Ai to CLONE my Voice!
    https://www.youtube.com/watch?v=4RUQ_vZ-og0 This Ai can instantly clone your voice and sound exactly like you! It's both amazing and terrifying. The Ai is called ElevenLabs - it has features like text to speech and voice cloning! submitted by /u/Peter3tv33 [link] [comments]  ( 41 min )
    Google AI can create music in any genre from a text description
    submitted by /u/Mental_Character7367 [link] [comments]  ( 40 min )
    Knight Rider game. Midjourney, ChatGPT, Figma.
    submitted by /u/sidianmsjones [link] [comments]  ( 40 min )
    AI generated music video
    submitted by /u/aquin1313 [link] [comments]  ( 40 min )
    What if Patrick Bateman was a Data Scientist? An AI-Generated Video
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
  • Open

    Why did the original ResNet paper not use dropout?
    The ResNet paper by Kaiming He et al. does not use dropout for the models. A lot of models prior to ResNets, such as AlexNet and VGGNet, gained from using dropout. Why did the authors choose not to use dropout for ResNets? Is it because they use L2 regularization (weight decay) and batch normalization, which are forms of regularization that can substitute for dropout? submitted by /u/V1bicycle [link] [comments]  ( 41 min )
    Answer this please
    so, I have been learning what DL is and how NN learns to do stuff. From what I understand is the repeated iteration will take random weights and at some point those weights will be kinda perfect for the given task (plz correct me if i'm wrong) Ok, so lets take an example of a task like path finding AI, so we make a NN and train it to go from point A to point B, now it is trained and doing nice and goes to point b perfectly, SO here the weights are set to go from point A to point B right? What if we give the point B somewhere else, How will the AI get perfect weights as the current weights are only perfect for current point B What if we put an obstacle in between point A and B, how will the NN set weights, or is it something like a range of weights which are perfect for any given task for NN ​ IDK if I explained it right, plz comment if you have question about my question, and answer also💕 submitted by /u/Severe-Improvement32 [link] [comments]  ( 42 min )
  • Open

    [R] Industrial Case Study of GNNs with PyTorch Geometric for Document Understanding
    submitted by /u/how-it-is- [link] [comments]  ( 42 min )
    [R] Incorrect Ranking of Vessel Segmentation Algorithms
    In a recent article, we reviewed dozens of image segmentation algorithms and pointed out mathematically that in many cases the reported performance scores could not be the results of the evaluation methods claimed by the authors. The scores are primary indicators of value and serve as measures of the state-of-the-art to be outperformed by new algorithms. Unfortunately, algorithm rankings turned out to be incorrect in many of 100 papers and the problem is systematic. The pressure to outperform flawed performance scores to get published keeps the trend on-going. How should the community deal with a phenomenon like this: flaws uncovered, factual, undeniable yet on-going? Is the 258th algorithm proposed for a problem more valuable than reproducing a highly cited article? Should it be mandatory to share source code? Is there a merit in developing consistency checks like the ones we did? Any comments are welcome! https://arxiv.org/abs/2111.03853 submitted by /u/AttilaFazekas [link] [comments]  ( 43 min )
    [D] Remote PhD
    Hi all, During the pandemic many software companies transitioned their workforce to "fully-remote" or "partially-remote"; therefore, I was wondering if any reputable institutions offer a remote CS PhD? For context, I know of several individuals who have sorted out remote work with their PIs on a per-person basis (typically after the first 1-2 years of study), but I am not aware of any labs or programs that advertise remote study. Thank you in advance for the responses. Cheers, Matt submitted by /u/TheRealMrMatt [link] [comments]  ( 45 min )
    [N][R] Compiling and running GLM-130B on a local machine (4x 3090s, int4 quantization) - Author: Alex J. Champandard
    Twitter link to his post: https://twitter.com/alexjc/status/1617152800571416577?s=46&t=CMQT9rK4F1Lt7g7aX2vTJA also important in that regard: The case for 4-bit precision: k-bit Inference Scaling Laws - Tim Dettmers Paper: https://arxiv.org/abs/2212.09720 https://preview.redd.it/7nn0pfhn81fa1.jpg?width=585&format=pjpg&auto=webp&s=2d05998c32fb1eacf56c45e830047381d544f51f https://preview.redd.it/0084vhhn81fa1.jpg?width=598&format=pjpg&auto=webp&s=c9512275714964faa312e8fb2d96ab8ded7dd127 submitted by /u/Singularian2501 [link] [comments]  ( 42 min )
    [P] AI Content Detector
    Have you tried ChatGPT? It's super cool but some users are also using it to create automated content submissions and resulting in an increase in AI-generated plagiarism. I have made a tool as a college project to detect content generated using AI. Go ahead and validate your content on AI Content Detector If you are an educator worried about automated content submissions or developers worried about search engine penalties, this tool will help everyone to efficiently detect content generated using AI. submitted by /u/YoutubeStruggle [link] [comments]  ( 44 min )
    [D] what is roughly the cost of human-annotation vs compute to adapt a LLM?
    Let's say I pull a pre-trained LLM off of huggingface. In broad strokes (making whatever assumptions are appropriate), what is the relative cost of getting human annotation data versus actually incorporating those data through training? I've been trying to get these stats and so far the ratio seems to be 2:1, meaning if you spent 10k dollars collecting human annotations, you should expect to spend 5k on compute (finetuning, RLHF, etc.), but I'd be happy if someone with more experience could chime in. submitted by /u/evanthebouncy [link] [comments]  ( 42 min )
    [D] AI Theory - Signal Processing?
    On This page of Meta AI research where they mention AI theory as a topic, they mention that they use techniques from Signal Processing. As someone with an Electrical Engineering background, and interests in Mathematics and AI, I found this very intriguing. Can someone tell me some of the ways signal processing has been used in AI theory? Some papers or some work done? submitted by /u/a_khalid1999 [link] [comments]  ( 47 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 43 min )
    [P] Automating a Youtube Shorts channel with Huggingface Transformers and After Effects
    I’ll try to get into detail about the implementation and difficulties in case it is useful for anyone else trying to do something similar with an applied ML project, so there’s a TLDR at the end if you’d like the short version/result. At the end of last year I convinced myself to start 2023 by creating a side-project that I'd actually finish and deploy and perhaps earn some “passive” income (spoiler, not so passive after all :P), and after some brainstorming I settled on making an automated Youtube channel about finance news since I had just gotten into investing. Shorts seemed to be more manageable and monetization is changing in February so I went with that. My rough initial idea was to get online articles, summarize them, make a basic compilation with some combination of pymovie, open…  ( 47 min )
    [D] GPT-Index vs Langchain
    Someone I work with wrote the below for our internal team (shared with permission) and I thought some here may find it helpful. Recently, I built an app that uses GPT-Index & LangChain to provide an answer to a question based on a piece of text as context. I found GPT-Index to be much easier and straightforward to integrate, but it seems like LangChain has more features and is more powerful. Here's my experience integrating both of them. GPT-Index First thing I did was review their docs to make sure I understood what GPT-Index was, what it could do, and how I was going to use it I went back and forth a couple times figuring out how I was going to use it. Then I found the quickstart guide It seemed like the quickstart guide would work so I followed the guide and after a few tries, …  ( 45 min )
    [P] Targeted Summarization - A tool for information extraction
    submitted by /u/helliun [link] [comments]  ( 44 min )
    [D] How do people keep up with ML news that is not NLP related?
    Lately, NLP is taking up most of the public space, much of AI news is focused on LLM after Chat-GPT took the spotlight. How do non-NLP people keep up with news? I recently saw a post on reddit where tree models are still being improved. There are other topics too, like the recent trend in Model Explainability which feels to have slowed down. I'd guess this all gets into the more categorical questions which I am wrapping up with 'How do YOU get your ML news'? How does information gathering differ between those in Applied ML and AI researchers (or even further, between those in Business Analytics and those in more 'AI' fields) What sort of interesting things are out there in the world of ML now? (model or non-model related) In looking for Use Cases, does this partially come down to your field? (Finance reads finance news, pharma reads pharma news) ​ Many of the AI/ML Newsletters which I subscribed to when I was less experienced seemed to be full of variety, but as they are all converging to NLP recently maybe it is time to cleanse the subscriptions, or find some new resources. submitted by /u/shaner92 [link] [comments]  ( 44 min )
    [R] InstructPix2Pix: Learning to Follow Image Editing Instructions
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 45 min )
    [P] We Built an ML Search Engine that can find exact timestamps for anything on Youtube using OpenAI Whisper and UKPLab's SBERT Sentence Transformers
    submitted by /u/tomiwa1a [link] [comments]  ( 44 min )
    [Discussion] Stable Diffusion Models with Subject/Keyword References
    I was looking for research that centers around Stable Diffusion models but can be trained with seed images of a specific subject, so that if someone refers to a keyword like "Me" or "I" it would then produce images relative to the keyword of interest. Something like "I am swimming in a beautiful ocean with mountains in the background and wearing a speedo". Then the person subject in the photo would be myself since I referenced "I". submitted by /u/sambrojangles [link] [comments]  ( 43 min )
  • Open

    Small-scale automation
    Saving keystrokes is overrated, but maintaining concentration is underrated. This post is going to look at automating small tasks in order to maintain concentration, not to save time. If a script lets you easily carry out some ancillary task without taking your concentration off your main task, that’s a big win. Maybe the script only […] Small-scale automation first appeared on John D. Cook.  ( 5 min )
  • Open

    What to do when the simulation fails
    Hi, when my agent succeeds in achieving the goal, it receives done = True. But when it fails in the simulation, for example by crashing into an obstacle, should I end the episode and set done to "True", or is it okay if I just give it a penalty like reward -= 5 at those specific steps? Thanks submitted by /u/sonlightinn [link] [comments]  ( 42 min )
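    The two options in the question can be put side by side in a toy sketch. CorridorEnv, its parameters and its reward values are hypothetical and not taken from the poster's simulation; which option fits depends on whether the simulation can meaningfully continue after a crash.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    position: int
    reward: float
    done: bool


class CorridorEnv:
    """Hypothetical 1-D corridor: start at 0, goal to the right, obstacle to the left."""

    def __init__(self, goal=3, obstacle=-2, crash_penalty=5.0, terminate_on_crash=True):
        self.goal = goal
        self.obstacle = obstacle
        self.crash_penalty = crash_penalty
        self.terminate_on_crash = terminate_on_crash
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        """Move by -1 (left) or +1 (right) and report reward and termination."""
        previous = self.position
        self.position += action
        if self.position == self.goal:
            return StepResult(self.position, 1.0, True)
        if self.position == self.obstacle:
            # The penalty is applied in both variants; only termination differs.
            if not self.terminate_on_crash:
                self.position = previous  # bounce back so the episode can go on
            return StepResult(self.position, -self.crash_penalty, self.terminate_on_crash)
        return StepResult(self.position, 0.0, False)


def run(env, actions):
    """Play the given actions until the episode ends; return the last step result."""
    env.reset()
    result = None
    for action in actions:
        result = env.step(action)
        if result.done:
            break
    return result


ending = run(CorridorEnv(terminate_on_crash=True), [-1, -1, 1])
continuing = run(CorridorEnv(terminate_on_crash=False), [-1, -1, 1])
print(ending)
print(continuing)
```

    In the first variant the crash is a terminal state, so the agent never receives reward after it; in the second the agent pays the penalty and keeps acting, which only makes sense if the simulator is still in a valid state after the collision.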

  • Open

    InstructPix2Pix Video: One Vase, Many Artists
    submitted by /u/anitakirkovska [link] [comments]  ( 40 min )
    Make-a-Video3D: Meta generates 3D scenes from text
    submitted by /u/henlo_there_fren [link] [comments]  ( 40 min )
    AI Characters Become Self-Aware Ft. MrBeast - (Funny Moments)
    submitted by /u/SnowDustHD [link] [comments]  ( 40 min )
    AI Dream 150 - INTERSTELLAR PORTAL Part4 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI will also help you find a job: job offers related to this area grow by 31%
    submitted by /u/nikko_fan [link] [comments]  ( 40 min )
    What are some AI models out there that are not owned by large businesses that require my personal info to use?
    submitted by /u/temporaryAMA [link] [comments]  ( 40 min )
    Explain Cinderella as if you are Donald Trump
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Should I be learning data analysis now if my goal to become an AI engineer/researcher?
    Hello, I'm currently self-studying math with the goal of learning up to linear algebra and then returning to school for a BS in mathematics, followed by a graduate and then a postgraduate degree in a subject relating to AI research. Given this goal, I was wondering if I should also be self-studying data analysis along with the math. I reason that this might be good since I'd be getting hands-on experience with data and I might even be able to do it as a job while I study. In my math studies I'm currently on Algebra 2, so if I should be learning data analysis, can I start learning it now or would it be better to start after I've learned more advanced math like calculus? submitted by /u/multioutletplug [link] [comments]  ( 42 min )
    Hello guys. Could you please share experience with learning and working with Neural networks? I have already done with some of the tutorials, but want to know which language is best suitable to create AI (Python, Java).
    submitted by /u/Your_mag [link] [comments]  ( 42 min )
    Google's MusicLM is Astoundingly Good at Making AI-Generated Music, But They're Not Releasing it Due to Copyright Concerns
    submitted by /u/HODLTID [link] [comments]  ( 45 min )
    Research
    Hello everyone, I am currently doing college research on artificial intelligence and ethics, looking to specialize in the field of bias and fairness in AI. I am looking for someone with some knowledge of the field for a brief interview in which I would ask a couple of questions related to the topic. If you are interested, kindly @ me. Thanks :) submitted by /u/AyanT_01 [link] [comments]  ( 40 min )
    ChatGPT is not all that good
    I get inaccurate information from ChatGPT when I ask questions in areas where I have a good bit of knowledge. A couple of times it has cited specific books or papers by title and author and, as far as I can tell, the books/papers don't exist. The second time it did this, I asked it why it makes these kinds of odd mistakes and it said: "As a language model, my information is based on the text that I have been trained on, and I try to provide the most accurate and relevant information based on that data. However, I am not infallible and sometimes make mistakes or provide information that is inaccurate or out-of-date. I apologize for any inaccuracies in my previous responses, and for any confusion that they may have caused. It's important to know that the information provided by AI models like me should be taken as a starting point for further research and not as the final answer. It's always a good idea to consult with experts or verify information with multiple sources. Again, I apologize for any inaccuracies in my previous statements and any confusion they may have caused." But it is still weird that it would make up specific books and papers; I don't see why a language model would do that. Edit: Maybe the text that it is trained on references specific books and papers that don't exist. And I guess those would be texts that are not indexed by Google and the other places where I am doing my fact checking, so I cannot figure out where it is getting this stuff. submitted by /u/facinabush [link] [comments]  ( 46 min )
    Made a website to teach you how to make passive/active income using AI!
    I recently created a website called https://cashwithai.com that is dedicated to helping people learn how to make money using AI like ChatGPT. The website offers a variety of resources, including a QuickStart guide, case studies, and tips and tricks for monetizing AI-generated content. Additionally, I'm offering free 1-on-1 consultations to anyone who is looking for personalized advice and guidance on how to make money with AI. I'm not running ads or charging; I run purely off donations. Let me know if you have any questions! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    New Incredible Text-To-Music Generation Model By Google
    submitted by /u/SupPandaHugger [link] [comments]  ( 40 min )
    What is the information content in a logic statement?
    I recently had a lot of fun working with my talented peers at IBM Research putting together a point of view on this fascinating question. I wanted to share our work with you and get your feedback. ibm.biz/logic_and_information submitted by /u/Consistent_Listen127 [link] [comments]  ( 40 min )
    AI will not replace marketers, but marketers who use AI will replace those who don’t.
    AI will not replace marketers, but marketers who use AI will replace those who don’t. submitted by /u/TheVellerShow [link] [comments]  ( 40 min )
    OpenAI Hired Developers to Train its AI to Replace Developers
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 24 min )
    image to voice ai stable image to voice
    submitted by /u/VNKT-FOREVER [link] [comments]  ( 40 min )
    Real world applications of AI
    There have been many advancements, especially in CV, NLP etc. But I don't see many new AI models that are solving real world problems. There have been a lot of advancements in AI in RL, DL etc., but I rarely see new applications in the real world. All I see are text to image models, advanced chat bots, better game playing AIs etc. These are definitely amazing, but I wanna see new stuff which is making a good impact on the real world. AlphaFold, driverless cars etc. are the sort of things I am looking for. I don't know if I am just unaware of new stuff in AI which solves practical problems, or whether not much new stuff is happening in those areas. So, I would be glad to know more about how AI is being used in new ways to solve real-world problems, or about any new AI research trying to tackle a real world problem. Sry if it's a stupid question, and sry for spamming "real world" too much. submitted by /u/Unhappy_Version7565 [link] [comments]  ( 44 min )
    Want to catch up on the breakthroughs in AI the last 10 years. What should I read?
    Hi guys, I have a technical background and have studied some AI in the past. I'd like to catch up on the latest developments in AI over the last ~10 years. Do you have any recommendations on what to read? I was thinking of maybe trying to get the top most cited papers in the last 10 years and reading them. But I'm not sure where would be the best place to find that. Any suggestions? Thanks in advance. submitted by /u/AlexWD [link] [comments]  ( 41 min )
    spooky season 💅
    submitted by /u/Rich_Dragon17 [link] [comments]  ( 40 min )
    BuzzFeed Plans to 'Hire' ChatGPT as Newest Content Creator
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    [P] tiny-diffusion: a minimal PyTorch implementation of probabilistic diffusion models for 2D datasets
    submitted by /u/tanelai [link] [comments]  ( 43 min )
    [D] Running a small Bloom checkpoint on a mini pc.
    Has anyone been able to run a small LLM checkpoint in commodity hardware, like a mini PC? If so, what were your specs? submitted by /u/onedjscream [link] [comments]  ( 42 min )
    [D] could multiple-input transformers reduce the pain of the training data acquisition problem?
    so it's big pop-sci news that we are running out of quality textual training data (soft-paywalled article, but you get the idea) to produce chinchilla-optimal language models, and they appear to continue learning new abilities as data and parameter size increase. when an infant learns what a cat is, it is not only described, but the infant can see it and understand its form and behavior in a way it can then go on to describe and extrapolate from (even if they are blind, they can touch it and understand its shape and feel its fur). LLMs have to do this the hard way: their generalized understanding of the shape and behaviour of a cat comes from textual descriptions of them (and they would need quite a lot in order to understand!) most of the research i have seen into multiple input transformer models has been with the purpose of task completion (google's embodied language model robot butlers etc, which often use textual descriptions fed to a normal LLM, see https://innermonologue.github.io/ ) or image recognition and understanding (such as in CLIP) but not necessarily applying it to textual completion, which seems like it could benefit from a more visual understanding of the world. so, in the medium or short term, to improve performance on text-completion tasks, what are your thoughts on using image training as well as textual to improve generalization for LLMs with fewer text tokens on a new architecture? (also, please excuse any ignorance i may possess: i'm a bit of an armchair ai enjoyer) submitted by /u/Dankmemexplorer [link] [comments]  ( 44 min )
    GAN Training Gradient questions [D]
    The main reason why this is not in the simple questions thread is the need to include an image. Here is an image of my generator's and discriminator's gradients being logged onto wandb. As you can see, they have these weird chasms, where the distribution of the gradients becomes very close to zero. These chasms are correlated for the discriminator and generator and seem very regular. Has anyone experienced anything like this and maybe has a hunch about what might be the cause? https://preview.redd.it/3qipyez6stea1.png?width=624&format=png&auto=webp&s=f5fb39803082add2fd78979e880e6e784b9e4c0c submitted by /u/Hub_Pli [link] [comments]  ( 43 min )
    [D] Goodness of fit question
    For regressions, R-squared and Adj. R-Squared are typically considered the primary goodness-of-fit measures. But in many supervised machine-learning models, RMSE is the main measure that I keep running across. For example, decision tree models that I create in R using Rpart do that. So, my question is how to compare the predictive accuracy of OLS regression models that report R-sq to equivalent Rpart regression trees that report RMSE. submitted by /u/bridgeton_man [link] [comments]  ( 43 min )
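    Both measures are built from the same residual sum of squares, so either one can be computed for any regression model once predictions on a common held-out set are available. The sketch below illustrates that relationship in Python rather than R; the function names and the toy numbers are made up for illustration and are not taken from the post.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(SSE / n)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - sse / sst)

def r_squared_from_rmse(rmse_value, y_true):
    """Convert RMSE to R^2 on the same data: 1 - RMSE^2 / Var(y), population variance."""
    y_true = np.asarray(y_true, float)
    return float(1.0 - rmse_value ** 2 / np.var(y_true))

# Hypothetical held-out targets and predictions from one model.
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

model_rmse = rmse(y_test, y_hat)
model_r2 = r_squared(y_test, y_hat)
model_r2_via_rmse = r_squared_from_rmse(model_rmse, y_test)
```

    Note that an in-sample R-squared reported by an OLS summary and an RMSE measured on test data are not directly comparable; scoring both models on the same held-out rows with the same metric avoids that mismatch.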
    [N] OpenAI has 1000s of contractors to fine-tune codex
    submitted by /u/yazriel0 [link] [comments]  ( 45 min )
    [P] GPT JupyterLab - JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells.
    Hi everyone, I made a JupyterLab extension to use OpenAI’s GPT models for code and text completion on your notebook cells. This extension passes your current notebook cell to the GPT API and completes your code/text for you. You can customize the GPT parameters in the Advanced Settings menu. I made this extension when I couldn't find any Copilot/Codex extensions for JupyterLab. It doesn't make sense that ML folks don't have an easy way to use AI generated code in their own tools. VS Code does allow you to use Copilot, but I've gotten used to Jupyter and a lot of ML/DS folks I know still prefer using Jupyter over VS Code. Installation: pip install gpt_jupyterlab GitHub Repo: https://github.com/henshinger/gpt-jupyterlab/ Demo: GPT JupyterLab Demo Note: You will need your own OpenAI API Key to use this extension. Would love to get your feedback! submitted by /u/henshinger [link] [comments]  ( 44 min )
    [P] Launching my first ever open-source project and it might make your ChatGPT answers better
    I am building an open-source ML observability and refinement toolkit which recently got investment from Y Combinator. The tool helps ML practitioners to: 1. Understand how their models are performing in production 2. Catch edge-cases and outliers to help them refine their models 3. Customise the tool according to their needs (hence, open-source) 4. Bring data security to the forefront (hence, self-hosted) You can check out the project at https://github.com/uptrain-ai/uptrain and I would love to hear feedback from the community submitted by /u/Vegetable-Skill-9700 [link] [comments]  ( 43 min )
    [P] Parse research papers into structured data
    ​ https://preview.redd.it/1t7spoqxdsea1.png?width=1920&format=png&auto=webp&s=b4643e418d942260b16019ee250edf56a4336b4b paperai is a semantic search and workflow application for medical/scientific papers. It can be used to take a set of research papers, parse the content and turn it into structured data. paperetl is the underlying library that parses basic metadata such as title, publication and date out of the papers. https://preview.redd.it/5jywynarfsea1.png?width=1084&format=png&auto=webp&s=437b6cdb65fbf12d6d238f84fd94ec4b85dec93b In addition to standard metadata, paperai can also run extractive queries to build additional columns. https://preview.redd.it/8rzfj016esea1.png?width=1138&format=png&auto=webp&s=ccd73c34f0f3001dfe05a0f8c480cf73451dd447 Example notebooks and Docker files can be found on GitHub. paperai | paperetl submitted by /u/davidmezzetti [link] [comments]  ( 43 min )
    [P] Annotating text with sparse human annotations and different length chunks
    I want to automate the annotation of a domain-specific text (complicated contracts) by finetuning a BERT model. However, the annotated text I've been provided by the domain experts has been sparsely annotated (i.e. paragraphs 40-48 have been fully annotated, while 15 other paragraphs only have certain classes annotated for certain words). Most paragraphs (about 70% of the entire corpus) have nothing annotated. Another complication is that for 1 class, the entire paragraph should be annotated in this class, while for others it's a single word or a sentence. There are 7 classes in total and in the end, all tokens should be annotated to one of the 7 classes. 6 out of 7 classes are also pretty domain-specific and not something like 'location' or 'person' or POS. I've been thinking about using an annotation framework (i.e. LabelStudio or Prodigy) that supports custom models (i.e. finetuned BERT) and active learning to rapidly increase annotated texts by speeding up the work of human annotators (who are domain experts and usually don't have a lot of time for this). However, it's pretty unclear whether my use case would be supported by this, especially with the issue of sparse text. I've also considered making the problem easier by using another model for the specific class that applies to an entire paragraph, and/or by using the fully annotated paragraphs for finetuning and the sparse paragraphs for validation. A final consideration is using GPT-3, but I'm not sure how/if it is able to classify entire sentences/paragraphs with multiple classes and how the prompt should be formatted. Any suggestions/ideas? submitted by /u/Background_Claim7907 [link] [comments]  ( 44 min )
    [P] Framework agnostic python package for running RWKV, RNN based models.
    https://pypi.org/project/rwkvstic/ Currently supports tensorflow, pytorch, jax Also has support for tensor streaming, 8bit jit-quant and multi-gpu. Run RWKV 7B on 8GB of vram or 14B on 16GB of vram. submitted by /u/hazardous1222 [link] [comments]  ( 42 min )
    [R] META presents MAV3D — text to 3D video
    submitted by /u/SpatialComputing [link] [comments]  ( 45 min )
    [D] Could forward-forward learning enable training large models with distributed computing?
    One problem with distributed learning with backprop is that the first layer can't update its weights until the computation has travelled all the way down to the last layer and then backpropagated back up. If all your layers are on different machines connected by a high-latency internet connection, this will take a long time. In forward-forward learning, learning is local - each layer operates independently and only needs to communicate with the layers above and below it. The results are almost-but-not-quite as good as backprop. But each layer can immediately update its weights based only on the information it received from the previous layer. Network latency no longer matters; the limit is just the bandwidth of the slowest machine. submitted by /u/currentscurrents [link] [comments]  ( 46 min )
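    To make the "learning is local" point concrete, here is a minimal NumPy sketch of a single layer that updates its weights from its own inputs alone. It assumes a common formulation of forward-forward training (goodness as the sum of squared ReLU activations, pushed above a threshold for positive data and below it for negative data); the class name, hyperparameters, and toy data are hypothetical, and nothing here models the networking side of the question.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class FFLayer:
    """One layer trained with a local, forward-forward style objective (illustrative)."""

    def __init__(self, n_in, n_out, theta=2.0, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
        self.theta = theta  # goodness threshold (assumed value)
        self.lr = lr

    def forward(self, x):
        return np.maximum(x @ self.W, 0.0)  # ReLU activations

    def goodness(self, x):
        return np.sum(self.forward(x) ** 2, axis=1)

    def local_step(self, x_pos, x_neg):
        """Update W using only this layer's inputs; no gradient arrives from later layers."""
        grad = np.zeros_like(self.W)
        for x, sign in ((x_pos, 1.0), (x_neg, -1.0)):
            h = self.forward(x)
            g = np.sum(h ** 2, axis=1)
            # loss = log(1 + exp(-sign * (g - theta))), hence
            # dloss/dg = -sign * sigmoid(-sign * (g - theta)); dg/dz = 2 * h for ReLU.
            dloss_dg = -sign * sigmoid(-sign * (g - self.theta))
            grad += x.T @ (dloss_dg[:, None] * 2.0 * h) / len(x)
        self.W -= self.lr * grad

# Toy positive and negative inputs that occupy different input features.
x_pos = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
x_neg = np.array([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])

layer = FFLayer(n_in=4, n_out=8)
for _ in range(300):
    layer.local_step(x_pos, x_neg)

pos_goodness = layer.goodness(x_pos).mean()
neg_goodness = layer.goodness(x_neg).mean()
```

    In a stacked version each layer would run the same `local_step` on the previous layer's output, which is why a layer never has to wait for a backward pass from the layers after it.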
  • Open

    Laptop Recommendations for RL
    I am looking to buy a laptop for my RL projects and I wanted to know what people in the industry recommend for training models locally and how significant the OS, CPU and GPU really are. submitted by /u/PleasantBase6967 [link] [comments]  ( 43 min )
    The value of RL feedback on language models: "[Character.ai] engagement rose by more than 30 percent." --Noam Shazeer
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Optimal machine to run 350M language model on?
    Trying to set up a discord bot. submitted by /u/Ah_Books [link] [comments]  ( 40 min )
  • Open

    Don't overfit the history -- Recursive time series data augmentation. (arXiv:2207.02891v2 [cs.LG] UPDATED)
    Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we need to understand that we fit our model on available data, which is a unique realized history. Training on a single realization often induces severe overfitting lacking generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call Recursive Interpolation Method, denoted as RIM. New samples are generated using a recursive interpolation function of all previous values in such a way that the enhanced samples preserve the original inherent time series dynamics. We perform theoretical analysis to characterize the proposed RIM and to guarantee its test performance. We apply RIM to diverse real world time series cases to achieve strong performance over non-augmented data on regression, classification, and reinforcement learning tasks.  ( 2 min )
    Interaction Decompositions for Tensor Network Regression. (arXiv:2208.06029v2 [cs.LG] UPDATED)
    It is well known that tensor network regression models operate on an exponentially large feature space, but questions remain as to how effectively they are able to utilize this space. Using a polynomial featurization, we propose the interaction decomposition as a tool that can assess the relative importance of different regressors as a function of their polynomial degree. We apply this decomposition to tensor ring and tree tensor network models trained on the MNIST and Fashion MNIST datasets, and find that up to 75% of interaction degrees are contributing meaningfully to these models. We also introduce a new type of tensor network model that is explicitly trained on only a small subset of interaction degrees, and find that these models are able to match or even outperform the full models using only a fraction of the exponential feature space. This suggests that standard tensor network models utilize their polynomial regressors in an inefficient manner, with the lower degree terms being vastly under-utilized.  ( 2 min )
    MusicLM: Generating Music From Text. (arXiv:2301.11325v1 [cs.SD])
    We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.  ( 2 min )
    f-divergences and their applications in lossy compression and bounding generalization error. (arXiv:2206.11042v3 [cs.IT] UPDATED)
    In this paper, we provide three applications for $f$-divergences: (i) we introduce Sanov's upper bound on the tail probability of the sum of independent random variables based on super-modular $f$-divergence and show that our generalized Sanov's bound strictly improves over the ordinary one, (ii) we consider the lossy compression problem which studies the set of achievable rates for a given distortion and code length. We extend the rate-distortion function using mutual $f$-information and provide new and strictly better bounds on achievable rates in the finite blocklength regime using super-modular $f$-divergences, and (iii) we provide a connection between the generalization error of algorithms with bounded input/output mutual $f$-information and a generalized rate-distortion problem. This connection allows us to bound the generalization error of learning algorithms using lower bounds on the $f$-rate-distortion function. Our bound is based on a new lower bound on the rate-distortion function that (for some examples) strictly improves over previously best-known bounds.  ( 2 min )
    Overcoming the Pitfalls of Prediction Error in Operator Learning for Bilevel Planning. (arXiv:2208.07737v2 [cs.AI] UPDATED)
    Bilevel planning, in which a high-level search over an abstraction of an environment is used to guide low-level decision-making, is an effective approach to solving long-horizon tasks in continuous state and action spaces. Recent work has shown how to enable such bilevel planning by learning action and transition model abstractions in the form of symbolic operators and neural samplers. In this work, we show that existing symbolic operator learning approaches fall short in many natural environments where agent actions tend to cause a large number of irrelevant propositions to change. This is primarily because they attempt to learn operators that optimize the prediction error with respect to observed changes in the propositions. To overcome this issue, we propose to learn operators that only model changes necessary for abstract planning to achieve the specified goal. Experimentally, we show that our approach learns operators that lead to efficient planning across 10 different hybrid robotics domains, including 4 from the challenging BEHAVIOR-100 benchmark, with generalization to novel initial states, goals, and objects.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
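    The cost trade-off described above can be illustrated for the two-sample (MMD) case. The sketch below averages the usual MMD core function over a chosen subset of index pairs: taking all pairs gives the quadratic-time complete $U$-statistic, while taking only pairs at a few fixed offsets gives a linear-time incomplete one. The Gaussian kernel, the offset-based design, the equal sample sizes, and all function names are assumptions made for illustration; this is not the authors' MMDAggInc implementation and it omits the bandwidth aggregation and the wild bootstrap.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Row-wise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    sq_dist = np.sum((a - b) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_incomplete(x, y, pairs, bandwidth=1.0):
    """Average h(i, j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i) over `pairs`."""
    i, j = pairs[:, 0], pairs[:, 1]
    h = (gaussian_kernel(x[i], x[j], bandwidth)
         + gaussian_kernel(y[i], y[j], bandwidth)
         - gaussian_kernel(x[i], y[j], bandwidth)
         - gaussian_kernel(x[j], y[i], bandwidth))
    return float(h.mean())

def all_pairs(n):
    """Every pair i < j: n * (n - 1) / 2 terms, quadratic cost."""
    i, j = np.triu_indices(n, k=1)
    return np.stack([i, j], axis=1)

def offset_pairs(n, n_offsets=1):
    """Pairs (i, i + r) for r = 1..n_offsets: roughly n * n_offsets terms."""
    blocks = [np.stack([np.arange(n - r), np.arange(r, n)], axis=1)
              for r in range(1, n_offsets + 1)]
    return np.concatenate(blocks, axis=0)

# Two clearly separated toy samples of equal size.
x = np.zeros((5, 2))
y = np.full((5, 2), 10.0)

linear_stat = mmd_incomplete(x, y, offset_pairs(5, n_offsets=1))
full_stat = mmd_incomplete(x, y, all_pairs(5))
```

    Raising `n_offsets` from 1 towards `n - 1` moves the statistic from the linear-time end to the complete quadratic-time end.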
    BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning. (arXiv:2206.08657v3 [cs.CV] UPDATED)
    Vision-Language (VL) models with the Two-Tower architecture have dominated visual-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from the deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose BridgeTower, which introduces multiple bridge layers that build a connection between the top layers of uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations of different semantic levels of pre-trained uni-modal encoders in the cross-modal encoder. Pre-trained with only 4M images, BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, BridgeTower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets. Code and checkpoints are available at https://github.com/microsoft/BridgeTower.  ( 2 min )
    Ontology-enhanced Prompt-tuning for Few-shot Learning. (arXiv:2201.11332v1 [cs.CL] CROSS LISTED)
    Few-shot Learning (FSL) aims to make predictions based on a limited number of samples. Structured data such as knowledge graphs and ontology libraries has been leveraged to benefit the few-shot setting in various tasks. However, the priors adopted by the existing methods suffer from missing knowledge, knowledge noise, and knowledge heterogeneity, which hinder few-shot learning performance. In this study, we explore knowledge injection for FSL with pre-trained language models and propose ontology-enhanced prompt-tuning (OntoPrompt). Specifically, we develop the ontology transformation based on the external knowledge graph to address the missing-knowledge issue, which completes structured knowledge and converts it to text. We further introduce span-sensitive knowledge injection via a visible matrix to select informative knowledge to handle the knowledge noise issue. To bridge the gap between knowledge and text, we propose a collective training algorithm to optimize representations jointly. We evaluate our proposed OntoPrompt in three tasks, including relation extraction, event extraction, and knowledge graph completion, with eight datasets. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.  ( 2 min )
    Predicting Wind-Driven Spatial Deposition through Simulated Color Images using Deep Autoencoders. (arXiv:2202.01762v3 [cs.LG] UPDATED)
    For centuries, scientists have observed nature to understand the laws that govern the physical world. The traditional process of turning observations into physical understanding is slow. Imperfect models are constructed and tested to explain relationships in data. Powerful new algorithms can enable computers to learn physics by observing images and videos. Inspired by this idea, instead of training machine learning models using physical quantities, we used images, that is, pixel information. For this work, and as a proof of concept, the physics of interest are wind-driven spatial patterns. These phenomena include features in Aeolian dunes and volcanic ash deposition, wildfire smoke, and air pollution plumes. We use computer model simulations of spatial deposition patterns to approximate images from a hypothetical imaging device whose outputs are red, green, and blue (RGB) color images with channel values ranging from 0 to 255. In this paper, we explore deep convolutional neural network-based autoencoders to exploit relationships in wind-driven spatial patterns, which commonly occur in geosciences, and reduce their dimensionality. Reducing the data dimension size with an encoder enables training deep, fully connected neural network models linking geographic and meteorological scalar input quantities to the encoded space. Once this is achieved, full spatial patterns are reconstructed using the decoder. We demonstrate this approach on images of spatial deposition from a pollution source, where the encoder compresses the dimensionality to 0.02% of the original size, and the full predictive model performance on test data achieves a normalized root mean squared error of 8%, a figure of merit in space of 94% and a precision-recall area under the curve of 0.93.  ( 3 min )
    Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models. (arXiv:2212.10561v2 [cs.CL] UPDATED)
    Despite recent success in large language model (LLM) reasoning, LLMs struggle with hierarchical multi-step reasoning tasks like generating complex programs. For these tasks, humans often start with a high-level algorithmic design and implement each part gradually. We introduce Parsel, a framework enabling automatic implementation and validation of complex algorithms with code LLMs, taking hierarchical function descriptions in natural language as input. We show that Parsel can be used across domains requiring hierarchical reasoning, including program synthesis, robotic planning, and theorem proving. We show that LLMs generating Parsel solve more competition-level problems in the APPS dataset, resulting in pass rates that are over 75% higher than prior results from directly sampling AlphaCode and Codex, while often using a smaller sample budget. We also find that LLM-generated robotic plans using Parsel as an intermediate language are more than twice as likely to be considered accurate as directly generated plans. Lastly, we explore how Parsel addresses LLM limitations and discuss how Parsel may be useful for human programmers.  ( 2 min )
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Adaptive Gradient Methods with Local Guarantees. (arXiv:2203.01400v3 [cs.LG] UPDATED)
    Adaptive gradient methods are the method of choice for optimization in machine learning and are used to train the largest deep models. In this paper we study the problem of learning a local preconditioner that can change as the data changes along the optimization trajectory. We propose an adaptive gradient method that has provable adaptive regret guarantees vs. the best local preconditioner. To derive this guarantee, we prove a new adaptive regret bound in online learning that improves upon previous adaptive online learning methods. We demonstrate the robustness of our method in automatically choosing the optimal learning rate schedule for popular benchmarking tasks in vision and language domains. Without the need to manually tune a learning rate schedule, our method can, in a single run, achieve task accuracy that is comparable to, and as stable as, that of a fine-tuned optimizer.  ( 2 min )
    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow are that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input: there is neither a notion of an inverse of a neural network nor one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning. (arXiv:2301.10921v1 [cs.LG])
    The critical challenge of Semi-Supervised Learning (SSL) is how to effectively leverage the limited labeled data and massive unlabeled data to improve the model's generalization performance. In this paper, we first revisit the popular pseudo-labeling methods via a unified sample weighting formulation and demonstrate the inherent quantity-quality trade-off problem of pseudo-labeling with thresholding, which may prohibit learning. To this end, we propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training, effectively exploiting the unlabeled data. We derive a truncated Gaussian function to weight samples based on their confidence, which can be viewed as a soft version of the confidence threshold. We further enhance the utilization of weakly-learned classes by proposing a uniform alignment approach. In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.  ( 2 min )
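    The contrast between a hard confidence threshold and a soft, truncated Gaussian weighting can be sketched in a few lines. The functional form below is an assumption based on the abstract's description (full weight at or above a centre `mu`, Gaussian decay below it); the abstract does not say how `mu` and `sigma` are obtained, so they are plain arguments here, and the function names are hypothetical.

```python
import numpy as np

def hard_threshold_weight(confidence, tau):
    """Classic pseudo-labeling mask: weight 1 if confidence >= tau, else 0."""
    return (np.asarray(confidence, float) >= tau).astype(float)

def truncated_gaussian_weight(confidence, mu, sigma, max_weight=1.0):
    """Soft weight: max_weight at or above mu, Gaussian decay in (mu - confidence) below it."""
    confidence = np.asarray(confidence, float)
    gap = np.maximum(mu - confidence, 0.0)  # truncation: no penalty once confidence >= mu
    return max_weight * np.exp(-gap ** 2 / (2.0 * sigma ** 2))

# Hypothetical max-softmax confidences for a batch of unlabeled samples.
conf = np.array([0.30, 0.60, 0.80, 0.95])

hard = hard_threshold_weight(conf, tau=0.95)
soft = truncated_gaussian_weight(conf, mu=0.80, sigma=0.20)
```

    With the hard mask only the last sample contributes to the unlabeled loss, whereas the soft weights keep every sample in play while still down-weighting the uncertain ones, which is the quantity-quality trade-off the abstract refers to.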
    Explaining Visual Biases as Words by Generating Captions. (arXiv:2301.11104v1 [cs.LG])
    We aim to diagnose the potential biases in image classifiers. To this end, prior works manually labeled biased attributes or visualized biased features, which need high annotation costs or are often ambiguous to interpret. Instead, we leverage two types (generative and discriminative) of pre-trained vision-language models to describe the visual bias as a word. Specifically, we propose bias-to-text (B2T), which generates captions of the mispredicted images using a pre-trained captioning model to extract the common keywords that may describe visual biases. Then, we categorize the bias type as spurious correlation or majority bias by checking if it is specific or agnostic to the class, based on the similarity of class-wise mispredicted images and the keyword upon a pre-trained vision-language joint embedding space, e.g., CLIP. We demonstrate that the proposed simple and intuitive scheme can recover well-known gender and background biases, and discover novel ones in real-world datasets. Moreover, we utilize B2T to compare the classifiers using different architectures or training methods. Finally, we show that one can obtain debiased classifiers using the B2T bias keywords and CLIP, in both zero-shot and full-shot manners, without using any human annotation on the bias.  ( 2 min )
    On the Convergence of No-Regret Learning Dynamics in Time-Varying Games. (arXiv:2301.11241v1 [cs.LG])
    Most of the literature on learning in games has focused on the restrictive setting where the underlying repeated game does not change over time. Much less is known about the convergence of no-regret learning algorithms in dynamic multiagent settings. In this paper, we characterize the convergence of \emph{optimistic gradient descent (OGD)} in time-varying games by drawing a strong connection with \emph{dynamic regret}. Our framework yields sharp convergence bounds for the equilibrium gap of OGD in zero-sum games parameterized on the \emph{minimal} first-order variation of the Nash equilibria and the second-order variation of the payoff matrices, subsuming known results for static games. Furthermore, we establish improved \emph{second-order} variation bounds under strong convexity-concavity, as long as each game is repeated multiple times. Our results also apply to time-varying \emph{general-sum} multi-player games via a bilinear formulation of correlated equilibria, which has novel implications for meta-learning and for obtaining refined variation-dependent regret bounds, addressing questions left open in prior papers. Finally, we leverage our framework to also provide new insights on dynamic regret guarantees in static games.  ( 2 min )
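For readers unfamiliar with the optimistic update, the sketch below contrasts plain gradient descent-ascent with its optimistic variant on the static, unconstrained scalar game f(x, y) = x * y. This toy setting, the step size, and the function name are assumptions chosen for brevity. The sketch does not model the time-varying games or the bounds studied in the paper.

```python
def final_distance(optimistic, eta=0.1, steps=2000):
    """Distance to the equilibrium (0, 0) of min_x max_y x*y after `steps`."""
    x, y = 1.0, 1.0
    gx_prev, gy_prev = y, x  # treat the first gradient as its own prediction
    for _ in range(steps):
        gx, gy = y, x  # df/dx = y, df/dy = x
        if optimistic:
            # Optimistic step: use 2 * current gradient - previous gradient.
            x_new = x - eta * (2.0 * gx - gx_prev)
            y_new = y + eta * (2.0 * gy - gy_prev)
        else:
            x_new = x - eta * gx
            y_new = y + eta * gy
        gx_prev, gy_prev = gx, gy
        x, y = x_new, y_new
    return (x * x + y * y) ** 0.5
```

On this bilinear example the plain update spirals away from the equilibrium, while the optimistic update contracts toward it.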
    Box$^2$EL: Concept and Role Box Embeddings for the Description Logic EL++. (arXiv:2301.11118v1 [cs.AI])
    Representation learning in the form of semantic embeddings has been successfully applied to a variety of tasks in natural language processing and knowledge graphs. Recently, there has been growing interest in developing similar methods for learning embeddings of entire ontologies. We propose Box$^2$EL, a novel method for representation learning of ontologies in the Description Logic EL++, which represents both concepts and roles as boxes (i.e. axis-aligned hyperrectangles), such that the logical structure of the ontology is preserved. We theoretically prove the soundness of our model and conduct an extensive empirical evaluation, in which we achieve state-of-the-art results in subsumption prediction, link prediction, and deductive reasoning. As part of our evaluation, we introduce a novel benchmark for evaluating EL++ embedding models on predicting subsumptions involving both atomic and complex concepts.  ( 2 min )
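The geometric intuition behind box embeddings can be sketched with a containment test: if the box of one concept lies inside the box of another, the first is subsumed by the second. The box representation as a pair of corner lists and the example concepts are hypothetical. The paper's actual scoring functions, losses, and treatment of roles are not reproduced here.

```python
def box_contains(outer, inner):
    """True if axis-aligned box `inner` lies entirely inside `outer`.

    Each box is a (lower_corner, upper_corner) pair of equal-length lists.
    """
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    return all(ol <= il and ih <= oh
               for ol, oh, il, ih in zip(o_lo, o_hi, i_lo, i_hi))

# Hypothetical 2-D boxes for illustration only.
ANIMAL = ([0.0, 0.0], [10.0, 10.0])
DOG = ([2.0, 3.0], [4.0, 5.0])
PLANT = ([12.0, 0.0], [15.0, 4.0])
```

Here `box_contains(ANIMAL, DOG)` holds, which reads as "Dog is subsumed by Animal", while `box_contains(ANIMAL, PLANT)` does not.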
    Finding Regions of Counterfactual Explanations via Robust Optimization. (arXiv:2301.11113v1 [cs.LG])
    Counterfactual explanations play an important role in detecting bias and improving the explainability of data-driven classification models. A counterfactual explanation (CE) is a minimal perturbed data point for which the decision of the model changes. Most of the existing methods can only provide one CE, which may not be achievable for the user. In this work we derive an iterative method to calculate robust CEs, i.e. CEs that remain valid even after the features are slightly perturbed. To this end, our method provides a whole region of CEs allowing the user to choose a suitable recourse to obtain a desired outcome. We use algorithmic ideas from robust optimization and prove convergence results for the most common machine learning methods including logistic regression, decision trees, random forests, and neural networks. Our experiments show that our method can efficiently generate globally optimal robust CEs for a variety of common data sets and classification models.  ( 2 min )
    Neural Inverse Operators for Solving PDE Inverse Problems. (arXiv:2301.11167v1 [cs.LG])
    A large class of inverse problems for PDEs are only well-defined as mappings from operators to functions. Existing operator learning frameworks map functions to functions and need to be modified to learn inverse maps from data. We propose a novel architecture termed Neural Inverse Operators (NIOs) to solve these PDE inverse problems. Motivated by the underlying mathematical structure, NIO is based on a suitable composition of DeepONets and FNOs to approximate mappings from operators to functions. A variety of experiments are presented to demonstrate that NIOs significantly outperform baselines and solve PDE inverse problems robustly, accurately and are several orders of magnitude faster than existing direct and PDE-constrained optimization methods.  ( 2 min )
    On the Mathematics of Diffusion Models. (arXiv:2301.11108v1 [cs.LG])
    This paper attempts to present the stochastic differential equations of diffusion models in a manner that is accessible to a broad audience. The diffusion process is defined over a population density in R^d. Of particular interest is a population of images. In a diffusion model one first defines a diffusion process that takes a sample from the population and gradually adds noise until only noise remains. The fundamental idea is to sample from the population by a reverse-diffusion process mapping pure noise to a population sample. The diffusion process is defined independent of any ``interpretation'' but can be analyzed using the mathematics of variational auto-encoders (the ``VAE interpretation'') or the Fokker-Planck equation (the ``score-matching interpretation''). Both analyses yield reverse-diffusion methods involving the score function. The Fokker-Planck analysis yields a family of reverse-diffusion SDEs parameterized by any desired level of reverse-diffusion noise including zero (deterministic reverse-diffusion). The VAE analysis yields the reverse-diffusion SDE at the same noise level as the diffusion SDE. The VAE analysis also yields a useful expression for computing the population probabilities of a given point (image). This formula for the probability of a given point does not seem to follow naturally from the Fokker-Planck analysis. Much, but apparently not all, of the mathematics presented here can be found in the literature. Attributions are given at the end of the paper.  ( 2 min )
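The forward process described above, "gradually adds noise until only noise remains", can be sketched with a discrete, variance-preserving update on a single scalar. The discretization, the constant noise rate `beta`, and the function names are assumptions made for illustration. The paper works with continuous-time SDEs, and the reverse process (which needs the score function) is not shown.

```python
import math
import random

def forward_diffuse(x0, beta, steps, rng):
    """Apply x <- sqrt(1 - beta) * x + sqrt(beta) * noise for `steps` steps."""
    x = x0
    for _ in range(steps):
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

def remaining_signal(beta, steps):
    """Factor multiplying the original sample after `steps` steps."""
    return (1.0 - beta) ** (steps / 2.0)
```

With `beta = 0.05` and 200 steps, the remaining signal factor falls below 0.01, so samples started anywhere end up distributed close to a standard normal.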
    Causal Counterfactuals for Improving the Robustness of Reinforcement Learning. (arXiv:2211.05551v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is applied in a wide variety of fields. RL enables agents to learn tasks autonomously by interacting with the environment. The more critical the tasks are, the higher the demand for the robustness of the RL systems. Causal RL combines RL and causal inference to make RL more robust. Causal RL agents use a causal representation to capture the invariant causal mechanisms that can be transferred from one task to another. Currently, there is limited research in Causal RL, and existing solutions are usually not complete or feasible for real-world applications. In this work, we propose CausalCF, the first complete Causal RL solution incorporating ideas from Causal Curiosity and CoPhy. Causal Curiosity provides an approach for using interventions, and CoPhy is modified to enable the RL agent to perform counterfactuals. We apply CausalCF to complex robotic tasks and show that it improves the RL agent's robustness using a realistic simulation environment called CausalWorld.  ( 2 min )
    New Approach to Malware Detection Using Optimized Convolutional Neural Network. (arXiv:2301.11161v1 [cs.CR])
    Cyber-crimes have become a multi-billion-dollar industry in recent years. Most cybercrimes/attacks involve deploying some type of malware. Malware that viciously targets every industry, every sector, every enterprise and even individuals has shown its capabilities to take entire business organizations offline and cause significant financial damage in billions of dollars annually. Malware authors are constantly evolving in their attack strategies and sophistication and are developing malware that is difficult to detect and can lie dormant in the background for quite some time in order to evade security controls. Given the above argument, traditional approaches to malware detection are no longer effective. As a result, deep learning models have become an emerging trend to detect and classify malware. This paper proposes a new convolutional deep learning neural network to accurately and effectively detect malware with high precision. This paper is different from most other papers in the literature in that it uses an expert data science approach by developing a convolutional neural network from scratch to establish a baseline of the performance model first, explores and implements an improvement model from the baseline model, and finally evaluates the performance of the final model. The baseline model initially achieves a 98% accuracy rate, but after increasing the depth of the CNN model, its accuracy reaches 99.183, which outperforms most of the CNN models in the literature. Finally, to further solidify the effectiveness of this CNN model, we use the improved model to make predictions on new malware samples within our dataset.  ( 2 min )
    Planning Automated Driving with Accident Experience Referencing and Common-sense Inferencing. (arXiv:2301.10892v1 [cs.RO])
    Although a typical autopilot system far surpasses humans in terms of sensing accuracy, performance stability and response agility, such a system is still far behind humans in the wisdom of understanding an unfamiliar environment with creativity, adaptivity and resiliency. Current AD brains are basically expert systems featuring logical computations, which resemble the thinking flow of a left brain working at tactical level. A right brain is needed to bring the safety of automated driving vehicles to the next generation by making intuitive strategical judgements that can supervise the tactical action planning. In this work, we present the concept of an Automated Driving Strategical Brain (ADSB): a framework of a scene perception and scene safety evaluation system that works at a higher abstraction level, incorporating experience referencing, common-sense inferring and goal-and-value judging capabilities, to provide a contextual perspective for decision making within automated driving planning. The ADSB brain architecture is made up of the Experience Referencing Engine (ERE), the Common-sense Inferencing Engine (CIE) and the Goal and Value Keeper (GVK). 1,614,748 cases from the FARS/CRSS database of NHTSA in the period 1975 to 2018 are used for the training of the ERE model. The kernel of CIE is a trained model, COMET-BART by ATOMIC, which can be used to provide directional advice when tactical-level environmental perception conclusions are ambiguous; it can also use future scenario models to remind tactical-level decision systems to plan ahead of a perceived hazard scene. GVK can take in any additional expert-hand-written rules that are of qualitative nature. Moreover, we believe that with good scalability, the ADSB approach provides a potential solution to the problem of long-tail corner cases encountered in the validation of a rule-based planning algorithm.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    Efficient Hyperdimensional Computing. (arXiv:2301.10902v1 [cs.LG])
    Hyperdimensional computing (HDC) uses binary vectors of high dimensions to perform classification. Due to its simplicity and massive parallelism, HDC can be highly energy-efficient and well-suited for resource-constrained platforms. However, in trading off orthogonality with efficiency, hypervectors may use tens of thousands of dimensions. In this paper, we will examine the necessity for such high dimensions. In particular, we give a detailed theoretical analysis of the relationship among dimensions of hypervectors, accuracy, and orthogonality. The main conclusion of this study is that a much lower dimension, typically less than 100, can also achieve similar or even higher detecting accuracy compared with other state-of-the-art HDC models. Based on this insight, we propose a suite of novel techniques to build HDC models that use binary hypervectors of dimensions that are orders of magnitude smaller than those found in the state-of-the-art HDC models, yet yield equivalent or even improved accuracy and efficiency. For image classification, we achieved an HDC accuracy of 96.88\% with a dimension of only 32 on the MNIST dataset. We further explore our methods on more complex datasets like CIFAR-10 and show the limits of HDC computing.  ( 2 min )
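A minimal sketch of HDC-style classification with binary hypervectors follows: training vectors of each class are bundled by majority vote into a prototype, and a query is assigned to the prototype with the smallest Hamming distance. The function names and the tie-breaking rule are assumptions. How raw inputs are encoded into hypervectors, and the low-dimensional techniques the paper proposes, are not covered.

```python
def bundle(vectors):
    """Majority vote per dimension over equal-length 0/1 lists (ties -> 1)."""
    n = len(vectors)
    return [1 if 2 * sum(col) >= n else 0 for col in zip(*vectors)]

def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def classify(query, prototypes):
    """Return the label whose prototype is closest in Hamming distance."""
    return min(prototypes, key=lambda label: hamming(query, prototypes[label]))
```

Both bundling and distance computation reduce to bitwise counting, which is what makes the approach attractive on resource-constrained hardware.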
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that the ICK framework can be used to include prior information into neural networks in many applications.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery for SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Improving Graph Generation by Restricting Graph Bandwidth. (arXiv:2301.10857v1 [cs.LG])
    Deep graph generative modeling has proven capable of learning the distribution of complex, multi-scale structures characterizing real-world graphs. However, one of the main limitations of existing methods is their large output space, which limits generation scalability and hinders accurate modeling of the underlying distribution. To overcome these limitations, we propose a novel approach that significantly reduces the output space of existing graph generative models. Specifically, starting from the observation that many real-world graphs have low graph bandwidth, we restrict graph bandwidth during training and generation. Our strategy improves both generation scalability and quality without increasing architectural complexity or reducing expressiveness. Our approach is compatible with existing graph generative methods, and we describe its application to both autoregressive and one-shot models. We extensively validate our strategy on synthetic and real datasets, including molecular graphs. Our experiments show that, in addition to improving generation efficiency, our approach consistently improves generation quality and reconstruction accuracy. The implementation is made available.  ( 2 min )
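To make the notion of bandwidth concrete, the sketch below computes the bandwidth of a graph under one given node ordering: the largest distance in the ordering between the endpoints of any edge. The function name and the edge-list representation are assumptions. The graph bandwidth proper is the minimum of this quantity over all orderings, which the sketch does not search for.

```python
def bandwidth_under_ordering(edges, ordering):
    """Max |position(u) - position(v)| over all edges (u, v)."""
    position = {node: i for i, node in enumerate(ordering)}
    return max((abs(position[u] - position[v]) for u, v in edges), default=0)

# A path graph 0 - 1 - 2 - 3 as an edge list.
PATH_EDGES = [(0, 1), (1, 2), (2, 3)]
```

With the natural ordering `[0, 1, 2, 3]` the path has bandwidth 1, so every edge sits next to the diagonal of the adjacency matrix. A worse ordering such as `[0, 2, 1, 3]` gives 2. A low bandwidth means a generator only has to predict adjacency entries inside a narrow band, which is how the output space shrinks.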
    A Practical Influence Approximation for Privacy-Preserving Data Filtering in Federated Learning. (arXiv:2205.11518v2 [cs.CR] UPDATED)
    Federated Learning by nature is susceptible to low-quality, corrupted, or even malicious data that can severely degrade the quality of the learned model. Traditional techniques for data valuation cannot be applied as the data is never revealed. We present a novel technique for filtering and scoring data based on a practical influence approximation (`lazy' influence) that can be implemented in a privacy-preserving manner. Each participant uses their own data to evaluate the influence of another participant's batch, and reports to the center an obfuscated score using differential privacy. Our technique allows for highly effective filtering of corrupted data in a variety of applications. Importantly, we show that most of the corrupted data can be filtered out (recall of $>90\%$, and even up to $100\%$), even under strong privacy guarantees ($\varepsilon \leq 1$).  ( 2 min )
    Qualitative Analysis of a Graph Transformer Approach to Addressing Hate Speech: Adapting to Dynamically Changing Content. (arXiv:2301.10871v1 [cs.LG])
    Our work advances an approach for predicting hate speech in social media, drawing out the critical need to consider the discussions that follow a post to successfully detect when hateful discourse may arise. Using graph transformer networks, coupled with modelling attention and BERT-level natural language processing, our approach can capture context and anticipate upcoming anti-social behaviour. In this paper, we offer a detailed qualitative analysis of this solution for hate speech detection in social networks, leading to insights into where the method has the most impressive outcomes in comparison with competitors and identifying scenarios where there are challenges to achieving ideal performance. Included is an exploration of the kinds of posts that permeate social media today, including the use of hateful images. This suggests avenues for extending our model to be more comprehensive. A key insight is that the focus on reasoning about the concept of context positions us well to be able to support multi-modal analysis of online posts. We conclude with a reflection on how the problem we are addressing relates especially well to the theme of dynamic change, a critical concern for all AI solutions for social impact. We also comment briefly on how mental health well-being can be advanced with our work, through curated content attuned to the extent of hate in posts.  ( 2 min )
    Counterfactual Analysis in Dynamic Latent State Models. (arXiv:2205.13832v2 [cs.LG] UPDATED)
    We provide an optimization-based framework to perform counterfactual analysis in a dynamic model with hidden states. Our framework is grounded in the "abduction, action, and prediction" approach to answer counterfactual queries and handles two key challenges where (1) the states are hidden and (2) the model is dynamic. Recognizing the lack of knowledge on the underlying causal mechanism and the possibility of infinitely many such mechanisms, we optimize over this space and compute upper and lower bounds on the counterfactual quantity of interest. Our work brings together ideas from causality, state-space models, simulation, and optimization, and we apply it on a breast cancer case study. To the best of our knowledge, we are the first to compute lower and upper bounds on a counterfactual query in a dynamic latent-state model.  ( 2 min )
    Assistive Recipe Editing through Critiquing. (arXiv:2205.02454v2 [cs.CL] UPDATED)
    There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work's main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users' feedback. Experiments on the Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.  ( 2 min )
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose a novel approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" together with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one, SFD-TINF, uses a $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower-bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Stochastic Online Fisher Markets: Static Pricing Limits and Adaptive Enhancements. (arXiv:2205.00825v3 [cs.GT] UPDATED)
    In a Fisher market, agents (users) spend a budget of (artificial) currency to buy goods that maximize their utilities while a central planner sets prices on capacity-constrained goods such that the market clears. However, the efficacy of pricing schemes in achieving an equilibrium outcome in Fisher markets typically relies on complete knowledge of users' budgets and utilities and requires that transactions happen in a static market wherein all users are present simultaneously. As a result, we study an online variant of Fisher markets, wherein budget-constrained users with privately known utility and budget parameters, drawn i.i.d. from a distribution $\mathcal{D}$, enter the market sequentially. In this setting, we develop an algorithm that adjusts prices solely based on observations of user consumption, i.e., revealed preference feedback, and achieves a regret and capacity violation of $O(\sqrt{n})$, where $n$ is the number of users and the good capacities scale as $O(n)$. Here, our regret measure is the optimality gap in the objective of the Eisenberg-Gale program between an online algorithm and an offline oracle with complete information on users' budgets and utilities. To establish the efficacy of our approach, we show that any uniform (static) pricing algorithm, including one that sets expected equilibrium prices with complete knowledge of the distribution $\mathcal{D}$, cannot achieve both a regret and constraint violation of less than $\Omega(\sqrt{n})$. While our revealed preference algorithm requires no knowledge of the distribution $\mathcal{D}$, we show that if $\mathcal{D}$ is known, then an adaptive variant of expected equilibrium pricing achieves $O(\log(n))$ regret and constant capacity violation for discrete distributions. Finally, we present numerical experiments to demonstrate the performance of our revealed preference algorithm relative to several benchmarks.  ( 3 min )
    Revisiting the Adversarial Robustness-Accuracy Tradeoff in Robot Learning. (arXiv:2204.07373v2 [cs.RO] UPDATED)
    Adversarial training (i.e., training on adversarially perturbed input data) is a well-studied method for making neural networks robust to potential adversarial attacks during inference. However, the improved robustness does not come for free but rather is accompanied by a decrease in overall model accuracy and performance. Recent work has shown that, in practical robot learning applications, the effects of adversarial training do not pose a fair trade-off but inflict a net loss when measured in holistic robot performance. This work revisits the robustness-accuracy trade-off in robot learning by systematically analyzing whether recent advances in robust training methods and theory, in conjunction with adversarial robot learning, are capable of making adversarial training suitable for real-world robot applications. We evaluate three different robot learning tasks ranging from autonomous driving in a high-fidelity environment amenable to sim-to-real deployment to mobile robot navigation and gesture recognition. Our results demonstrate that, while these techniques make incremental improvements on the trade-off on a relative scale, the negative impact on the nominal accuracy caused by adversarial training still outweighs the improved robustness by an order of magnitude. We conclude that although progress is happening, further advances in robust learning methods are necessary before they can benefit robot learning tasks in practice.  ( 2 min )
    Privacy preserving n-party scalar product protocol. (arXiv:2112.09436v4 [cs.CR] UPDATED)
    Privacy-preserving machine learning enables the training of models on decentralized datasets without the need to reveal the data, on both horizontally and vertically partitioned data. However, it relies on specialized techniques and algorithms to perform the necessary computations. The privacy-preserving scalar product protocol, which enables the dot product of vectors without revealing them, is one popular example for its versatility. Unfortunately, the solutions currently proposed in the literature focus mainly on two-party scenarios, even though scenarios with a higher number of data parties are becoming more relevant, for example when performing analyses that require counting the number of samples which fulfill certain criteria defined across various sites, such as calculating the information gain at a node in a decision tree. In this paper we propose a generalization of the protocol for an arbitrary number of parties, based on an existing two-party method. Our proposed solution relies on a recursive resolution of smaller scalar products. After describing our proposed method, we discuss potential scalability issues. Finally, we describe the privacy guarantees and identify any concerns, as well as comparing the proposed method to the original solution in this aspect.  ( 2 min )
    A Light-weight Deep Human Activity Recognition Algorithm Using Multi-knowledge Distillation. (arXiv:2107.07331v4 [cs.LG] UPDATED)
    Inertial sensor-based human activity recognition (HAR) is the base of many human-centered mobile applications. Deep learning-based fine-grained HAR models enable accurate classification in various complex application scenarios. Nevertheless, the large storage and computational overhead of the existing fine-grained deep HAR models hinder their widespread deployment on resource-limited platforms. Inspired by knowledge distillation's capacity for model compression and potential performance improvement, we design a multi-level HAR modeling pipeline called Stage-Logits-Memory Distillation (SMLDist) based on the widely-used MobileNet. By paying more attention to the frequency-related features during the distillation process, SMLDist improves the HAR classification robustness of the students. We also propose an auto-search mechanism in the heterogeneous classifiers to improve classification performance. Extensive simulation results demonstrate that SMLDist outperforms various state-of-the-art HAR frameworks in accuracy and F1 macro score. The practical evaluation on the Jetson Xavier AGX platform shows that the SMLDist model is both energy-efficient and computation-efficient. These experiments validate the reasonable balance between the robustness and efficiency of the proposed model. The comparative experiments of knowledge distillation on six public datasets also demonstrate that SMLDist outperforms other advanced knowledge distillation methods in terms of student performance, which verifies the good generalization of SMLDist on other classification tasks, including but not limited to HAR.  ( 2 min )
    Textual Explanations and Critiques in Recommendation Systems. (arXiv:2205.07268v2 [cs.LG] UPDATED)
    Artificial intelligence and machine learning algorithms have become ubiquitous. Although they offer a wide range of benefits, their adoption in decision-critical fields is limited by their lack of interpretability, particularly with textual data. Moreover, with more data available than ever before, it has become increasingly important to explain automated predictions. Generally, users find it difficult to understand the underlying computational processes and interact with the models, especially when the models fail to generate the outcomes or explanations, or both, correctly. This problem highlights the growing need for users to better understand the models' inner workings and gain control over their actions. This dissertation focuses on two fundamental challenges of addressing this need. The first involves explanation generation: inferring high-quality explanations from text documents in a scalable and data-driven manner. The second challenge consists in making explanations actionable, and we refer to it as critiquing. This dissertation examines two important applications in natural language processing and recommendation tasks. Overall, we demonstrate that interpretability does not come at the cost of reduced performance in two consequential applications. Our framework is applicable to other fields as well. This dissertation presents an effective means of closing the gap between promise and practice in artificial intelligence.  ( 2 min )
    RoFL: Robustness of Secure Federated Learning. (arXiv:2107.03311v4 [cs.CR] UPDATED)
    Even though recent years have seen many attacks exposing severe vulnerabilities in Federated Learning (FL), a holistic understanding of what enables these attacks and how they can be mitigated effectively is still lacking. In this work, we demystify the inner workings of existing (targeted) attacks. We provide new insights into why these attacks are possible and why a definitive solution to FL robustness is challenging. We show that the need for ML algorithms to memorize tail data has significant implications for FL integrity. This phenomenon has largely been studied in the context of privacy; our analysis sheds light on its implications for ML integrity. We show that certain classes of severe attacks can be mitigated effectively by enforcing constraints such as norm bounds on clients' updates. We investigate how to efficiently incorporate these constraints into secure FL protocols in the single-server setting. Based on this, we propose RoFL, a new secure FL system that extends secure aggregation with privacy-preserving input validation. Specifically, RoFL can enforce constraints such as $L_2$ and $L_\infty$ bounds on high-dimensional encrypted model updates.  ( 2 min )
    Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. (arXiv:2203.11085v3 [cs.HC] UPDATED)
    Creating presentation slides is a critical but time-consuming task for data scientists. While researchers have proposed many AI techniques to lift data scientists' burden on data preparation and model selection, few have targeted the presentation creation task. Based on the needs identified from a formative study, this paper presents NB2Slides, an AI system that helps users compose presentations of their data science work. NB2Slides uses deep learning methods as well as example-based prompts to generate slides from computational notebooks, and takes users' input (e.g., audience background) to structure the slides. NB2Slides also provides an interactive visualization that links the slides with the notebook to help users further edit the slides. A follow-up user evaluation with 12 data scientists shows that participants believed NB2Slides can improve efficiency and reduce the complexity of creating slides. Yet, participants questioned the future of full automation and suggested a human-AI collaboration paradigm.  ( 2 min )
    Constant matters: Fine-grained Complexity of Differentially Private Continual Observation. (arXiv:2202.11205v5 [cs.DS] UPDATED)
    We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism when using lower-triangular matrices can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds of the completely bounded norm (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the bound by Mathias (SIAM Journal on Matrix Analysis and Applications, 1993), the best known for 28 years, on the cb-norm of $M_\mathsf{count}$ for a large range of the dimension of $M_\mathsf{count}$. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation. Subsequent to this work, Henzinger et al. (SODA 2023) showed that our factorization also achieves fine-grained mean-squared error.  ( 2 min )
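    As a rough illustration of a lower-triangular factorization of the counting matrix, the sketch below assumes the factors are both the matrix square root of $M_\mathsf{count}$; the abstract does not spell out its factorization, so this choice is an assumption. The square root is the lower-triangular Toeplitz matrix built from the Taylor coefficients of $(1-x)^{-1/2}$, namely $\binom{2k}{k}/4^k$. All function names are hypothetical, and the noise scale `sigma` is left uncalibrated, so the code carries no privacy guarantee as written.

```python
import random
from math import comb


def sqrt_coefficients(n):
    """First n Taylor coefficients of (1 - x)**(-1/2): C(2k, k) / 4**k."""
    return [comb(2 * k, k) / 4 ** k for k in range(n)]


def lower_toeplitz(coeffs):
    """Lower-triangular Toeplitz matrix whose first column is coeffs."""
    n = len(coeffs)
    return [[coeffs[i - j] if i >= j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matvec(a, x):
    """Plain matrix-vector product."""
    return [sum(r * v for r, v in zip(row, x)) for row in a]


def noisy_prefix_sums(stream, sigma, rng):
    """Matrix-mechanism shape: release L (R x + z) with L = R = sqrt(M_count).

    sigma is NOT calibrated to any privacy budget here (assumption: the
    caller supplies it); with sigma = 0 the output is the exact prefix sums.
    """
    factor = lower_toeplitz(sqrt_coefficients(len(stream)))
    hidden = matvec(factor, stream)
    noisy = [h + rng.gauss(0.0, sigma) for h in hidden]
    return matvec(factor, noisy)


# Check that the factor squared reproduces the all-ones lower-triangular
# counting matrix for a small dimension.
n = 6
factor = lower_toeplitz(sqrt_coefficients(n))
product = matmul(factor, factor)
counting = [[1.0 if i >= j else 0.0 for j in range(n)] for i in range(n)]
max_error = max(abs(product[i][j] - counting[i][j]) for i in range(n) for j in range(n))

exact = noisy_prefix_sums([1, 0, 1, 1, 0, 1], 0.0, random.Random(0))
```

    Because both factors are lower-triangular, the output at step t depends only on inputs up to step t, which is what the continual observation model needs.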
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly optimal performance in all three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Learning to Act Safely with Limited Exposure and Almost Sure Certainty. (arXiv:2105.08748v3 [eess.SY] UPDATED)
    This paper puts forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for an unbounded number of exploratory trials. This is indeed possible, provided that one is willing to navigate trade-offs between optimality, level of exposure to unsafe events, and the maximum detection time of unsafe actions. We illustrate this concept in two complementary settings. We first focus on the canonical multi-armed bandit problem and study the intrinsic trade-offs of learning safety in the presence of uncertainty. Under mild assumptions on sufficient exploration, we provide an algorithm that provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. We then consider the problem of finding optimal policies for a Markov Decision Process (MDP) with almost sure constraints. We show that the action-value function satisfies a barrier-based decomposition which allows for the identification of feasible policies independently of the reward process. Using this decomposition, we develop a Barrier-learning algorithm that identifies unsafe state-action pairs in a finite expected number of steps. Our analysis further highlights a trade-off between the time lag for the underlying MDP necessary to detect unsafe actions, and the level of exposure to unsafe events. Simulations corroborate our theoretical findings, further illustrating the aforementioned trade-offs, and suggesting that safety constraints can speed up the learning process.  ( 2 min )
    Joint Training of Deep Ensembles Fails Due to Learner Collusion. (arXiv:2301.11323v1 [cs.LG])
    Ensembles of machine learning models have been well established as a powerful method of improving performance over a single model. Traditionally, ensembling algorithms train their base learners independently or sequentially with the goal of optimizing their joint performance. In the case of deep ensembles of neural networks, we are provided with the opportunity to directly optimize the true objective: the joint performance of the ensemble as a whole. Surprisingly, however, directly minimizing the loss of the ensemble appears to rarely be applied in practice. Instead, most previous research trains individual models independently with ensembling performed post hoc. In this work, we show that this is for good reason - joint optimization of ensemble loss results in degenerate behavior. We approach this problem by decomposing the ensemble objective into the strength of the base learners and the diversity between them. We discover that joint optimization results in a phenomenon in which base learners collude to artificially inflate their apparent diversity. This pseudo-diversity fails to generalize beyond the training data, causing a larger generalization gap. We proceed to demonstrate the practical implications of this effect finding that, in some cases, a balance between independent training and joint optimization can improve performance over the former while avoiding the degeneracies of the latter.  ( 2 min )
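    The strength and diversity split mentioned above can be illustrated with the classical ambiguity decomposition for squared error and an averaging ensemble: the ensemble loss equals the average individual loss minus the average squared deviation of the members from the ensemble mean. This is one well-known instance and not necessarily the decomposition used in the paper; the function name and the sample numbers are hypothetical.

```python
def ensemble_decomposition(predictions, target):
    """Return (ensemble_loss, strength, diversity) for squared error.

    strength  = average squared error of the individual learners
    diversity = average squared deviation of learners from the ensemble mean
    Identity:   ensemble_loss == strength - diversity
    """
    m = len(predictions)
    ensemble = sum(predictions) / m
    ensemble_loss = (ensemble - target) ** 2
    strength = sum((p - target) ** 2 for p in predictions) / m
    diversity = sum((p - ensemble) ** 2 for p in predictions) / m
    return ensemble_loss, strength, diversity


# Three hypothetical base-learner predictions for a single target value.
loss, strength, diversity = ensemble_decomposition([0.9, 1.4, 0.4], target=1.0)
gap = abs(loss - (strength - diversity))
```

    The identity shows why jointly minimizing the ensemble loss can be satisfied by inflating the diversity term rather than by improving the individual learners.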
    A Context-based Multi-task Hierarchical Inverse Reinforcement Learning Algorithm. (arXiv:2210.01969v2 [cs.LG] UPDATED)
    Multi-task Imitation Learning (MIL) aims to train a policy capable of performing a distribution of tasks, which is essential for general-purpose robots, based on multi-task expert demonstrations. Existing MIL algorithms suffer from low data efficiency and poor performance on complex long-horizon tasks. We develop Multi-task Hierarchical Adversarial Inverse Reinforcement Learning (MH-AIRL) to learn hierarchically-structured multi-task policies, which is more beneficial for compositional tasks with long horizons and has higher expert data efficiency through identifying and transferring reusable basic skills across tasks. To realize this, MH-AIRL effectively synthesizes context-based multi-task learning, AIRL (an IL approach), and hierarchical policy learning. Further, MH-AIRL can be applied to demonstrations without task or skill annotations (i.e., state-action pairs only), which are more accessible in practice. Theoretical justifications are provided for each module of MH-AIRL, and evaluations on challenging multi-task settings demonstrate superior performance and transferability of the multi-task policies learned with MH-AIRL as compared to SOTA MIL baselines.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning-based estimator of the data manifold dimension, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Recursive deep learning framework for forecasting the decadal world economic outlook. (arXiv:2301.10874v1 [cs.LG])
    Gross domestic product (GDP) is the most widely used indicator in macroeconomics and the main tool for measuring a country's economic output. Due to the diversity and complexity of the world economy, a wide range of models have been used, but there are challenges in making decadal GDP forecasts given unexpected changes such as pandemics and wars. Deep learning models are well suited for modeling temporal sequences and have been applied to time series forecasting. In this paper, we develop a deep learning framework to forecast the GDP growth rate of the world economy over a decade. We use the Penn World Table as the source of our data, taking data from 1980 to 2019 across 13 countries, including Australia, China, India, and the United States. We test multiple deep learning models (LSTM, BD-LSTM, ED-LSTM, and CNN) and compare their results with traditional time series models (ARIMA, VAR). Our results indicate that ED-LSTM is the best performing model. We present a recursive deep learning framework to predict the GDP growth rate in the next ten years. We predict that most countries will experience economic growth slowdown, stagnation or even recession within five years; only China, France and India are predicted to experience stable, or increasing, GDP growth.  ( 2 min )
    Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection. (arXiv:2211.09858v2 [cs.SD] UPDATED)
    Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.  ( 2 min )
    Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild. (arXiv:2208.07522v5 [cs.LG] UPDATED)
    Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.  ( 2 min )
    Iterative Teaching by Label Synthesis. (arXiv:2110.14432v5 [cs.LG] UPDATED)
    In this paper, we consider the problem of iterative machine teaching, where a teacher provides examples sequentially based on the current iterative learner. In contrast to previous methods that have to scan over the entire pool and select teaching examples from it in each iteration, we propose a label synthesis teaching framework where the teacher randomly selects input teaching examples (e.g., images) and then synthesizes suitable outputs (e.g., labels) for them. We show that this framework can avoid costly example selection while still provably achieving exponential teachability. We propose multiple novel teaching algorithms in this framework. Finally, we empirically demonstrate the value of our framework.  ( 2 min )
    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media. (arXiv:2208.09021v3 [cs.CV] UPDATED)
    We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, achieved by encoding images using a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, therefore lacking linguistic diversity. So, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift from captioning language data, as well as diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight and novelty of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT can yield relative improvements up to 20% over ViLT and achieve state-of-the-art or comparable performance on VL tasks involving richer language inputs and affective constructs, such as for Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Characterizing the Influence of Graph Elements. (arXiv:2210.07441v2 [cs.LG] UPDATED)
    The influence function, a method from robust statistics, measures the changes in model parameters, or in some functions of the model parameters, caused by the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on the influence functions of GCNs to shed light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we started with the simple graph convolution (SGC) model that operates on an attributed graph and formulated an influence function to approximate the changes in model parameters when a node or an edge is removed from an attributed graph. Moreover, we theoretically analyzed the error bound of the estimated influence of removing an edge. We experimentally validated the accuracy and effectiveness of our influence estimation function. In addition, we showed that the influence function of an SGC model could be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrated how to use influence functions to guide the adversarial attacks on GCNs effectively.  ( 2 min )
    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. (arXiv:2210.13382v3 [cs.LG] UPDATED)
    Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.  ( 2 min )
    Review of Natural Language Processing in Pharmacology. (arXiv:2208.10228v2 [cs.CL] UPDATED)
    Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adverse drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.  ( 2 min )
    Cut and Learn for Unsupervised Object Detection and Instance Segmentation. (arXiv:2301.11320v1 [cs.CV])
    We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks using our robust loss function. We further improve the performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP50 by over 2.7 times on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% APbox and 6.6% APmask on COCO when training with 5% labels.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different "pieces," learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Certified Interpretability Robustness for Class Activation Mapping. (arXiv:2301.11324v1 [cs.LG])
    Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep learning based perception models in autonomous vehicles, accurately interpreting their predictions is crucial. While a variety of such methods have been proposed, most are shown to lack robustness. Yet, little has been done to provide certificates for interpretability robustness. Taking a step in this direction, we present CORGI, short for Certifiably prOvable Robustness Guarantees for Interpretability mapping. CORGI is an algorithm that takes in an input image and gives a certifiable lower bound for the robustness of the top k pixels of its CAM interpretability map. We show the effectiveness of CORGI via a case study on traffic sign data, certifying lower bounds on the minimum adversarial perturbation not far from (4-5x) state-of-the-art attack methods.  ( 2 min )
    Predictive Crypto-Asset Automated Market Making Architecture for Decentralized Finance using Deep Reinforcement Learning. (arXiv:2211.01346v2 [q-fin.TR] UPDATED)
    The study proposes a quote-driven predictive automated market maker (AMM) platform with on-chain custody and settlement functions, alongside off-chain predictive reinforcement learning capabilities to improve liquidity provision of real-world AMMs. The proposed AMM architecture is an augmentation of Uniswap V3, a cryptocurrency AMM protocol, utilizing a novel market equilibrium pricing for reduced divergence and slippage loss. Further, the proposed architecture involves a predictive AMM capability, utilizing a deep hybrid Long Short-Term Memory (LSTM) and Q-learning reinforcement learning framework that looks to improve market efficiency through better forecasts of liquidity concentration ranges, so that liquidity starts moving to expected concentration ranges prior to asset price movement and liquidity utilization is improved. The augmented protocol framework is expected to have practical real-world implications by (i) reducing divergence loss for liquidity providers, (ii) reducing slippage for crypto-asset traders, and (iii) improving capital efficiency for liquidity provision for the AMM protocol. To the best of our knowledge, no known protocol or literature proposes a similar deep learning-augmented AMM that achieves similar capital efficiency and loss minimization objectives for practical real-world applications.  ( 2 min )
    What you need to know to train recurrent neural networks to make Flip Flops memories and more. (arXiv:2010.07858v3 [cs.LG] UPDATED)
    Training neural networks to perform different tasks is relevant across various disciplines beyond Machine Learning. In particular, Recurrent Neural Networks (RNNs) are of great interest to different scientific communities. Open-source frameworks dedicated to Machine Learning, such as TensorFlow [1] and Keras [2], have produced significant changes in the development of technologies that we currently use. One relevant problem that can be approached with them is how to build the models to study dynamical systems and the brain and, specifically, how to extract the relevant information to answer the scientific questions of interest. The purpose of the present work is to contribute to this aim by analyzing a temporal processing task, in this case, a 3-bit Flip Flop memory. The modelling procedure in every step is shown: from equations to the software development. The networks obtained were analyzed to describe the dynamics and to show different visualization and analysis tools. The code developed in this primer is also provided to be used for modelling other tasks or systems.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Trajectory-Aware Eligibility Traces for Off-Policy Reinforcement Learning. (arXiv:2301.11321v1 [cs.LG])
    Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.  ( 2 min )
    Anatomy-aware and acquisition-agnostic joint registration with SynthMorph. (arXiv:2301.11329v1 [eess.IV])
    Affine image registration is a cornerstone of medical-image processing and analysis. While classical algorithms can achieve excellent accuracy, they solve a time-consuming optimization for every new image pair. Deep-learning (DL) methods learn a function that maps an image pair to an output transform. Evaluating the functions is fast, but capturing large transforms can be challenging, and networks tend to struggle if a test-image characteristic shifts from the training domain, such as the contrast or resolution. A majority of affine methods are also agnostic to the anatomy the user wishes to align; the registration will be inaccurate if algorithms consider all structures in the image. We address these shortcomings with a fast, robust, and easy-to-use DL tool for affine and deformable registration of any brain image without preprocessing, right off the MRI scanner. First, we rigorously analyze how competing architectures learn affine transforms across a diverse set of neuroimaging data, aiming to truly capture the behavior of methods in the real world. Second, we leverage a recent strategy to train networks with wildly varying images synthesized from label maps, yielding robust performance across acquisition specifics. Third, we optimize the spatial overlap of select anatomical labels, which enables networks to distinguish between anatomy of interest and irrelevant structures, removing the need for preprocessing that excludes content that would otherwise reduce the accuracy of anatomy-specific registration. We combine the affine model with prior work on deformable registration and test brain-specific registration across a landscape of MRI protocols unseen at training, demonstrating consistent and improved accuracy compared to existing tools. We distribute our code and tool at https://w3id.org/synthmorph, providing a single complete end-to-end solution for registration of brain MRI.  ( 3 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    ZiCo: Zero-shot NAS via Inverse Coefficient of Variation on Gradients. (arXiv:2301.11300v1 [cs.LG])
    Neural Architecture Search (NAS) is widely used to automatically design the neural network with the best performance among a large number of candidate architectures. To reduce the search time, zero-shot NAS aims at designing training-free proxies that can predict the test performance of a given architecture. However, as shown recently, none of the zero-shot proxies proposed to date can actually work consistently better than a naive proxy, namely, the number of network parameters (#Params). To improve this state of affairs, as the main theoretical contribution, we first reveal how some specific gradient properties across different samples impact the convergence rate and generalization capacity of neural networks. Based on this theoretical analysis, we propose a new zero-shot proxy, ZiCo, the first proxy that works consistently better than #Params. We demonstrate that ZiCo works better than State-Of-The-Art (SOTA) proxies on several popular NAS-Benchmarks (NASBench101, NATSBench-SSS/TSS, TransNASBench-101) for multiple applications (e.g., image classification/reconstruction and pixel-level prediction). Finally, we demonstrate that the optimal architectures found via ZiCo are as competitive as the ones found by one-shot and multi-shot NAS methods, but with much less search time. For example, ZiCo-based NAS can find optimal architectures with 78.1%, 79.4%, and 80.4% test accuracy under inference budgets of 450M, 600M, and 1000M FLOPs on ImageNet within 0.4 GPU days.  ( 2 min )
    Open Problems in Applied Deep Learning. (arXiv:2301.11316v1 [cs.LG])
    This work formulates the machine learning mechanism as a bi-level optimization problem. The inner level optimization loop entails minimizing a properly chosen loss function evaluated on the training data. This is nothing but the well-studied training process in pursuit of optimal model parameters. The outer level optimization loop is less well-studied and involves maximizing a properly chosen performance metric evaluated on the validation data. This is what we call the "iteration process", pursuing optimal model hyper-parameters. Among many other degrees of freedom, this process entails model engineering (e.g., neural network architecture design) and management, experiment tracking, dataset versioning and augmentation. The iteration process could be automated via Automatic Machine Learning (AutoML) or left to the intuitions of machine learning students, engineers, and researchers. Regardless of the route we take, there is a need to reduce the computational cost of the iteration step and as a direct consequence reduce the carbon footprint of developing artificial intelligence algorithms. Despite the clean and unified mathematical formulation of the iteration step as a bi-level optimization problem, its solutions are case specific and complex. This work will consider such cases while increasing the level of complexity from supervised learning to semi-supervised, self-supervised, unsupervised, few-shot, federated, reinforcement, and physics-informed learning. As a consequence of this exercise, this proposal surfaces a plethora of open problems in the field, many of which can be addressed in parallel.  ( 2 min )
    BayesSpeech: A Bayesian Transformer Network for Automatic Speech Recognition. (arXiv:2301.11276v1 [eess.AS])
    Recent developments using End-to-End Deep Learning models have been shown to have near or better performance than state-of-the-art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks. These models tend to be lighter weight and require less training time than traditional RNN-based approaches. However, these models take a frequentist approach to weight training. In theory, network weights are drawn from a latent, intractable probability distribution. We introduce BayesSpeech for end-to-end Automatic Speech Recognition. BayesSpeech is a Bayesian Transformer Network where these intractable posteriors are learned through variational inference and the local reparameterization trick without recurrence. We show how the introduction of variance in the weights leads to faster training time and near state-of-the-art performance on LibriSpeech-960.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
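    To make the pairwise setting concrete, the following sketch fits a linear reward by maximum likelihood under the Bradley-Terry-Luce model, where the probability that item a is preferred over item b is sigmoid(theta . (phi(a) - phi(b))). It is a plain gradient-ascent illustration under assumed names (`fit_btl_mle`, `diffs`, `labels`) and assumed step-size settings; it is not the paper's estimator and does not include the pessimistic variant.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_btl_mle(diffs, labels, lr=0.5, n_iters=300):
    """Gradient ascent on the average BTL log-likelihood for a linear reward.

    diffs[i]  = phi(a_i) - phi(b_i), the feature difference of comparison i.
    labels[i] = 1 if a_i was preferred over b_i, else 0.
    Returns the estimated parameter vector theta.
    """
    d = len(diffs[0])
    n = len(diffs)
    theta = [0.0] * d
    for _ in range(n_iters):
        grad = [0.0] * d
        for x, y in zip(diffs, labels):
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j in range(d):
                grad[j] += (y - p) * x[j]  # gradient of the log-likelihood
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    true_theta = [1.0, -1.0]  # hypothetical ground-truth reward weights
    diffs, labels = [], []
    for _ in range(1000):
        x = [rng.gauss(0, 1), rng.gauss(0, 1)]
        p = sigmoid(sum(t * xi for t, xi in zip(true_theta, x)))
        diffs.append(x)
        labels.append(1 if rng.random() < p else 0)
    print(fit_btl_mle(diffs, labels))
```

    Only feature differences enter the likelihood, so the reward is identified up to an additive constant; this is why the sketch works with `diffs` directly.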
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Understanding Finetuning for Factual Knowledge Extraction from Language Models. (arXiv:2301.11293v1 [cs.CL])
    Language models (LMs) pretrained on large corpora of text from the web have been observed to contain large amounts of various types of knowledge about the world. This observation has led to a new and exciting paradigm in knowledge graph construction where, instead of manual curation or text mining, one extracts knowledge from the parameters of an LM. Recently, it has been shown that finetuning LMs on a set of factual knowledge makes them produce better answers to queries from a different set, thus making finetuned LMs a good candidate for knowledge extraction and, consequently, knowledge graph construction. In this paper, we analyze finetuned LMs for factual knowledge extraction. We show that along with its previously known positive effects, finetuning also leads to a (potentially harmful) phenomenon which we call Frequency Shock, where at the test time the model over-predicts rare entities that appear in the training set and under-predicts common entities that do not appear in the training set enough times. We show that Frequency Shock leads to a degradation in the predictions of the model and beyond a point, the harm from Frequency Shock can even outweigh the positive effects of finetuning, making finetuning harmful overall. We then consider two solutions to remedy the identified negative effect: 1- model mixing and 2- mixture finetuning with the LM's pre-training task. The two solutions combined lead to significant improvements compared to vanilla finetuning.  ( 2 min )
    Text-To-4D Dynamic Scene Generation. (arXiv:2301.11280v1 [cs.CV])
    We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.  ( 2 min )
    Classification of vertices on social networks by multiple approaches. (arXiv:2301.11288v1 [cs.SI])
    Due to the advent of data representations other than tabular formats, the topological structures that make samples interrelated have come into prominence. By analogy, such networks can be interpreted as social connections, dataflow maps, citation influence graphs, protein bindings, etc. In the case of social networks, however, it is crucial to evaluate the labels of discrete communities. The motivation for such a study is the importance of analyzing graph networks to partition the vertices using solely the topological features of the network graphs. For these interaction-based entities, a social graph, a mailing dataset, and two citation sets are selected as the testbench repositories. This paper not only assesses the most valuable method but also determines how graph neural networks work and where they need to improve against non-neural-network approaches, which are faster and computationally cost-effective. This paper also shows a limit to be exceeded by prospective graph neural network variations that use the topological features of the networks trialed.  ( 2 min )
    Gaussian process regression and conditional Karhunen-Lo\'{e}ve models for data assimilation in inverse problems. (arXiv:2301.11279v1 [cs.LG])
    We present a model inversion algorithm, CKLEMAP, for data assimilation and parameter estimation in partial differential equation models of physical systems with spatially heterogeneous parameter fields. These fields are approximated using low-dimensional conditional Karhunen-Lo\'{e}ve expansions, which are constructed using Gaussian process regression models of these fields trained on the parameters' measurements. We then assimilate measurements of the state of the system and compute the maximum a posteriori estimate of the CKLE coefficients by solving a nonlinear least-squares problem. When solving this optimization problem, we efficiently compute the Jacobian of the vector objective by exploiting the sparsity structure of the linear system of equations associated with the forward solution of the physics problem. The CKLEMAP method provides better scalability compared to the standard MAP method. In the MAP method, the number of unknowns to be estimated is equal to the number of elements in the numerical forward model. On the other hand, in CKLEMAP, the number of unknowns (CKLE coefficients) is controlled by the smoothness of the parameter field and the number of measurements, and is in general much smaller than the number of discretization nodes, which leads to a significant reduction of computational cost with respect to the standard MAP method. To show its advantage in scalability, we apply CKLEMAP to estimate the transmissivity field in a two-dimensional steady-state subsurface flow model of the Hanford Site by assimilating synthetic measurements of transmissivity and hydraulic head. We find that the execution time of CKLEMAP scales nearly linearly as $N^{1.33}$, where $N$ is the number of discretization nodes, while the execution time of standard MAP scales as $N^{2.91}$. The CKLEMAP method improved execution time without sacrificing accuracy when compared to the standard MAP.  ( 3 min )
    Real-Time Digital Twins: Vision and Research Directions for 6G and Beyond. (arXiv:2301.11283v1 [eess.SP])
    This article presents a vision where \textit{real-time} digital twins of the physical wireless environments are continuously updated using multi-modal sensing data from the distributed infrastructure and user devices, and are used to make communication and sensing decisions. This vision is mainly enabled by the advances in precise 3D maps, multi-modal sensing, ray-tracing computations, and machine/deep learning. This article details this vision, explains the different approaches for constructing and utilizing these real-time digital twins, discusses the applications and open problems, and presents a research platform that can be used to investigate various digital twin research directions.  ( 2 min )
    Online Convex Optimization with Stochastic Constraints: Zero Constraint Violation and Bandit Feedback. (arXiv:2301.11267v1 [math.OC])
    This paper studies online convex optimization with stochastic constraints. We propose a variant of the drift-plus-penalty algorithm that guarantees $O(\sqrt{T})$ expected regret and zero constraint violation after a fixed number of iterations, which improves on the vanilla drift-plus-penalty method with $O(\sqrt{T})$ constraint violation. Our algorithm is oblivious to the length of the time horizon $T$, in contrast to the vanilla drift-plus-penalty method. This is based on our novel drift lemma that provides time-varying bounds on the virtual queue drift and, as a result, leads to time-varying bounds on the expected virtual queue length. Moreover, we extend our framework to stochastic-constrained online convex optimization under two-point bandit feedback. We show that by adapting our algorithmic framework to the bandit feedback setting, we may still achieve $O(\sqrt{T})$ expected regret and zero constraint violation, improving upon the previous work for the case of identical constraint functions. Numerical results support our theoretical results.  ( 2 min )
    AlignGraph: A Group of Generative Models for Graphs. (arXiv:2301.11273v1 [cs.SI])
    It is challenging for generative models to learn a distribution over graphs because of the lack of permutation invariance: nodes may be ordered arbitrarily across graphs, and standard graph alignment is combinatorial and notoriously expensive. We propose AlignGraph, a group of generative models that combine fast and efficient graph alignment methods with a family of deep generative models that are invariant to node permutations. Our experiments demonstrate that our framework successfully learns graph distributions, outperforming competitors by 25%-560% in relevant performance scores.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called \textit{maximum optimality margin} which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    A Benchmark Study by using various Machine Learning Models for Predicting Covid-19 trends. (arXiv:2301.11257v1 [cs.LG])
    Machine learning and deep learning play vital roles in predicting diseases in the medical field. Machine learning algorithms are broadly classified as supervised, unsupervised, and reinforcement learning. This paper contains a detailed description of our experimental research work, in which we used supervised machine-learning algorithms to build our model for outbreaks of the novel Coronavirus, which has spread over the whole world and caused many deaths, making it one of the most disastrous pandemics in the history of the world. People suffered physically and economically to survive the lockdown. This work aims to better understand how machine learning, ensemble, and deep learning models work and are implemented on a real dataset. In our work, we analyze the current trend or pattern of the coronavirus and then predict future Covid-19 confirmed cases or new cases by training on the past Covid-19 dataset. Machine learning algorithms such as Linear Regression, Polynomial Regression, K-nearest neighbor, Decision Tree, Support Vector Machine, and Random Forest are used to train the model. The Decision Tree and Random Forest algorithms perform better than SVR in this work. The performance of SVR and lasso regression is low in all prediction areas because it is challenging for SVR to separate the data using a hyperplane for this type of problem, so SVR mostly gives lower performance on this problem. Ensemble (Voting, Bagging, and Stacking) and deep learning (ANN) models also predict well. After the prediction, we evaluated the model using MAE, MSE, RMSE, and MAPE. This work aims to find the trend/pattern of Covid-19.  ( 3 min )
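    For readers unfamiliar with the four evaluation metrics named above (MAE, MSE, RMSE, and MAPE), the sketch below computes them from their standard definitions in plain Python. The function name `regression_metrics` and the sample case counts are made up for illustration and are not from the paper; MAPE is reported in percent and assumes no true value is zero.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and MAPE (in percent) for two equal-length lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error
    # Mean absolute percentage error; undefined when any true value is 0.
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


if __name__ == "__main__":
    # Hypothetical daily case counts and model forecasts.
    observed = [100.0, 200.0, 400.0]
    forecast = [110.0, 180.0, 400.0]
    print(regression_metrics(observed, forecast))
```

    MAE and RMSE are in the units of the target (cases), while MAPE is scale-free, which makes it easier to compare across regions with very different case counts.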
    Molecular Language Model as Multi-task Generator. (arXiv:2301.11259v1 [cs.LG])
    Molecule generation with desired properties has grown immensely in popularity by disruptively changing the way scientists design molecular structures and providing support for chemical and materials design. However, despite the promising outcome, previous machine learning-based deep generative models suffer from a reliance on complex, task-specific fine-tuning, limited dimensional latent spaces, or the quality of expert rules. In this work, we propose MolGen, a pre-trained molecular language model that effectively learns and shares knowledge across multiple generation tasks and domains. Specifically, we pre-train MolGen with the chemical language SELFIES on more than 100 million unlabelled molecules. We further propose multi-task molecular prefix tuning across several molecular generation tasks and different molecular domains (synthetic & natural products) with a self-feedback mechanism. Extensive experiments show that MolGen can obtain superior performance on well-known molecular generation benchmark datasets. Further analysis illustrates that MolGen can accurately capture the distribution of molecules, implicitly learn their structural characteristics, and efficiently explore the chemical space with the guidance of multi-task molecular prefix tuning. Code, datasets, and the pre-trained model will be available at https://github.com/zjunlp/MolGen.  ( 2 min )
    BiBench: Benchmarking and Analyzing Network Binarization. (arXiv:2301.11233v1 [cs.CV])
    Network binarization emerges as one of the most promising compression approaches offering extraordinary computation and memory savings by minimizing the bit-width. However, recent research has shown that applying existing binarization algorithms to diverse tasks, architectures, and hardware in realistic scenarios is still not straightforward. Common challenges of binarization, such as accuracy degradation and efficiency limitation, suggest that its attributes are not fully understood. To close this gap, we present BiBench, a rigorously designed benchmark with in-depth analysis for network binarization. We first carefully scrutinize the requirements of binarization in the actual production and define evaluation tracks and metrics for a comprehensive and fair investigation. Then, we evaluate and analyze a series of milestone binarization algorithms that function at the operator level and with extensive influence. Our benchmark reveals that 1) the binarized operator has a crucial impact on the performance and deployability of binarized networks; 2) the accuracy of binarization varies significantly across different learning tasks and neural architectures; 3) binarization has demonstrated promising efficiency potential on edge devices despite the limited hardware support. The results and analysis also lead to a promising paradigm for accurate and efficient binarization. We believe that BiBench will contribute to the broader adoption of binarization and serve as a foundation for future research.  ( 2 min )
    Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data. (arXiv:2301.11174v1 [cs.CV])
    We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models. Constructing a large-scale labeled image captioning dataset is an expensive task in terms of labor, time, and cost. In contrast to manually annotating all the training samples, separately collecting uni-modal datasets is immensely easier, e.g., a large-scale image dataset and a sentence dataset. We leverage such massive unpaired image and caption data on top of standard paired data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples in an adversarial learning fashion, where the joint distribution of image and caption is learned. Our method trains a captioner to learn from paired data and to progressively associate unpaired data. This approach shows noticeable performance improvement even in challenging scenarios including out-of-task data (i.e., relational captioning, where the target task is different from the unpaired data) and web-crawled data. We also show that our proposed method is theoretically well-motivated and has a favorable global optimal property. Our extensive and comprehensive empirical results both on (1) image-based and (2) dense region-based captioning datasets, followed by comprehensive analysis on the scarcely-paired COCO dataset, demonstrate the consistent effectiveness of our semi-supervised learning method with unpaired data compared to competing methods.  ( 2 min )
    Causal Graph Discovery from Self and Mutually Exciting Time Series. (arXiv:2301.11197v1 [cs.LG])
    We present a generalized linear structural causal model, coupled with a novel data-adaptive linear regularization, to recover causal directed acyclic graphs (DAGs) from time series. By leveraging a recently developed stochastic monotone Variational Inequality (VI) formulation, we cast the causal discovery problem as a general convex optimization. Furthermore, we develop a non-asymptotic recovery guarantee and quantifiable uncertainty by solving a linear program to establish confidence intervals for a wide range of non-linear monotone link functions. We validate our theoretical results and show the competitive performance of our method via extensive numerical experiments. Most importantly, we demonstrate the effectiveness of our approach in recovering highly interpretable causal DAGs over Sepsis Associated Derangements (SADs) while achieving comparable prediction performance to powerful ``black-box'' models such as XGBoost. Thus, the future adoption of our proposed method to conduct continuous surveillance of high-risk patients by clinicians is much more likely.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    A Graph Neural Network with Negative Message Passing for Graph Coloring. (arXiv:2301.11164v1 [cs.LG])
    Graph neural networks have received increased attention over the past years due to their promising ability to handle graph-structured data, which can be found in many real-world problems such as recommender systems and drug synthesis. Most existing research focuses on using graph neural networks to solve homophilous problems, but little attention has been paid to heterophily-type problems. In this paper, we propose a graph network model for graph coloring, which is a class of representative heterophilous problems. Different from the conventional graph networks, we introduce negative message passing into the proposed graph neural network for more effective information exchange in handling graph coloring problems. Moreover, a new loss function taking into account the self-information of the nodes is suggested to accelerate the learning process. Experimental studies are carried out to compare the proposed graph model with five state-of-the-art algorithms on ten publicly available graph coloring problems and one real-world application. Numerical results demonstrate the effectiveness of the proposed graph neural network.  ( 2 min )
    Deep Laplacian-based Options for Temporally-Extended Exploration. (arXiv:2301.11181v1 [cs.LG])
    Selecting exploratory actions that generate a rich stream of experience for better learning is a fundamental challenge in reinforcement learning (RL). An approach to tackle this problem consists in selecting actions according to specific policies for an extended period of time, also known as options. A recent line of work to derive such exploratory options builds upon the eigenfunctions of the graph Laplacian. Importantly, until now these methods have been mostly limited to tabular domains where (1) the graph Laplacian matrix was either given or could be fully estimated, (2) performing eigendecomposition on this matrix was computationally tractable, and (3) value functions could be learned exactly. Additionally, these methods required a separate option discovery phase. These assumptions are fundamentally not scalable. In this paper we address these limitations and show how recent results for directly approximating the eigenfunctions of the Laplacian can be leveraged to truly scale up options-based exploration. To do so, we introduce a fully online deep RL algorithm for discovering Laplacian-based options and evaluate our approach on a variety of pixel-based tasks. We compare to several state-of-the-art exploration methods and show that our approach is effective, general, and especially promising in non-stationary settings.  ( 2 min )
    Flex-Net: A Graph Neural Network Approach to Resource Management in Flexible Duplex Networks. (arXiv:2301.11166v1 [cs.NI])
    Flexible duplex networks allow users to dynamically employ uplink and downlink channels without static time scheduling, thereby utilizing the network resources efficiently. This work investigates the sum-rate maximization of flexible duplex networks. In particular, we consider a network with pairwise-fixed communication links. The corresponding combinatorial optimization problem is non-deterministic polynomial (NP)-hard and has no closed-form solution. In this respect, the existing heuristics entail high computational complexity, raising a scalability issue in large networks. Motivated by the recent success of Graph Neural Networks (GNNs) in solving NP-hard wireless resource management problems, we propose a novel GNN architecture, named Flex-Net, to jointly optimize the communication direction and transmission power. The proposed GNN produces near-optimal performance while maintaining a low computational complexity compared to the most commonly used techniques. Furthermore, our numerical results shed light on the advantages of using GNNs in terms of sample complexity, scalability, and generalization capability.  ( 2 min )
    Train Hard, Fight Easy: Robust Meta Reinforcement Learning. (arXiv:2301.11147v1 [cs.LG])
    A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients. Meta-RL (MRL) addresses this issue by learning a meta-policy that adapts to new tasks. Standard MRL methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty. This limits system reliability whenever test tasks are not known in advance. In this work, we propose a robust MRL objective with a controlled robustness level. Optimization of analogous robust objectives in RL often leads to both biased gradients and data inefficiency. We prove that the former disappears in MRL, and address the latter via the novel Robust Meta RL algorithm (RoML). RoML is a meta-algorithm that generates a robust version of any given MRL algorithm, by identifying and over-sampling harder tasks throughout training. We demonstrate that RoML learns substantially different meta-policies and achieves robust returns on several navigation and continuous control benchmarks.  ( 2 min )
    Which Experiences Are Influential for Your Agent? Policy Iteration with Turn-over Dropout. (arXiv:2301.11168v1 [cs.LG])
    In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent's performance. Information about the influence is valuable for various purposes, including experience cleansing and analysis. One method for estimating the influence of individual experiences is agent comparison, but it is prohibitively expensive when there is a large number of experiences. In this paper, we present PI+ToD as a method for efficiently estimating the influence of experiences. PI+ToD is a policy iteration that efficiently estimates the influence of experiences by utilizing turn-over dropout. We demonstrate the efficiency of PI+ToD with experiments in MuJoCo environments.  ( 2 min )
    Convolutional Learning on Simplicial Complexes. (arXiv:2301.11163v1 [cs.LG])
    We propose a simplicial complex convolutional neural network (SCCNN) to learn data representations on simplicial complexes. It performs convolutions based on the multi-hop simplicial adjacencies via common faces and cofaces independently and captures the inter-simplicial couplings, generalizing the state of the art. Upon studying symmetries of the simplicial domain and the data space, it is shown to be permutation and orientation equivariant, thus incorporating such inductive biases. Based on Hodge theory, we perform a spectral analysis to understand how SCCNNs regulate data in different frequencies, showing that the convolutions via faces and cofaces operate in two orthogonal data spaces. Lastly, we study the stability of SCCNNs to domain deformations and examine the effects of various factors. Empirical results show the benefits of higher-order convolutions and inter-simplicial couplings in simplex prediction and trajectory prediction.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) it is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Federated Learning over Coupled Graphs. (arXiv:2301.11099v1 [cs.LG])
    Graphs are widely used to represent the relations among entities. When one owns the complete data, an entire graph can be easily built, and performing analysis on the graph is therefore straightforward. However, in many scenarios, it is impractical to centralize the data due to data privacy concerns. An organization or party only keeps a part of the whole graph data, i.e., graph data is isolated across different parties. Recently, Federated Learning (FL) has been proposed to solve the data isolation issue, mainly for Euclidean data. It is still a challenge to apply FL on graph data because graphs contain topological information which is notorious for its non-IID nature and is hard to partition. In this work, we propose a novel FL framework for graph data, FedCog, to efficiently handle coupled graphs, a kind of distributed graph data that widely exists in a variety of real-world applications such as mobile carriers' communication networks and banks' transaction networks. We theoretically prove the correctness and security of FedCog. Experimental results demonstrate that our method FedCog significantly outperforms traditional FL methods on graphs. Remarkably, our FedCog improves the accuracy of node classification tasks by up to 14.7%.  ( 2 min )
    FedHQL: Federated Heterogeneous Q-Learning. (arXiv:2301.11135v1 [cs.LG])
    Federated Reinforcement Learning (FedRL) encourages distributed agents to learn collectively from each other's experience to improve their performance without exchanging their raw trajectories. The existing work on FedRL assumes that all participating agents are homogeneous, which requires all agents to share the same policy parameterization (e.g., network architectures and training configurations). However, in real-world applications, agents are often in disagreement about the architecture and the parameters, possibly also because of disparate computational budgets. Because homogeneity is not given in practice, we introduce the problem setting of Federated Reinforcement Learning with Heterogeneous And bLack-box agEnts (FedRL-HALE). We present the unique challenges this new setting poses and propose the Federated Heterogeneous Q-Learning (FedHQL) algorithm that principally addresses these challenges. We empirically demonstrate the efficacy of FedHQL in boosting the sample efficiency of heterogeneous agents with distinct policy parameterization using standard RL tasks.  ( 2 min )
    Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning. (arXiv:2301.11153v1 [cs.LG])
    Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms give better performances than baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.  ( 2 min )
    SQ Lower Bounds for Random Sparse Planted Vector Problem. (arXiv:2301.11124v1 [cs.LG])
    Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $\mathbb{R}^n$. A classical question is how to recover this planted vector given a random basis of this subspace. A recent result by [ZSWB21] showed that the lattice basis reduction algorithm can recover the planted vector when $n\geq d+1$. Although the algorithm is not expected to tolerate an inverse polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low degree polynomials when $n\ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive a Statistical Query (SQ) lower bound matching the previous low degree lower bound in [MW21]. This will (i) imply that the SQ lower bound can be surpassed by lattice based algorithms, and (ii) predict the computational hardness when the planted vector is perturbed by an inverse polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that a super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n\ll \rho^2 d^{2}$ and $\rho\gg \frac{1}{\sqrt{d}}$. The most notable technique we used to derive the SQ lower bound is the almost equivalence relationship between SQ lower bound and low degree lower bound [BBH+20, MW21].  ( 2 min )
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minerva: A File-Based Ransomware Detector. (arXiv:2301.11050v1 [cs.CR])
    Ransomware is a rapidly evolving type of malware designed to encrypt user files on a device, making them inaccessible in order to exact a ransom. Ransomware attacks resulted in billions of dollars in damages in recent years and are expected to cause hundreds of billions more in the next decade. With current state-of-the-art process-based detectors being heavily susceptible to evasion attacks, no comprehensive solution to this problem is available today. This paper presents Minerva, a new approach to ransomware detection. Unlike current methods focused on identifying ransomware based on process-level behavioral modeling, Minerva detects ransomware by building behavioral profiles of files based on all the operations they receive in a time window. Minerva addresses some of the critical challenges associated with process-based approaches, specifically their vulnerability to complex evasion attacks. Our evaluation of Minerva demonstrates its effectiveness in detecting ransomware attacks, including those that are able to bypass existing defenses. Our results show that Minerva identifies ransomware activity with an average accuracy of 99.45% and an average recall of 99.66%, with 99.97% of ransomware detected within 1 second.  ( 2 min )
    Incomplete Multi-view Clustering via Prototype-based Imputation. (arXiv:2301.11045v1 [cs.LG])
    In this paper, we study how to achieve two characteristics highly expected of incomplete multi-view clustering (IMvC): i) instance commonality, meaning that within-cluster instances should share a common pattern, and ii) view versatility, meaning that cross-view samples should have view-specific patterns. To this end, we design a novel dual-stream model which employs a dual attention layer and a dual contrastive learning loss to learn view-specific prototypes and model the sample-prototype relationship. When a view is missing, our model performs data recovery using the prototypes in the missing view and the sample-prototype relationship inherited from the observed view. Thanks to our dual-stream model, both cluster- and view-specific information can be captured, and thus the instance commonality and view versatility can be preserved to facilitate IMvC. Extensive experiments demonstrate the superiority of our method on six challenging benchmarks compared with 11 approaches. The code will be released.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices. (arXiv:2301.10999v1 [cs.LG])
    The ability to accurately predict deep neural network (DNN) inference performance metrics, such as latency, power, and memory footprint, for an arbitrary DNN on a target hardware platform is essential to the design of DNN based models. This ability is critical for the (manual or automatic) design, optimization, and deployment of practical DNNs for a specific hardware deployment platform. Unfortunately, these metrics are slow to evaluate using simulators (where available) and typically require measurement on the target hardware. This work describes PerfSAGE, a novel graph neural network (GNN) that predicts inference latency, energy, and memory footprint on an arbitrary DNN TFlite graph (TFL, 2017). In contrast, previously published performance predictors can only predict latency and are restricted to pre-defined construction rules or search spaces. This paper also describes the EdgeDLPerf dataset of 134,912 DNNs randomly sampled from four task search spaces and annotated with inference performance metrics from three edge hardware platforms. Using this dataset, we train PerfSAGE and provide experimental results that demonstrate state-of-the-art prediction accuracy with a Mean Absolute Percentage Error of <5% across all targets and model search spaces. These results: (1) outperform previous state-of-the-art GNN-based predictors (Dudziak et al., 2020), (2) accurately predict performance on accelerators (a shortfall of non-GNN-based predictors (Zhang et al., 2021)), and (3) demonstrate predictions on arbitrary input graphs without modifications to the feature extractor.  ( 2 min )
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
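    The color refinement that the abstract above refers to can be sketched in a few lines. This is a minimal illustration of the standard 1-WL procedure, not code from the paper; the function name `wl_colors` and the adjacency-dict input format are assumptions made for the example. The number of distinct colors in the result is the quantity the abstract relates to the VC dimension.

```python
def wl_colors(adj):
    """Run 1-WL color refinement on a graph given as {node: [neighbors]}.

    Returns a dict mapping each node to its final (stable) color.
    Hypothetical helper for illustration only.
    """
    colors = {v: 0 for v in adj}  # start from a uniform coloring
    for _ in range(len(adj)):  # the partition stabilizes within |V| rounds
        # A node's signature is its own color plus the sorted
        # multiset of its neighbors' colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Relabel the distinct signatures with small integers.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_colors = {v: palette[signatures[v]] for v in adj}
        # Refinement only ever splits color classes, so an unchanged
        # class count means the partition is stable.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors
    return colors


path = {0: [1], 1: [0, 2], 2: [1]}          # path on three vertices
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
num_path_colors = len(set(wl_colors(path).values()))
num_triangle_colors = len(set(wl_colors(triangle).values()))
```

    On the path, the two endpoints share one color and the middle vertex gets another; on the triangle, all vertices stay in a single class. Note that the integer labels are only meaningful within one graph, so comparing two graphs would require refining them jointly (for example on their disjoint union).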
    Multi-Agent congestion cost minimization with linear function approximation. (arXiv:2301.10993v1 [cs.LG])
    This work considers multiple agents traversing a network from a source node to the goal node. The cost to an agent for traveling a link has a private as well as a congestion component. The agent's objective is to find a path to the goal node with minimum overall cost in a decentralized way. We model this as a fully decentralized multi-agent reinforcement learning problem and propose a novel multi-agent congestion cost minimization (MACCM) algorithm. Our MACCM algorithm uses linear function approximations of transition probabilities and the global cost function. In the absence of a central controller and to preserve privacy, agents communicate the cost function parameters to their neighbors via a time-varying communication network. Moreover, each agent maintains its estimate of the global state-action value, which is updated via a multi-agent extended value iteration (MAEVI) sub-routine. We show that our MACCM algorithm achieves a sub-linear regret. The proof requires the convergence of cost function parameters, the MAEVI algorithm, and analysis of the regret bounds induced by the MAEVI triggering condition for each agent. We implement our algorithm on a two-node network with multiple links to validate it. We first identify the optimal policy, the optimal number of agents going to the goal node in each period. We observe that the average regret is close to zero for 2 and 3 agents. The optimal policy captures the trade-off between the minimum cost of staying at a node and the congestion cost of going to the goal node. Our work is a generalization of learning the stochastic shortest path problem.  ( 2 min )
    Time-sensitive Learning for Heterogeneous Federated Edge Intelligence. (arXiv:2301.10977v1 [cs.LG])
    Real-time machine learning has recently attracted significant interest due to its potential to support instantaneous learning, adaptation, and decision making in a wide range of application domains, including self-driving vehicles, intelligent transportation, and industry automation. We investigate real-time ML in a federated edge intelligence (FEI) system, an edge computing system that implements federated learning (FL) solutions based on data samples collected and uploaded from decentralized data networks. FEI systems often exhibit heterogeneous communication and computational resource distribution, as well as non-i.i.d. data samples, resulting in long model training time and inefficient resource utilization. Motivated by this fact, we propose a time-sensitive federated learning (TS-FL) framework to minimize the overall run-time for collaboratively training a shared ML model. Training acceleration solutions for both TS-FL with synchronous coordination (TS-FL-SC) and asynchronous coordination (TS-FL-ASC) are investigated. To address the straggler effect in TS-FL-SC, we develop an analytical solution to characterize the impact of selecting different subsets of edge servers on the overall model training time. A server dropping-based solution is proposed to allow slow-performance edge servers to be removed from participating in model training if their impact on the resulting model accuracy is limited. A joint optimization algorithm is proposed to minimize the overall time consumption of model training by selecting the participating edge servers and the local epoch number. We develop an analytical expression to characterize the impact of the staleness effect of asynchronous coordination and the straggler effect of FL on the time consumption of TS-FL-ASC. Experimental results show that TS-FL-SC and TS-FL-ASC can reduce the overall model training time by up to 63% and 28%, respectively.  ( 2 min )
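    The straggler effect described above can be made concrete with a toy model. The sketch below assumes that a synchronous round finishes only when the slowest participating server finishes, and it drops servers purely by speed; the accuracy-impact criterion and the joint optimization from the abstract are not modeled. The function names and the timing values are hypothetical.

```python
def round_time(compute_times, selected):
    """Time of one synchronous round: the slowest selected server sets the pace."""
    return max(compute_times[s] for s in selected)


def drop_slowest(compute_times, n_drop):
    """Return the server names left after dropping the n_drop slowest ones."""
    ranked = sorted(compute_times, key=compute_times.get)  # fastest first
    return ranked[: len(ranked) - n_drop]


# Hypothetical per-round times (seconds) for three edge servers.
times = {"a": 1.0, "b": 1.2, "c": 5.0}
full_round = round_time(times, times)        # all servers participate
kept = drop_slowest(times, 1)                # remove the single straggler
reduced_round = round_time(times, kept)
```

    In this toy setting, removing the one slow server cuts the round time from 5.0 s to 1.2 s, which is the kind of gain a dropping-based scheme aims for when the dropped server contributes little to accuracy.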
    A Fully First-Order Method for Stochastic Bilevel Optimization. (arXiv:2301.10945v1 [math.OC])
    We consider stochastic unconstrained bilevel optimization problems when only the first-order gradient oracles are available. While numerous optimization methods have been proposed for tackling bilevel problems, existing methods either tend to require possibly expensive calculations regarding Hessians of lower-level objectives, or lack rigorous finite-time performance guarantees. In this work, we propose a Fully First-order Stochastic Approximation (F2SA) method, and study its non-asymptotic convergence properties. Specifically, we show that F2SA converges to an $\epsilon$-stationary solution of the bilevel problem after $\epsilon^{-7/2}, \epsilon^{-5/2}$, and $\epsilon^{-3/2}$ iterations (each iteration using $O(1)$ samples) when stochastic noises are in both level objectives, only in the upper-level objective, and not present (deterministic settings), respectively. We further show that if we employ momentum-assisted gradient estimators, the iteration complexities can be improved to $\epsilon^{-5/2}, \epsilon^{-4/2}$, and $\epsilon^{-3/2}$, respectively. We demonstrate even superior practical performance of the proposed method over existing second-order based approaches on MNIST data-hypercleaning experiments.  ( 2 min )
    Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure. (arXiv:2301.10956v1 [cs.LG])
    Graph Neural Networks (GNNs) are popular models for graph learning problems. GNNs show strong empirical performance in many practical tasks. However, the theoretical properties have not been completely elucidated. In this paper, we investigate whether GNNs can exploit the graph structure from the perspective of the expressive power of GNNs. In our analysis, we consider graph generation processes that are controlled by hidden node features, which contain all information about the graph structure. A typical example of this framework is kNN graphs constructed from the hidden features. In our main results, we show that GNNs can recover the hidden node features from the input graph alone, even when all node features, including the hidden features themselves and any indirect hints, are unavailable. GNNs can further use the recovered node features for downstream tasks. These results show that GNNs can fully exploit the graph structure by themselves, and in effect, GNNs can use both the hidden and explicit node features for downstream tasks. In the experiments, we confirm the validity of our results by showing that GNNs can accurately recover the hidden features using a GNN architecture built based on our theoretical analysis.  ( 2 min )
    Visiting Distant Neighbors in Graph Convolutional Networks. (arXiv:2301.10960v1 [cs.LG])
    We extend the graph convolutional network method for deep learning on graph data to higher order in terms of neighboring nodes. In order to construct representations for a node in a graph, in addition to the features of the node and its immediate neighboring nodes, we also include more distant nodes in the calculations. In experimenting with a number of publicly available citation graph datasets, we show that this higher order neighbor visiting pays off by outperforming the original model especially when we have a limited number of available labeled data points for the training of the model.  ( 2 min )
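    The idea of including more distant nodes can be illustrated with a small aggregation routine. This is only a sketch under stated assumptions: it groups nodes by shortest-path distance and concatenates per-hop mean features, whereas the paper's actual higher-order propagation rule is not given in the abstract. The names `hop_sets` and `aggregate` are hypothetical.

```python
from collections import deque


def hop_sets(adj, source, max_hops):
    """Return {hop: set of nodes at exactly that shortest-path distance}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if dist[v] == max_hops:
            continue  # do not expand beyond the requested radius
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    hops = {h: set() for h in range(max_hops + 1)}
    for node, d in dist.items():
        hops[d].add(node)
    return hops


def aggregate(adj, features, node, max_hops=2):
    """Concatenate the mean feature vector of each hop ring around `node`."""
    dim = len(features[node])
    out = []
    for _, members in sorted(hop_sets(adj, node, max_hops).items()):
        if members:
            out.extend(
                sum(features[m][i] for m in members) / len(members)
                for i in range(dim)
            )
        else:
            out.extend([0.0] * dim)  # empty ring: pad with zeros
    return out


# Path graph 0 - 1 - 2 - 3 with one-dimensional features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: [1.0], 1: [2.0], 2: [4.0], 3: [8.0]}
rep_node1 = aggregate(adj, features, 1, max_hops=2)
```

    With `max_hops=1` this reduces to the node plus its immediate neighbors; raising `max_hops` adds one block of features per additional ring, which is one simple way to let a representation see second-order neighbors.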
    Privacy-Preserving Joint Edge Association and Power Optimization for the Internet of Vehicles via Federated Multi-Agent Reinforcement Learning. (arXiv:2301.11014v1 [cs.LG])
    Proactive edge association is capable of improving wireless connectivity at the cost of increased handover (HO) frequency and energy consumption, while relying on a large amount of private information sharing required for decision making. In order to improve the connectivity-cost trade-off without privacy leakage, we investigate the privacy-preserving joint edge association and power allocation (JEAPA) problem in the face of the environmental uncertainty and the infeasibility of individual learning. Upon modelling the problem by a decentralized partially observable Markov Decision Process (Dec-POMDP), it is solved by federated multi-agent reinforcement learning (FMARL) through only sharing encrypted training data for federatively learning the policy sought. Our simulation results show that the proposed solution strikes a compelling trade-off, while preserving a higher privacy level than the state-of-the-art solutions.  ( 2 min )
    Neural Dynamic Focused Topic Model. (arXiv:2301.10988v1 [cs.CL])
    Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic to be active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the dynamic Focused Topic Model. Indeed, we develop a neural model for topic evolution which exploits sequences of Bernoulli random variables in order to track the appearances of topics, thereby decoupling their activities from their proportions. We evaluate our model on three different datasets (the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset) and show that it (i) outperforms state-of-the-art topic models in generalization tasks and (ii) performs comparably to them on prediction tasks, while employing roughly the same number of parameters, and converging about two times faster. Source code to reproduce our experiments is available online.  ( 2 min )
    On the Importance of Noise Scheduling for Diffusion Models. (arXiv:2301.10972v1 [cs.CV])
    We empirically study the effect of noise scheduling strategies for denoising diffusion generative models. There are three findings: (1) the noise scheduling is crucial for the performance, and the optimal one depends on the task (e.g., image sizes), (2) when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels), and (3) simply scaling the input data by a factor of $b$ while keeping the noise schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a good strategy across image sizes. This simple recipe, when combined with recently proposed Recurrent Interface Network (RIN), yields state-of-the-art pixel-based diffusion models for high-resolution images on ImageNet, enabling single-stage, end-to-end generation of diverse and high-fidelity images at 1024$\times$1024 resolution for the first time (without upsampling/cascades).  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur in such a challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization techniques. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps surprisingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in the online setting. We analyze this interesting phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
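    A hard thresholding based stochastic gradient step, as mentioned above, can be sketched as follows. This is a generic illustration for squared loss, not the authors' algorithm: the step size, the sparsity level `k`, and the function names are assumptions. The samples are held in a list here only for convenience; in a streaming setting each sample would be consumed and discarded, so memory would not grow with the sample count.

```python
import random


def hard_threshold(w, k):
    """Keep the k largest-magnitude entries of w and set the rest to zero."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:k])
    return [w[i] if i in keep else 0.0 for i in range(len(w))]


def sparse_sgd(samples, dim, k, step=0.05, epochs=50, seed=0):
    """Stochastic gradient descent on squared loss with a k-sparse iterate."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs * len(samples)):
        x, y = rng.choice(samples)  # one randomly chosen sample per iteration
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - step * err * xi for wi, xi in zip(w, x)]  # gradient step
        w = hard_threshold(w, k)  # project back onto k-sparse vectors
    return w


# Synthetic data from a 2-sparse ground truth in 5 dimensions (illustrative).
data_rng = random.Random(1)
truth = [2.0, 0.0, 0.0, -1.0, 0.0]
samples = []
for _ in range(40):
    x = [data_rng.uniform(-1.0, 1.0) for _ in range(5)]
    samples.append((x, sum(t * xi for t, xi in zip(truth, x))))
w_hat = sparse_sgd(samples, dim=5, k=2)
nonzeros = sum(1 for wi in w_hat if wi != 0.0)
```

    Because the projection runs after every update, the iterate never has more than `k` nonzero entries, in contrast to a plain stochastic (sub)gradient step on the Lasso objective, which generally produces dense iterates.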
    SparDA: Accelerating Dynamic Sparse Deep Neural Networks via Sparse-Dense Transformation. (arXiv:2301.10936v1 [cs.LG])
    Due to its high cost-effectiveness, sparsity has become the most important approach for building efficient deep-learning models. However, commodity accelerators are built mainly for efficient dense computation, creating a huge gap for general sparse computation to leverage. Existing solutions have to use time-consuming compiling to improve the efficiency of sparse kernels in an ahead-of-time manner and thus are limited to static sparsity. A wide range of dynamic sparsity opportunities is missed because their sparsity patterns are only known at runtime. This limits the future of building more biological brain-like neural networks that should be dynamically and sparsely activated. In this paper, we bridge the gap between sparse computation and commodity accelerators by proposing a system, called Spider, for efficiently executing deep learning models with dynamic sparsity. We identify an important property called permutation invariance that applies to most deep-learning computations. The property enables Spider (1) to extract dynamic sparsity patterns of tensors that are only known at runtime with little overhead; and (2) to transform the dynamic sparse computation into an equivalent dense computation, which has been heavily optimized on commodity accelerators. Extensive evaluation on diverse models shows Spider can extract and transform dynamic sparsity with negligible overhead while delivering up to a 9.4x speedup over state-of-the-art solutions.  ( 2 min )
    Affective Faces for Goal-Driven Dyadic Communication. (arXiv:2301.10939v1 [cs.CV])
    We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context. Our approach further allows the listener to be conditioned on their own goals, personalities, or backgrounds. Our approach models conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show our approach is able to output listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.  ( 2 min )
    Super-Resolution Analysis via Machine Learning: A Survey for Fluid Flows. (arXiv:2301.10937v1 [physics.flu-dyn])
    This paper surveys machine-learning-based super-resolution reconstruction for vortical flows. Super resolution aims to find the high-resolution flow fields from low-resolution data and is generally an approach used in image reconstruction. In addition to surveying a variety of recent super-resolution applications, we provide case studies of super-resolution analysis for an example of two-dimensional decaying isotropic turbulence. We demonstrate that physics-inspired model designs enable successful reconstruction of vortical flows from spatially limited measurements. We also discuss the challenges and outlooks of machine-learning-based super-resolution analysis for fluid flow applications. The insights gained from this study can be leveraged for super-resolution analysis of numerical and experimental flow data.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of the REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Efficient Trust Region-Based Safe Reinforcement Learning with Low-Bias Distributional Actor-Critic. (arXiv:2301.10923v1 [cs.LG])
    To apply reinforcement learning (RL) to real-world applications, agents are required to adhere to the safety guidelines of their respective domains. Safe RL can effectively handle the guidelines by converting them into constraints of the RL problem. In this paper, we develop a safe distributional RL method based on the trust region method, which can satisfy constraints consistently. However, policies may not meet the safety guidelines due to the estimation bias of distributional critics, and importance sampling required for the trust region method can hinder performance due to its significant variance. Hence, we enhance safety performance through the following approaches. First, we train distributional critics to have low estimation biases using proposed target distributions where bias-variance can be traded off. Second, we propose novel surrogates for the trust region method expressed with Q-functions using the reparameterization trick. Additionally, depending on initial policy settings, there can be no policy satisfying constraints within a trust region. To handle this infeasibility issue, we propose a gradient integration method which is guaranteed to find a policy satisfying all constraints from an unsafe initial policy. In extensive experiments, the proposed method with risk-averse constraints shows minimal constraint violations while achieving high returns compared to existing safe RL methods.  ( 2 min )
    Partial advantage estimator for proximal policy optimization. (arXiv:2301.10920v1 [cs.LG])
    Estimation of value in policy gradient methods is a fundamental problem. Generalized Advantage Estimation (GAE) is an exponentially-weighted estimator of an advantage function similar to the $\lambda$-return. It substantially reduces the variance of policy gradient estimates at the expense of bias. In practical applications, a truncated GAE is used due to the incompleteness of the trajectory, which results in a large bias during estimation. To address this challenge, instead of using the entire truncated GAE, we propose to take a part of it when calculating updates, which significantly reduces the bias resulting from the incomplete trajectory. We perform experiments in MuJoCo and $\mu$RTS to investigate the effect of different partial coefficients and sampling lengths. We show that our partial GAE approach yields better empirical results in both environments.  ( 2 min )
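    A minimal sketch of truncated GAE may help make the idea above concrete. The backward recursion is the standard GAE computation; the partial_gae helper, and in particular its rule of keeping only the first `keep` time steps (those whose estimates sum the most TD residuals before the truncation point), is an assumption about what taking "a part of it" could look like, not the paper's exact procedure.

```python
def truncated_gae(rewards, values, bootstrap_value, gamma=0.99, lam=0.95):
    """Standard GAE over a trajectory truncated after T steps.

    rewards[t]      reward received at step t, for t = 0..T-1
    values[t]       critic estimate V(s_t), for t = 0..T-1
    bootstrap_value critic estimate V(s_T) at the truncation point
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value = bootstrap_value
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD residual, then the exponentially weighted sum.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def partial_gae(rewards, values, bootstrap_value, keep, gamma=0.99, lam=0.95):
    """Hypothetical partial estimator: compute the full truncated GAE but
    use only the first `keep` steps for the policy update."""
    adv = truncated_gae(rewards, values, bootstrap_value, gamma, lam)
    return adv[:keep]

if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.5, 1.0]
    values = [0.8, 0.6, 0.7, 0.9]
    print(truncated_gae(rewards, values, bootstrap_value=0.5))
    print(partial_gae(rewards, values, bootstrap_value=0.5, keep=2))
```

    The last steps of a truncated trajectory rely almost entirely on the bootstrap value, which is why discarding them is a plausible way to reduce bias at the cost of using fewer samples per rollout.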
    SuperFed: Weight Shared Federated Learning. (arXiv:2301.10879v1 [cs.LG])
    Federated Learning (FL) is a well-established technique for privacy preserving distributed training. Much attention has been given to various aspects of FL training. A growing number of applications that consume FL-trained models, however, increasingly operate under dynamically and unpredictably variable conditions, rendering a single model insufficient. We argue for training a global family of models cost efficiently in a federated fashion. Training them independently for different tradeoff points incurs $O(k)$ cost for any $k$ architectures of interest, however. Straightforward applications of FL techniques to recent weight-shared training approaches are either infeasible or prohibitively expensive. We propose SuperFed - an architectural framework that incurs $O(1)$ cost to co-train a large family of models in a federated fashion by leveraging weight-shared learning. We achieve an order of magnitude cost savings on both communication and computation by proposing two novel training mechanisms: (a) distribution of weight-shared models to federated clients, (b) central aggregation of arbitrarily overlapping weight-shared model parameters. The combination of these mechanisms is shown to reach an order of magnitude (9.43x) reduction in computation and communication cost for training a $5 \times 10^{18}$-sized family of models, compared to independently training as few as $k = 9$ DNNs without any accuracy loss.  ( 2 min )
    Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation. (arXiv:2301.10814v1 [q-bio.BM])
    Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.  ( 2 min )
    Joint action loss for proximal policy optimization. (arXiv:2301.10919v1 [cs.LG])
    PPO (Proximal Policy Optimization) is a state-of-the-art policy gradient algorithm that has been successfully applied to complex computer games such as Dota 2 and Honor of Kings. In these environments, an agent makes compound actions consisting of multiple sub-actions. PPO uses clipping to restrict policy updates. Although clipping is simple and effective, it is not efficient in its sample use. For compound actions, most PPO implementations consider the joint probability (density) of sub-actions, which means that if the ratio of a sample (state compound-action pair) exceeds the range, the gradient the sample produces is zero. Instead, for each sub-action we calculate the loss separately, which is less prone to clipping during updates, thereby making better use of samples. Further, we propose a multi-action mixed loss that combines joint and separate probabilities. We perform experiments in Gym-$\mu$RTS and MuJoCo. Our hybrid model improves performance by more than 50% in different MuJoCo environments compared to OpenAI's PPO benchmark results. In Gym-$\mu$RTS, we find the sub-action loss outperforms the standard PPO approach, especially when the clip range is large. Our findings suggest this method can better balance the use-efficiency and quality of samples.  ( 2 min )
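    The contrast between joint and per-sub-action clipping described above can be sketched as follows. The clipped surrogate is the standard PPO objective; summing the per-sub-action terms and mixing the two objectives with a weight `w` are assumptions made for illustration, and all function names are hypothetical rather than taken from the paper.

```python
import math

def clipped_term(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single probability ratio."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def is_clipped(ratio, advantage, eps=0.2):
    """True when the clipped branch is active, i.e. the gradient is zero."""
    return (advantage > 0 and ratio > 1.0 + eps) or (advantage < 0 and ratio < 1.0 - eps)

def joint_surrogate(new_logps, old_logps, advantage, eps=0.2):
    """One ratio for the whole compound action (product of sub-action ratios)."""
    ratio = math.exp(sum(new_logps) - sum(old_logps))
    return clipped_term(ratio, advantage, eps)

def separate_surrogate(new_logps, old_logps, advantage, eps=0.2):
    """One clipped term per sub-action, summed (the sum is an assumption)."""
    return sum(
        clipped_term(math.exp(n - o), advantage, eps)
        for n, o in zip(new_logps, old_logps)
    )

def mixed_surrogate(new_logps, old_logps, advantage, w=0.5, eps=0.2):
    """Hypothetical convex combination of the joint and separate objectives."""
    joint = joint_surrogate(new_logps, old_logps, advantage, eps)
    separate = separate_surrogate(new_logps, old_logps, advantage, eps)
    return w * joint + (1.0 - w) * separate

if __name__ == "__main__":
    # Two sub-actions, each with ratio 1.15: individually inside the clip
    # range [0.8, 1.2], but their product (about 1.32) is outside it.
    new_logps = [math.log(1.15), math.log(1.15)]
    old_logps = [0.0, 0.0]
    joint_ratio = math.exp(sum(new_logps) - sum(old_logps))
    print(is_clipped(joint_ratio, 1.0), is_clipped(1.15, 1.0))
    print(mixed_surrogate(new_logps, old_logps, advantage=1.0))
```

    In the example the joint ratio is clipped, so that sample would contribute no gradient under the joint objective, while both per-sub-action ratios remain inside the clip range and still contribute.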
    Distilling Cognitive Backdoor Patterns within an Image. (arXiv:2301.10908v1 [cs.LG])
    This paper proposes a simple method to distill and detect backdoor patterns within an image: Cognitive Distillation (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that can lead to the same model output (i.e., logits or deep features). The extracted pattern can help understand the cognitive mechanism of a model on clean vs. backdoor images and is thus called a Cognitive Pattern (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One thus can leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can be applied to help detect potential biases in face datasets. Code is available at https://github.com/HanxunH/CognitiveDistillation.  ( 2 min )
    Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL). (arXiv:2301.10888v1 [cs.LG])
    Introduction: Data imbalance is one of the crucial issues in big data analysis with fewer labels, for example in real-world healthcare data, spam detection labels, and financial fraud detection datasets. Many data balance methods have been introduced to improve machine learning algorithms' performance. Research claims that SMOTE and SMOTE-based data-augmentation methods (which generate new data points) could improve algorithm performance. However, we found that in many online tutorials the evaluation methods were applied to synthesized datasets, which introduced bias into the evaluation and produced a false improvement in performance. In this study, we propose a new evaluation framework for imbalanced data learning methods. We experimented with five data balance methods to determine whether the performance of algorithms improves. Methods: We collected 8 imbalanced healthcare datasets with different imbalance rates from different domains. We applied 6 data augmentation methods with 11 machine learning methods to test whether data augmentation helps improve machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework. Results: Using traditional data augmentation evaluation methods gives a false impression of improved performance. However, our proposed evaluation method shows that data augmentation has limited ability to improve the results. Conclusion: EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed that data augmentation does not help improve ML prediction performance.  ( 2 min )
    GPU-based Private Information Retrieval for On-Device Machine Learning Inference. (arXiv:2301.10904v1 [cs.CR])
    On-device machine learning (ML) inference can enable the use of private user data on user devices without remote servers. However, a pure on-device solution to private ML inference is impractical for many applications that rely on embedding tables that are too large to be stored on-device. To overcome this barrier, we propose the use of private information retrieval (PIR) to efficiently and privately retrieve embeddings from servers without sharing any private information during on-device ML inference. As off-the-shelf PIR algorithms are usually too computationally intensive to directly use for latency-sensitive inference tasks, we 1) develop a novel algorithm for accelerating PIR on GPUs, and 2) co-design PIR with the downstream ML application to obtain further speedup. Our GPU acceleration strategy improves system throughput by more than $20 \times$ over an optimized CPU PIR implementation, and our co-design techniques obtain over $5 \times$ additional throughput improvement at fixed model quality. Together, on various on-device ML applications such as recommendation and language modeling, our system on a single V100 GPU can serve up to $100,000$ queries per second -- a $>100 \times$ throughput improvement over a naively implemented system -- while maintaining model accuracy, and limiting inference communication and response latency to within $300$KB and $<100$ms respectively.  ( 2 min )
    Learning Gradients of Convex Functions with Monotone Gradient Networks. (arXiv:2301.10862v1 [cs.LG])
    While much effort has been devoted to deriving and studying effective convex formulations of signal processing problems, the gradients of convex functions also have critical applications ranging from gradient-based optimization to optimal transport. Recent works have explored data-driven methods for learning convex objectives, but learning their monotone gradients is seldom studied. In this work, we propose Cascaded and Modular Monotone Gradient Networks (C-MGN and M-MGN respectively), two monotone gradient neural network architectures for directly learning the gradients of convex functions. We show that our networks are simpler to train, learn monotone gradient fields more accurately, and use significantly fewer parameters than state-of-the-art methods. We further demonstrate their ability to learn optimal transport mappings to augment driving image data.  ( 2 min )
    Reef-insight: A framework for reef habitat mapping with clustering methods via remote sensing. (arXiv:2301.10876v1 [cs.LG])
    Environmental damage has been of much concern, particularly in coastal areas and the oceans, given climate change and the drastic effects of pollution and extreme climate events. Our present-day analytical capabilities, along with advancements in information acquisition techniques such as remote sensing, can be utilized for the management and study of coral reef ecosystems. In this paper, we present Reef-insight, an unsupervised machine learning framework that features advanced clustering methods and remote sensing for reef community mapping. Our framework compares different clustering methods to evaluate them for reef community mapping using remote sensing data. We evaluate four major clustering approaches, namely k-means, hierarchical clustering, Gaussian mixture model, and density-based clustering, based on qualitative and visual assessment. We utilise remote sensing data featuring the Heron reef island region in the Great Barrier Reef of Australia. Our results indicate that clustering methods using remote sensing data can well identify benthic and geomorphic clusters that are found in reefs when compared to other studies. Our results indicate that Reef-insight can generate detailed reef community maps outlining distinct reef habitats and has the potential to enable further insights for reef restoration projects. We release our framework as open source software to enable its extension to different parts of the world.  ( 2 min )
    Partial Mobilization: Tracking Multilingual Information Flows Amongst Russian Media Outlets and Telegram. (arXiv:2301.10856v1 [cs.CY])
    In response to disinformation and propaganda from Russian online media following the Russian invasion of Ukraine, Russian outlets including Russia Today and Sputnik News were banned throughout Europe. Many of these Russian outlets, in order to reach their audiences, began to heavily promote their content on messaging services like Telegram. In this work, to understand this phenomenon, we study how 16 Russian media outlets have interacted with and utilized 732 Telegram channels throughout 2022. To do this, we utilize a multilingual version of the foundational model MPNet to embed articles and Telegram messages in a shared embedding space and semantically compare content. Leveraging a parallelized version of DP-Means clustering, we perform paragraph-level topic/narrative extraction and time-series analysis with Hawkes Processes. With this approach, across our websites, we find between 2.3% (ura.news) and 26.7% (ukraina.ru) of their content originated/resulted from activity on Telegram. Finally, tracking the spread of individual narratives, we measure the rate at which these websites and channels disseminate content within the Russian media ecosystem.  ( 2 min )
    Automatic Intrinsic Reward Shaping for Exploration in Deep Reinforcement Learning. (arXiv:2301.10886v1 [cs.LG])
    We present AIRS: Automatic Intrinsic Reward Shaping that intelligently and adaptively provides high-quality intrinsic rewards to enhance exploration in reinforcement learning (RL). More specifically, AIRS selects a shaping function from a predefined set based on the estimated task return in real-time, providing reliable exploration incentives and alleviating the biased objective problem. Moreover, we develop an intrinsic reward toolkit to provide efficient and reliable implementations of diverse intrinsic reward approaches. We test AIRS on various tasks of Procgen games and DeepMind Control Suite. Extensive simulation demonstrates that AIRS can outperform the benchmarking schemes and achieve superior performance with a simple architecture.  ( 2 min )
    When Layers Play the Lottery, all Tickets Win at Initialization. (arXiv:2301.10835v1 [cs.LG])
    Pruning is a standard technique for reducing the computational cost of deep networks. Many advances in pruning leverage concepts from the Lottery Ticket Hypothesis (LTH). LTH reveals that inside a trained dense network exist sparse subnetworks (tickets) able to achieve similar accuracy (i.e., win the lottery - winning tickets). Pruning at initialization focuses on finding winning tickets without training a dense network. Studies on these concepts share the trend that subnetworks come from weight or filter pruning. In this work, we investigate LTH and pruning at initialization through the lens of layer pruning. First, we confirm the existence of winning tickets when the pruning process removes layers. Motivated by this observation, we propose to discover these winning tickets at initialization, eliminating the requirement of heavy computational resources for training the initial (over-parameterized) dense network. Extensive experiments show that our winning tickets notably speed up the training phase and reduce carbon emissions by up to 51%, an important step towards democratization and green Artificial Intelligence. Beyond computational benefits, our winning tickets exhibit robustness against adversarial and out-of-distribution examples. Finally, we show that our subnetworks easily win the lottery at initialization while tickets from filter removal (the standard structured LTH) hardly become winning tickets.  ( 2 min )
    RobustPdM: Designing Robust Predictive Maintenance against Adversarial Attacks. (arXiv:2301.10822v1 [cs.CR])
    The state-of-the-art predictive maintenance (PdM) techniques have shown great success in reducing maintenance costs and downtime of complicated machines while increasing overall productivity through extensive utilization of Internet-of-Things (IoT) and Deep Learning (DL). Unfortunately, IoT sensors and DL algorithms are both prone to cyber-attacks. For instance, DL algorithms are known for their susceptibility to adversarial examples. Such adversarial attacks are vastly under-explored in the PdM domain. This is because the adversarial attacks in the computer vision domain for classification tasks cannot be directly applied to the PdM domain for multivariate time series (MTS) regression tasks. In this work, we propose an end-to-end methodology to design adversarially robust PdM systems by extensively analyzing the effect of different types of adversarial attacks and proposing a novel adversarial defense technique for DL-enabled PdM models. First, we propose novel MTS Projected Gradient Descent (PGD) and MTS PGD with random restarts (PGD_r) attacks. Then, we evaluate the impact of MTS PGD and PGD_r along with MTS Fast Gradient Sign Method (FGSM) and MTS Basic Iterative Method (BIM) on Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and Bi-directional LSTM based PdM systems. Our results using NASA's turbofan engine dataset show that adversarial attacks can cause a severe defect (up to 11X) in the RUL prediction, outperforming the effectiveness of the state-of-the-art PdM attacks by 3X. Furthermore, we present a novel approximate adversarial training method to defend against adversarial attacks. We observe that approximate adversarial training can significantly improve the robustness of PdM models (up to 54X) and outperforms the state-of-the-art PdM defense methods by offering 3X more robustness.  ( 2 min )
    Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data. (arXiv:2301.10859v1 [cs.LG])
    We introduce the Salesforce CausalAI Library, an open-source library for causal analysis using observational data. It supports causal discovery and causal inference for tabular and time series data, of both discrete and continuous types. This library includes algorithms that handle linear and non-linear causal relationships between variables, and uses multi-processing for speed-up. We also include a data generator capable of generating synthetic data with a specified structural equation model for both the aforementioned data formats and types, which helps users control the ground-truth causal process while investigating various algorithms. Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding. The goal of this library is to provide a fast and flexible solution for a variety of problems in the domain of causality. This technical report describes the Salesforce CausalAI API along with its capabilities, the implementations of the supported algorithms, and experiments demonstrating their performance and speed. Our library is available at https://github.com/salesforce/causalai.  ( 2 min )
    Improved Bitcoin Price Prediction based on COVID-19 data. (arXiv:2301.10840v1 [cs.LG])
    Social turbulence can affect people's financial decisions, causing changes in spending and saving. During global turbulence as significant as the COVID-19 pandemic, such changes are inevitable. Here we examine how the effects of COVID-19 on various jurisdictions influenced the global price of Bitcoin. We hypothesize that lockdowns and expectations of economic recession erode people's trust in fiat (government-issued) currencies, thus elevating cryptocurrencies. Hence, we expect to identify a causal relation between the turbulence caused by the pandemic, demand for Bitcoin, and ultimately its price. To test the hypothesis, we merged datasets of Bitcoin prices and COVID-19 cases and deaths. We also engineered extra features and applied statistical and machine learning (ML) models. We applied a Random Forest (RF) model to identify and rank feature importance, and ran a Long Short-Term Memory (LSTM) model on the Bitcoin price dataset twice: with and without accounting for COVID-19 related features. We find that adding COVID-19 data to the LSTM model improved prediction of Bitcoin prices.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    Unravelling physics beyond the standard model with classical and quantum anomaly detection. (arXiv:2301.10787v1 [hep-ex])
    Much hope for finding new physics phenomena at the microscopic scale relies on the observations obtained from High Energy Physics experiments, like the ones performed at the Large Hadron Collider (LHC). However, current experiments do not indicate clear signs of new physics that could guide the development of additional Beyond Standard Model (BSM) theories. Identifying signatures of new physics out of the enormous amount of data produced at the LHC falls into the class of anomaly detection and constitutes one of the greatest computational challenges. In this article, we propose a novel strategy to perform anomaly detection in a supervised learning setting, based on the artificial creation of anomalies through a random process. For the resulting supervised learning problem, we successfully apply classical and quantum Support Vector Classifiers (CSVC and QSVC respectively) to identify the artificial anomalies among the SM events. Even more promising, we find that an SVC trained to identify the artificial anomalies can identify realistic BSM events with high accuracy. In parallel, we also explore the potential of quantum algorithms for improving the classification accuracy and provide plausible conditions for the best exploitation of this novel computational paradigm.  ( 2 min )
    Increasing Fairness in Compromise on Accuracy via Weighted Vote with Learning Guarantees. (arXiv:2301.10813v1 [cs.LG])
    As bias in widely applied machine learning systems is taken more and more seriously, researchers are troubled that increasing fairness in most cases decreases accuracy. To address this problem, we present a novel analysis of the expected fairness quality via weighted vote, suitable for both binary and multi-class classification. The analysis takes the correction of biased predictions by ensemble members into account and provides learning bounds that are amenable to efficient minimisation. We further propose a pruning method based on this analysis and the concepts of domination and Pareto optimality, which is able to increase fairness under a prerequisite of little or even no accuracy decline. The experimental results indicate that the proposed learning bounds are faithful and that the proposed pruning method can indeed increase ensemble fairness without much accuracy degradation.  ( 2 min )
    Gene-SGAN: a method for discovering disease subtypes with imaging and genetic signatures via multi-view weakly-supervised deep clustering. (arXiv:2301.10772v1 [q-bio.QM])
    Disease heterogeneity has been a critical challenge for precision diagnosis and treatment, especially in neurologic and neuropsychiatric diseases. Many diseases can display multiple distinct brain phenotypes across individuals, potentially reflecting disease subtypes that can be captured using MRI and machine learning methods. However, biological interpretability and treatment relevance are limited if the derived subtypes are not associated with genetic drivers or susceptibility factors. Herein, we describe Gene-SGAN - a multi-view, weakly-supervised deep clustering method - which dissects disease heterogeneity by jointly considering phenotypic and genetic data, thereby conferring genetic correlations to the disease subtypes and associated endophenotypic signatures. We first validate the generalizability, interpretability, and robustness of Gene-SGAN in semi-synthetic experiments. We then demonstrate its application to real multi-site datasets from 28,858 individuals, deriving subtypes of Alzheimer's disease and brain endophenotypes associated with hypertension, from MRI and SNP data. Derived brain phenotypes displayed significant differences in neuroanatomical patterns, genetic determinants, biological and clinical biomarkers, indicating potentially distinct underlying neuropathologic processes, genetic drivers, and susceptibility factors. Overall, Gene-SGAN is broadly applicable to disease subtyping and endophenotype discovery, and is herein tested on disease-related, genetically-driven neuroimaging phenotypes.  ( 2 min )
    Quantum anomaly detection in the latent space of proton collision events at the LHC. (arXiv:2301.10780v1 [quant-ph])
    We propose a new strategy for anomaly detection at the LHC based on unsupervised quantum machine learning algorithms. To accommodate the constraints on the problem size dictated by the limitations of current quantum hardware we develop a classical convolutional autoencoder. The designed quantum anomaly detection models, namely an unsupervised kernel machine and two clustering algorithms, are trained to find new-physics events in the latent representation of LHC data produced by the autoencoder. The performance of the quantum algorithms is benchmarked against classical counterparts on different new-physics scenarios and its dependence on the dimensionality of the latent space and the size of the training dataset is studied. For kernel-based anomaly detection, we identify a regime where the quantum model significantly outperforms its classical counterpart. An instance of the kernel machine is implemented on a quantum computer to verify its suitability for available hardware. We demonstrate that the observed consistent performance advantage is related to the inherent quantum properties of the circuit used.  ( 2 min )
    Evaluating Probabilistic Classifiers: The Triptych. (arXiv:2301.10803v1 [stat.ME])
    Probability forecasts for binary outcomes, often referred to as probabilistic classifiers or confidence scores, are ubiquitous in science and society, and methods for evaluating and comparing them are in great demand. We propose and study a triptych of diagnostic graphics that focus on distinct and complementary aspects of forecast performance: The reliability diagram addresses calibration, the receiver operating characteristic (ROC) curve diagnoses discrimination ability, and the Murphy diagram visualizes overall predictive performance and value. A Murphy curve shows a forecast's mean elementary scores, including the widely used misclassification rate, and the area under a Murphy curve equals the mean Brier score. For a calibrated forecast, the reliability curve lies on the diagonal, and for competing calibrated forecasts, the ROC and Murphy curves share the same number of crossing points. We invoke the recently developed CORP (Consistent, Optimally binned, Reproducible, and Pool-Adjacent-Violators (PAV) algorithm based) approach to craft reliability diagrams and decompose a mean score into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components. Plots of the DSC measure of discrimination ability versus the calibration metric MCB visualize classifier performance across multiple competitors. The proposed tools are illustrated in empirical examples from astrophysics, economics, and social science.  ( 2 min )
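    The score decomposition mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the Brier score as the mean score, recalibrates the forecasts with a hand-written pool-adjacent-violators (PAV) fit, and uses the identity mean score = MCB - DSC + UNC. The function names (`pav`, `brier`, `corp_decomposition`) and the toy data are hypothetical.

```python
import numpy as np

def pav(sums, counts):
    """Pool-adjacent-violators: weighted nondecreasing least-squares fit.

    sums[k] and counts[k] are the outcome sum and sample count for the k-th
    distinct forecast value, in increasing forecast order.
    """
    blocks = []  # each block: [outcome sum, sample count, number of groups]
    for s, c in zip(sums, counts):
        blocks.append([float(s), int(c), 1])
        # Merge while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s2, c2, g2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] += g2
    return np.concatenate([np.full(g, s / c) for s, c, g in blocks])

def brier(p, y):
    """Mean Brier score of forecasts p for binary outcomes y."""
    return float(np.mean((np.asarray(p, float) - np.asarray(y, float)) ** 2))

def corp_decomposition(p, y):
    """Return (MCB, DSC, UNC) for the mean Brier score."""
    p = np.asarray(p, float)
    y = np.asarray(y, float)
    # Pool tied forecasts first, then run PAV over the distinct values.
    _, inverse, counts = np.unique(p, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=y)
    recalibrated = pav(sums, counts)[inverse]
    reference = np.full_like(p, y.mean())  # constant climatological forecast
    mcb = brier(p, y) - brier(recalibrated, y)
    dsc = brier(reference, y) - brier(recalibrated, y)
    unc = brier(reference, y)
    return mcb, dsc, unc

# Toy data (hypothetical): four forecasts and their binary outcomes.
p_demo = np.array([0.1, 0.4, 0.35, 0.8])
y_demo = np.array([0.0, 0.0, 1.0, 1.0])
mcb, dsc, unc = corp_decomposition(p_demo, y_demo)
```

    Plotting `dsc` against `mcb` for several classifiers gives the kind of MCB-DSC comparison the abstract describes; the recalibrated values also trace the reliability curve.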
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem's learning directions and associated learning speeds, converge to the spectrum of the graphon NTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )
    Generative Tertiary Structure-based RNA Design. (arXiv:2301.10774v1 [q-bio.BM])
    Learning from 3D biological macromolecules with artificial intelligence technologies is an emerging area. Computational protein design, known as the inverse of protein structure prediction, aims to generate protein sequences that will fold into a defined structure. Analogous to protein design, RNA design is also an important topic in synthetic biology, which aims to generate RNA sequences for given structures. However, existing RNA design methods mainly focus on the secondary structure, ignoring the informative tertiary structure, which is commonly used in protein design. To explore the complex coupling between RNA sequence and 3D structure, we introduce an RNA tertiary structure modeling method to efficiently capture useful information from the 3D structure of RNA. For a fair comparison, we collect abundant RNA data and split the data according to tertiary structures. With the standard dataset, we conduct a benchmark by employing structure-based protein design approaches with our RNA tertiary structure modeling method. We believe our work will stimulate the future development of tertiary structure-based RNA design and bridge the gap between the RNA 3D structures and sequences.  ( 2 min )
  • Open

    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v3 [cs.LG] UPDATED)
    The two key characteristics of a normalizing flow are that it is invertible (in particular, dimension preserving) and that it monitors the amount by which it changes the likelihood of data points as samples are propagated along the network. Recently, multiple generalizations of normalizing flows have been introduced that relax these two conditions. On the other hand, neural networks only perform a forward pass on the input: there is neither a notion of an inverse of a neural network nor one of its likelihood contribution. In this paper we argue that certain neural network architectures can be enriched with a stochastic inverse pass and that their likelihood contribution can be monitored in such a way that they fall under the generalized notion of a normalizing flow mentioned above. We term this enrichment flowification. We prove that neural networks only containing linear layers, convolutional layers and invertible activations such as LeakyReLU can be flowified and evaluate them in the generative setting on image datasets.  ( 2 min )
    Smoothed Online Learning for Prediction in Piecewise Affine Systems. (arXiv:2301.11187v1 [stat.ML])
    The problem of piecewise affine (PWA) regression and planning is of foundational importance to the study of online learning, control, and robotics, where it provides a theoretically and empirically tractable setting to study systems undergoing sharp changes in the dynamics. Unfortunately, due to the discontinuities that arise when crossing into different "pieces," learning in general sequential settings is impossible and practical algorithms are forced to resort to heuristic approaches. This paper builds on the recently developed smoothed online learning framework and provides the first algorithms for prediction and simulation in PWA systems whose regret is polynomial in all relevant problem parameters under a weak smoothness assumption; moreover, our algorithms are efficient in the number of calls to an optimization oracle. We further apply our results to the problems of one-step prediction and multi-step simulation regret in piecewise affine dynamical systems, where the learner is tasked with simulating trajectories and regret is measured in terms of the Wasserstein distance between simulated and true data. Along the way, we develop several technical tools of more general interest.  ( 2 min )
    Two-step interpretable modeling of Intensive Care Acquired Infections. (arXiv:2301.11146v1 [stat.AP])
    We present a novel methodology for integrating high resolution longitudinal data with the dynamic prediction capabilities of survival models. The aim is two-fold: to improve the predictive power while maintaining interpretability of the models. To go beyond the black box paradigm of artificial neural networks, we propose a parsimonious and robust semi-parametric approach (i.e., a landmarking competing risks model) that combines routinely collected low-resolution data with predictive features extracted from a convolutional neural network, that was trained on high resolution time-dependent information. We then use saliency maps to analyze and explain the extra predictive power of this model. To illustrate our methodology, we focus on healthcare-associated infections in patients admitted to an intensive care unit.  ( 2 min )
    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons. (arXiv:2301.11270v1 [cs.LG])
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and Max Entropy Inverse Reinforcement Learning, and provide the first sample complexity bound for both problems.  ( 2 min )
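    To make the maximum likelihood estimator under the Bradley-Terry-Luce model concrete, here is a minimal sketch for pairwise comparisons with a linear reward. It covers only the plain MLE, not the pessimistic variant or the $K$-wise Plackett-Luce case discussed in the abstract. The function name `fit_btl_mle`, the synthetic data, and the step size are assumptions made for illustration.

```python
import numpy as np

def fit_btl_mle(diffs, prefs, lr=0.5, steps=500):
    """Gradient ascent on the average BTL log-likelihood with a linear reward.

    diffs[i] = phi(a_i) - phi(b_i) is the feature difference of the i-th pair;
    prefs[i] = 1.0 if a_i was preferred over b_i, else 0.0.
    Under BTL, P(a preferred over b) = sigmoid(theta . (phi(a) - phi(b))).
    """
    theta = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diffs @ theta))
        theta += lr * diffs.T @ (prefs - p) / len(prefs)
    return theta

# Synthetic comparisons drawn from a known linear reward (hypothetical data).
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -0.5, 0.25])
diffs = rng.normal(size=(2000, 3))
win_prob = 1.0 / (1.0 + np.exp(-diffs @ theta_true))
prefs = (rng.random(2000) < win_prob).astype(float)

theta_hat = fit_btl_mle(diffs, prefs)
cosine = float(theta_hat @ theta_true
               / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_true)))
```

    The estimate is mathematically a logistic regression without intercept on the feature differences, which is why off-the-shelf solvers can also be used for this step.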
    WL meet VC. (arXiv:2301.11039v1 [cs.LG])
    Recently, many works studied the expressive power of graph neural networks (GNNs) by linking it to the $1$-dimensional Weisfeiler--Leman algorithm ($1\text{-}\mathsf{WL}$). Here, the $1\text{-}\mathsf{WL}$ is a well-studied heuristic for the graph isomorphism problem, which iteratively colors or partitions a graph's vertex set. While this connection has led to significant advances in understanding and enhancing GNNs' expressive power, it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. In this paper, we study GNNs' generalization ability through the lens of Vapnik--Chervonenkis (VC) dimension theory in two settings, focusing on graph-level predictions. First, when no upper bound on the graphs' order is known, we show that the bitlength of GNNs' weights tightly bounds their VC dimension. Further, we derive an upper bound for GNNs' VC dimension using the number of colors produced by the $1\text{-}\mathsf{WL}$. Secondly, when an upper bound on the graphs' order is known, we show a tight connection between the number of graphs distinguishable by the $1\text{-}\mathsf{WL}$ and GNNs' VC dimension. Our empirical study confirms the validity of our theoretical findings.  ( 2 min )
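    The colour refinement performed by the $1$-dimensional Weisfeiler--Leman algorithm, and the colour count that the abstract uses in its bounds, can be sketched as follows. This is a plain illustrative implementation on adjacency lists; the function name `wl_refine` and the example graphs are assumptions.

```python
def wl_refine(adj):
    """1-WL colour refinement.

    adj maps each node to a list of its neighbours. Returns a dict mapping each
    node to its colour id in the stable colouring.
    """
    colors = {v: 0 for v in adj}  # start from a uniform colouring
    for _ in range(len(adj)):
        # A node's signature is its own colour plus the multiset of neighbour colours.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        refined = {v: palette[sigs[v]] for v in adj}
        if len(set(refined.values())) == len(set(colors.values())):
            break  # the partition did not get finer, so it is stable
        colors = refined
    return colors

# A 6-cycle and two disjoint triangles, refined jointly so colour ids are comparable.
cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
triangles = {6: [7, 8], 7: [6, 8], 8: [6, 7], 9: [10, 11], 10: [9, 11], 11: [9, 10]}
union_colors = wl_refine({**cycle6, **triangles})

# A path on three nodes: the endpoints share a colour, the middle node differs.
path_colors = wl_refine({0: [1], 1: [0, 2], 2: [1]})
```

    In the first example every node ends with the same colour, so 1-WL cannot tell the 6-cycle from the two triangles; this is the standard example of two non-isomorphic graphs it fails to distinguish.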
    Extending Adversarial Attacks to Produce Adversarial Class Probability Distributions. (arXiv:2004.06383v3 [cs.LG] UPDATED)
    Despite the remarkable performance and generalization levels of deep learning models in a wide range of artificial intelligence tasks, it has been demonstrated that these models can be easily fooled by the addition of imperceptible yet malicious perturbations to natural inputs. These altered inputs are known in the literature as adversarial examples. In this paper, we propose a novel probabilistic framework to generalize and extend adversarial attacks in order to produce a desired probability distribution for the classes when we apply the attack method to a large number of inputs. This novel attack paradigm provides the adversary with greater control over the target model, thereby exposing, in a wide range of scenarios, threats against deep learning models that cannot be conducted by the conventional paradigms. We introduce four different strategies to efficiently generate such attacks, and illustrate our approach by extending multiple adversarial attack algorithms. We also experimentally validate our approach for the spoken command classification task and the Tweet emotion classification task, two exemplary machine learning problems in the audio and text domain, respectively. Our results demonstrate that we can closely approximate any probability distribution for the classes while maintaining a high fooling rate and even prevent the attacks from being detected by label-shift detection methods.  ( 2 min )
    Maximum Optimality Margin: A Unified Approach for Contextual Linear Programming and Inverse Linear Programming. (arXiv:2301.11260v1 [cs.LG])
    In this paper, we study the predict-then-optimize problem where the output of a machine learning prediction task is used as the input of some downstream optimization problem, say, the objective coefficient vector of a linear program. The problem is also known as predictive analytics or contextual linear programming. The existing approaches largely suffer from either (i) optimization intractability (a non-convex objective function)/statistical inefficiency (a suboptimal generalization bound) or (ii) requiring strong condition(s) such as no constraint or loss calibration. We develop a new approach to the problem called maximum optimality margin which designs the machine learning loss function by the optimality condition of the downstream optimization. The max-margin formulation enjoys both computational efficiency and good theoretical properties for the learning procedure. More importantly, our new approach only needs the observations of the optimal solution in the training data rather than the objective function, which makes it a new and natural approach to the inverse linear programming problem under both contextual and context-free settings; we also analyze the proposed method under both offline and online settings, and demonstrate its performance using numerical experiments.  ( 2 min )
    Inspecting class hierarchies in classification-based metric learning models. (arXiv:2301.11065v1 [cs.LG])
    Most classification models treat all misclassifications equally. However, different classes may be related, and these hierarchical relationships must be considered in some classification problems. These problems can be addressed by using hierarchical information during training. Unfortunately, this information is not available for all datasets. Many classification-based metric learning methods use class representatives in embedding space to represent different classes. The relationships among the learned class representatives can then be used to estimate class hierarchical structures. If we have a predefined class hierarchy, the learned class representatives can be assessed to determine whether the metric learning model learned semantic distances that match our prior knowledge. In this work, we train a softmax classifier and three metric learning models with several training options on benchmark and real-world datasets. In addition to the standard classification accuracy, we evaluate the hierarchical inference performance by inspecting learned class representatives and the hierarchy-informed performance, i.e., the classification performance, and the metric learning performance by considering predefined hierarchical structures. Furthermore, we investigate how the considered measures are affected by various models and training options. When our proposed ProxyDR model is trained without using predefined hierarchical structures, the hierarchical inference performance is significantly better than that of the popular NormFace model. Additionally, our model enhances some hierarchy-informed performance measures under the same training options. We also found that convolutional neural networks (CNNs) with random weights correspond to the predefined hierarchies better than random chance.  ( 2 min )
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v2 [stat.ML] UPDATED)
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence." We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method "distributional evidence" as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which one interpretation is defined as "correct." We then compare inference results from each different interpretation illustrating the importance of careful consideration of uncertain evidence.  ( 2 min )
    Your diffusion model secretly knows the dimension of the data manifold. (arXiv:2212.12611v2 [cs.LG] UPDATED)
    In this work, we propose a novel framework for estimating the dimension of the data manifold using a trained diffusion model. A diffusion model approximates the score function, i.e., the gradient of the log density of a noise-corrupted version of the target distribution, for varying levels of corruption. If the data concentrates around a manifold embedded in the high-dimensional ambient space, then as the level of corruption decreases, the score function points towards the manifold, as this direction becomes the direction of maximal likelihood increase. Therefore, for small levels of corruption, the diffusion model provides us with access to an approximation of the normal bundle of the data manifold. This allows us to estimate the dimension of the tangent space and thus the intrinsic dimension of the data manifold. To the best of our knowledge, our method is the first deep-learning-based estimator of the data manifold dimension, and it outperforms well-established statistical estimators in controlled experiments on both Euclidean and image data.  ( 2 min )
    Neural Continuous-Discrete State Space Models for Irregularly-Sampled Time Series. (arXiv:2301.11308v1 [cs.LG])
    Learning accurate predictive models of real-world dynamic phenomena (e.g., climate, biological) remains a challenging task. One key issue is that the data generated by both natural and artificial processes often comprise time series that are irregularly sampled and/or contain missing observations. In this work, we propose the Neural Continuous-Discrete State Space Model (NCDSSM) for continuous-time modeling of time series through discrete-time observations. NCDSSM employs auxiliary variables to disentangle recognition from dynamics, thus requiring amortized inference only for the auxiliary variables. Leveraging techniques from continuous-discrete filtering theory, we demonstrate how to perform accurate Bayesian inference for the dynamic states. We propose three flexible parameterizations of the latent dynamics and an efficient training objective that marginalizes the dynamic states during inference. Empirical results on multiple benchmark datasets across various domains show improved imputation and forecasting performance of NCDSSM over existing models.  ( 2 min )
    Causal Inference with Hidden Mediators. (arXiv:2111.02927v2 [math.ST] UPDATED)
    Proximal causal inference was recently proposed as a framework to identify causal effects from observational data in the presence of hidden confounders for which proxies are available. In this paper, we extend the proximal causal inference approach to settings where identification of causal effects hinges upon a set of mediators which are not observed, yet error-prone proxies of the hidden mediators are measured. Specifically, (i) we establish causal hidden mediation analysis, which extends classical causal mediation analysis methods for identifying natural direct and indirect effects under no unmeasured confounding to a setting where the mediator of interest is hidden, but proxies of it are available. (ii) We establish the hidden front-door criterion, which extends the classical front-door criterion to allow for hidden mediators for which proxies are available. (iii) We show that the identification of a certain causal effect called the population intervention indirect effect remains possible with hidden mediators in settings where challenges in (i) and (ii) might co-exist. We view (i)-(iii) as important steps towards the practical application of front-door criteria and mediation analysis, as mediators are almost always measured with error and thus the most one can hope for in practice is that the measurements are at best proxies of mediating mechanisms. We propose identification approaches for the parameters of interest in our considered models. For the estimation aspect, we propose an influence function-based estimation method and provide an analysis of the robustness of the estimators.  ( 2 min )
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v5 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that the ICK framework can be used to incorporate prior information into neural networks in many applications.  ( 2 min )
    On the Dissipation of Ideal Hamiltonian Monte Carlo Sampler. (arXiv:2209.07438v2 [stat.CO] UPDATED)
    We report on what seems to be an intriguing connection between variable integration time and partial velocity refreshment of Ideal Hamiltonian Monte Carlo samplers, both of which can be used for reducing the dissipative behavior of the dynamics. More concretely, we show that on quadratic potentials, efficiency can be improved through these means by a $\sqrt{\kappa}$ factor in Wasserstein-2 distance, compared to classical constant integration time, fully refreshed HMC. We additionally explore the benefit of randomized integrators for simulating the Hamiltonian dynamics under higher order regularity conditions.  ( 2 min )
    Conformal Prediction for Trustworthy Detection of Railway Signals. (arXiv:2301.11136v1 [stat.ML])
    We present an application of conformal prediction, a form of uncertainty quantification with guarantees, to the detection of railway signals. State-of-the-art architectures are tested and the most promising one undergoes the process of conformalization, where a correction is applied to the predicted bounding boxes (i.e. to their height and width) such that they comply with a predefined probability of success. We work with a novel exploratory dataset of images taken from the perspective of a train operator, as a first step to build and validate future trustworthy machine learning models for the detection of railway signals.  ( 2 min )
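    The conformalization step described above can be illustrated with split conformal prediction. The sketch below is an assumption-laden simplification, not the paper's procedure: it uses a single additive margin for all four box sides, where the calibration score of each image is the largest amount by which the true box sticks out of the predicted one. The names `conformal_margin` and `conformalize_box` are hypothetical.

```python
import math

def conformal_margin(cal_scores, alpha):
    """Split-conformal quantile of the calibration scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity when
    the calibration set is too small for the requested level.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]

def conformalize_box(box, margin):
    """Enlarge a predicted box (x_min, y_min, x_max, y_max) by margin on every side."""
    x_min, y_min, x_max, y_max = box
    return (x_min - margin, y_min - margin, x_max + margin, y_max + margin)

# Hypothetical calibration scores, in pixels.
demo_margin = conformal_margin([1, 2, 3, 4, 5, 6, 7, 8, 9], alpha=0.5)
demo_box = conformalize_box((10, 10, 20, 30), 2)
```

    Under exchangeability of calibration and test images, the enlarged box contains the true box with probability at least 1 - alpha on average over draws; the guarantee is marginal, not per image.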
    Scale-Free Adversarial Multi-Armed Bandit with Arbitrary Feedback Delays. (arXiv:2110.13400v3 [cs.LG] UPDATED)
    We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem with unrestricted feedback delays. In contrast to the standard assumption that all losses are $[0,1]$-bounded, in our setting, losses can fall in a general bounded interval $[-L, L]$, unknown to the agent beforehand. Furthermore, the feedback of each arm pull can experience arbitrary delays. We propose an approach named Scale-Free Delayed INF (SFD-INF) for this novel setting, which combines a recent "convex combination trick" with a novel doubling and skipping technique. We then present two instances of SFD-INF, each with carefully designed delay-adapted learning scales. The first one, SFD-TINF, uses a $\frac 12$-Tsallis entropy regularizer and can achieve $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret when the losses are non-negative, where $K$ is the number of actions, $T$ is the number of steps, and $D$ is the total feedback delay. This bound nearly matches the $\Omega((\sqrt{KT}+\sqrt{D\log K})L)$ lower bound when regarding $K$ as a constant independent of $T$. The second one, SFD-LBINF, works for general scale-free losses and achieves a small-loss style adaptive regret bound $\widetilde{\mathcal O}(\sqrt{K\mathbb{E}[\tilde{\mathfrak L}_T^2]}+\sqrt{KDL})$, which falls to the $\widetilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret in the worst case and is thus more general than SFD-TINF despite a more complicated analysis and several extra logarithmic dependencies. Moreover, both instances also outperform the existing algorithms for non-delayed (i.e., $D=0$) scale-free adversarial MAB problems, which can be of independent interest.  ( 2 min )
    Banker Online Mirror Descent. (arXiv:2106.08943v2 [cs.LG] UPDATED)
    We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.  ( 2 min )
    Proximal Causal Learning of Heterogeneous Treatment Effects. (arXiv:2301.10913v1 [stat.ML])
    Efficiently and flexibly estimating treatment effect heterogeneity is an important task in a wide variety of settings ranging from medicine to marketing, and there are a considerable number of promising conditional average treatment effect estimators currently available. These, however, typically rely on the assumption that the measured covariates are enough to justify conditional exchangeability. We propose the P-learner, motivated by the R-learner, a tailored two-stage loss function for learning heterogeneous treatment effects in settings where exchangeability given observed covariates is an implausible assumption, and we wish to rely on proxy variables for causal inference. Our proposed estimator can be implemented by off-the-shelf loss-minimizing machine learning methods, which in the case of kernel regression satisfies an oracle bound on the estimated error as long as the nuisance components are estimated reasonably well.  ( 2 min )
    KSD Aggregated Goodness-of-fit Test. (arXiv:2202.00824v5 [stat.ML] UPDATED)
    We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide non-asymptotic guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded model score function, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.  ( 2 min )
    Coin Sampling: Gradient-Based Bayesian Inference without Learning Rates. (arXiv:2301.11294v1 [stat.ML])
    In recent years, particle-based variational inference (ParVI) methods such as Stein variational gradient descent (SVGD) have grown in popularity as scalable methods for Bayesian inference. Unfortunately, the properties of such methods invariably depend on hyperparameters such as the learning rate, which must be carefully tuned by the practitioner in order to ensure convergence to the target measure at a suitable rate. In this paper, we introduce a suite of new particle-based methods for scalable Bayesian inference based on coin betting, which are entirely learning-rate free. We illustrate the performance of our approach on a range of numerical examples, including several high-dimensional models and datasets, demonstrating comparable performance to other ParVI algorithms.  ( 2 min )
    Re-embedding data to strengthen recovery guarantees of clustering. (arXiv:2301.10901v1 [cs.LG])
    We propose a clustering method that involves chaining four known techniques into a pipeline yielding an algorithm with stronger recovery guarantees than any of the four components separately. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering, yielding an $n\times n$ distance matrix. The leapfrog distances are then translated to new embeddings using multidimensional scaling and spectral methods, two other known techniques, yielding new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'$ satisfies $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. Therefore, we establish that the re-embedding improves recovery for SON clustering, since SON clustering is a well-studied method that already has provable guarantees.  ( 2 min )
    Granger Causal Chain Discovery for Sepsis-Associated Derangements via Multivariate Hawkes Processes. (arXiv:2209.04480v4 [stat.AP] UPDATED)
    Modern health care systems are conducting continuous, automated surveillance of the electronic medical record (EMR) to identify adverse events with increasing frequency; however, many events such as sepsis do not have elucidated prodromes (i.e., event chains) that can be used to identify and intercept the adverse event early in its course. Currently, no reliable framework exists for discovering or describing causal chains that precede adverse hospital events. Clinically relevant and interpretable results require a framework that can (1) infer temporal interactions across multiple patient features found in EMR data (e.g., labs, vital signs, etc.) and (2) identify patterns that precede and are specific to an impending adverse event (e.g., sepsis). In this work, we propose a linear multivariate Hawkes process model, coupled with a ReLU link function, to recover a Granger Causal (GC) graph with both exciting and inhibiting effects. We develop a scalable two-phase gradient-based method to maximize a surrogate-likelihood and estimate the problem parameters, which is shown to be effective via extensive numerical simulation. Our method is subsequently extended to a data set of patients admitted to an academic level 1 trauma center located in Atlanta, GA, where the estimated GC graph identifies several highly interpretable chains that precede sepsis. Here, we demonstrate the effectiveness of our approach in learning a GC graph over Sepsis Associated Derangements (SADs), but it can be generalized to other applications with similar requirements.  ( 2 min )
    Efficient Aggregated Kernel Tests using Incomplete $U$-statistics. (arXiv:2206.09194v3 [stat.ML] UPDATED)
    We propose a series of computationally efficient nonparametric tests for the two-sample, independence, and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over the more widespread permutation-based approach, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, the linear-time versions of our proposed tests perform at least as well as the current linear-time state-of-the-art tests.  ( 2 min )
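    The cost trade-off described above can be illustrated with a minimal sketch of an incomplete $U$-statistic estimate of the squared MMD. This is not the authors' MMDAggInc: it uses a single Gaussian bandwidth, one-dimensional data, paired samples of equal size, and no wild bootstrap or aggregation. The function names and the "neighbour" pair design are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_core(x_i, x_j, y_i, y_j, bandwidth=1.0):
    """Core function h of the MMD U-statistic, evaluated on one index pair."""
    k = gaussian_kernel
    return (k(x_i, x_j, bandwidth) + k(y_i, y_j, bandwidth)
            - k(x_i, y_j, bandwidth) - k(x_j, y_i, bandwidth))


def incomplete_mmd(xs, ys, pairs, bandwidth=1.0):
    """Average the core function over a chosen subset of index pairs."""
    total = sum(mmd_core(xs[i], xs[j], ys[i], ys[j], bandwidth) for i, j in pairs)
    return total / len(pairs)


def all_pairs(n):
    """Complete design: every ordered pair i != j (quadratic cost)."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def neighbour_pairs(n, r):
    """Hypothetical sparse design: pair i with i+1, ..., i+r (cost linear in n)."""
    return [(i, i + d) for d in range(1, r + 1) for i in range(n - d)]


rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [rng.gauss(2.0, 1.0) for _ in range(200)]
print("linear-size design:", incomplete_mmd(xs, ys, neighbour_pairs(200, 3)))
print("complete design:   ", incomplete_mmd(xs, ys, all_pairs(200)))
```

    Growing `r` moves the estimate from the linear-time regime towards the complete quadratic-time statistic, which is the interpolation the abstract refers to.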
    Bayesian Detection of Mesoscale Structures in Pathway Data on Graphs. (arXiv:2301.11120v1 [stat.ME])
    Mesoscale structures are an integral part of the abstraction and analysis of complex systems. They reveal a node's function in the network, and facilitate our understanding of the network dynamics. For example, they can represent communities in social or citation networks, roles in corporate interactions, or core-periphery structures in transportation networks. We usually detect mesoscale structures under the assumption of independence of interactions. Still, in many cases, the interactions invalidate this assumption by occurring in a specific order. Such patterns emerge in pathway data; to capture them, we have to model the dependencies between interactions using higher-order network models. However, the detection of mesoscale structures in higher-order networks is still under-researched. In this work, we derive a Bayesian approach that simultaneously models the optimal partitioning of nodes in groups and the optimal higher-order network dynamics between the groups. In synthetic data we demonstrate that our method can recover both standard proximity-based communities and role-based groupings of nodes. In synthetic and real world data we show that it can compete with baseline techniques, while additionally providing interpretable abstractions of network dynamics.  ( 2 min )
    Minimax estimation of discontinuous optimal transport maps: The semi-discrete case. (arXiv:2301.11302v1 [math.ST])
    We consider the problem of estimating the optimal transport map between two probability distributions, $P$ and $Q$ in $\mathbb R^d$, on the basis of i.i.d. samples. All existing statistical analyses of this problem require the assumption that the transport map is Lipschitz, a strong requirement that, in particular, excludes any examples where the transport map is discontinuous. As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution $Q$ is a discrete measure supported on a finite number of points in $\mathbb R^d$. We study a computationally efficient estimator initially proposed by Pooladian and Niles-Weed (2021), based on entropic optimal transport, and show in the semi-discrete setting that it converges at the minimax-optimal rate $n^{-1/2}$, independent of dimension. Other standard map estimation techniques both lack finite-sample guarantees in this setting and provably suffer from the curse of dimensionality. We confirm these results in numerical experiments, and provide experiments for other settings, not covered by our theory, which indicate that the entropic estimator is a promising methodology for other discontinuous transport map estimation problems.  ( 2 min )
    Returning The Favour: When Regression Benefits From Probabilistic Causal Knowledge. (arXiv:2301.11214v1 [stat.ML])
    A directed acyclic graph (DAG) provides valuable prior knowledge that is often discarded in regression tasks in machine learning. We show that the independences arising from the presence of collider structures in DAGs provide meaningful inductive biases, which constrain the regression hypothesis space and improve predictive performance. We introduce collider regression, a framework to incorporate probabilistic causal knowledge from a collider in a regression problem. When the hypothesis space is a reproducing kernel Hilbert space, we prove a strictly positive generalisation benefit under mild assumptions and provide closed-form estimators of the empirical risk minimiser. Experiments on synthetic and climate model data demonstrate performance gains of the proposed methodology.  ( 2 min )
    Learning from Mistakes: Self-Regularizing Hierarchical Semantic Representations in Point Cloud Segmentation. (arXiv:2301.11145v1 [cs.CV])
    Recent advances in autonomous robotic technologies have highlighted the growing need for precise environmental analysis. LiDAR semantic segmentation has gained attention to accomplish fine-grained scene understanding by acting directly on raw content provided by sensors. Recent solutions showed how different learning techniques can be used to improve the performance of the model, without any architectural or dataset change. Following this trend, we present a coarse-to-fine setup that LEArns from classification mistaKes (LEAK) derived from a standard model. First, classes are clustered into macro groups according to mutual prediction errors; then, the learning process is regularized by: (1) aligning class-conditional prototypical feature representation for both fine and coarse classes, (2) weighting instances with a per-class fairness index. Our LEAK approach is very general and can be seamlessly applied on top of any segmentation architecture; indeed, experimental results showed that it enables state-of-the-art performances on different architectures, datasets and tasks, while ensuring more balanced class-wise results and faster convergence.  ( 2 min )
    Learning Large Scale Sparse Models. (arXiv:2301.10958v1 [stat.ML])
    In this work, we consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions. Two immediate issues occur under such a challenging scenario: (i) computational cost; (ii) memory overhead. In particular, the memory issue precludes a large volume of prior algorithms that are based on batch optimization techniques. To remedy the problem, we propose to learn sparse models such as Lasso in an online manner where in each iteration, only one randomly chosen sample is revealed to update a sparse iterate. Thereby, the memory cost is independent of the sample size and gradient evaluation for one sample is efficient. Perhaps surprisingly, we find that with the same parameter, sparsity promoted by batch methods is not preserved in the online setting. We analyze this phenomenon and illustrate some effective variants including mini-batch methods and a hard thresholding based stochastic gradient algorithm. Extensive experiments are carried out on a public dataset which supports our findings and algorithms.  ( 2 min )
    On the inconsistency of separable losses for structured prediction. (arXiv:2301.10810v1 [cs.LG])
    In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, or, in other words, minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.  ( 2 min )
    simple diffusion: End-to-end diffusion for high resolution images. (arXiv:2301.11093v1 [cs.CV])
    Currently, applying diffusion models in pixel space of high resolution images is difficult. Instead, existing approaches focus on diffusion in lower dimensional spaces (latent diffusion), or have multiple super-resolution levels of generation referred to as cascades. The downside is that these approaches add complexity to the diffusion framework. This paper aims to improve denoising diffusion for high resolution images while keeping the model as simple as possible. The paper is centered around the research question: How can one train standard denoising diffusion models on high resolution images, and still obtain performance comparable to these alternate approaches? The four main findings are: 1) the noise schedule should be adjusted for high resolution images, 2) it is sufficient to scale only a particular part of the architecture, 3) dropout should be added at specific locations in the architecture, and 4) downsampling is an effective strategy to avoid high resolution feature maps. Combining these simple yet effective techniques, we achieve state-of-the-art on image generation among diffusion models without sampling modifiers on ImageNet.  ( 2 min )
    Graph Encoder Ensemble for Simultaneous Vertex Embedding and Community Detection. (arXiv:2301.11290v1 [cs.SI])
    In this paper we propose a novel and computationally efficient method to simultaneously achieve vertex embedding, community detection, and community size determination. By utilizing a normalized one-hot graph encoder and a new rank-based cluster size measure, the proposed graph encoder ensemble algorithm achieves excellent numerical performance throughout a variety of simulations and real data experiments.  ( 2 min )
    Random Grid Neural Processes for Parametric Partial Differential Equations. (arXiv:2301.11040v1 [cs.LG])
    We introduce a new class of spatially stochastic physics and data informed deep latent models for parametric partial differential equations (PDEs) which operate through scalable variational neural processes. We achieve this by assigning probability measures to the spatial domain, which allows us to treat collocation grids probabilistically as random variables to be marginalised out. Adapting this spatial statistics view, we solve forward and inverse problems for parametric PDEs in a way that leads to the construction of Gaussian process models of solution fields. The implementation of these random grids poses a unique set of challenges for inverse physics informed deep learning frameworks and we propose a new architecture called Grid Invariant Convolutional Networks (GICNets) to overcome these challenges. We further show how to incorporate noisy data in a principled manner into our physics informed model to improve predictions for problems where data may be available but whose measurement location does not coincide with any fixed mesh or grid. The proposed method is tested on a nonlinear Poisson problem, Burgers equation, and Navier-Stokes equations, and we provide extensive numerical comparisons. We demonstrate significant computational advantages over current physics informed neural learning methods for parametric PDEs while improving the predictive capabilities and flexibility of these models.  ( 2 min )
    On the Global Convergence of Risk-Averse Policy Gradient Methods with Dynamic Time-Consistent Risk Measures. (arXiv:2301.10932v1 [cs.LG])
    Risk-sensitive reinforcement learning (RL) has become a popular tool to control the risk of uncertain outcomes and ensure reliable performance in various sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear if these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we provide global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.  ( 2 min )
    Graph Neural Tangent Kernel: Convergence on Large Graphs. (arXiv:2301.10808v1 [cs.LG])
    Graph neural networks (GNNs) achieve remarkable performance in graph machine learning tasks but can be hard to train on large-graph data, where their learning dynamics are not well understood. We investigate the training dynamics of large-graph GNNs using graph neural tangent kernels (GNTKs) and graphons. In the limit of large width, optimization of an overparametrized NN is equivalent to kernel regression on the NTK. Here, we investigate how the GNTK evolves as another independent dimension is varied: the graph size. We use graphons to define limit objects -- graphon NNs for GNNs, and graphon NTKs for GNTKs, and prove that, on a sequence of growing graphs, the GNTKs converge to the graphon NTK. We further prove that the eigenspaces of the GNTK, which are related to the problem learning directions and associated learning speeds, converge to the spectrum of the GNTK. This implies that in the large-graph limit, the GNTK fitted on a graph of moderate size can be used to solve the same task on the large-graph and infer the learning dynamics of the large-graph GNN. These results are verified empirically on node regression and node classification tasks.  ( 2 min )

  • Open

    ChatGPT can definitely print Russian propaganda including why Prime Minister Justin Trudeau should be charged with treason despite its Wikipedia page
    submitted by /u/Robinsonc1988 [link] [comments]  ( 40 min )
    Text-To-4D Dynamic Scene Generation
    submitted by /u/bperki8 [link] [comments]  ( 40 min )
    Bright Eye: mobile AI app that generates art, code, poems, essays, short stories, answers questions, and more!
    Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’re one of the first mobile multipurpose AI apps. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA(Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature, storylines, answer questions and more, (think of chatgpt). We’d love to have some people try it out, give us feedback, and keep in touch with us. https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 41 min )
    We need AI to take over all creative mediums and hobbies
    Before you quick-draw shoot the messenger here, I want you to hear me out. This is coming from a person who enjoys drawing digital art and a person who used to make music for fun. Humans tend to be very selfish creatures, not in an evil sense, but rather a "phew, glad that was you and not me" sense, like if someone stepped in a pile of dog poop: that sucks for them, but it's not your problem. Once everybody's creative (not labor-intensive) livelihoods get taken, then yeah, I could see how AI replicating humanity could be a problem. Then maybe after that, people can enjoy not being replaced. That is all I have to say, god bless my reddit karma, I can feel this subreddit getting ready to dislike bomb this, but as long as this message gets out, that's all I want. submitted by /u/Zima_Re-L [link] [comments]  ( 41 min )
    AI Dream 150 - MY INCREDIBLE DREAM VISUALIZED BY AI - Part3 TEASER - AI ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Humanity May Reach Singularity Within Just 7 Years, Trend Shows
    submitted by /u/Tao_Dragon [link] [comments]  ( 40 min )
    VoiceGPT - ChatGPT Voice Assistant
    submitted by /u/nickbild [link] [comments]  ( 40 min )
    The Big Tech Royal Rumble for AI
    submitted by /u/foundersblock [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Google MusicLM turns language into music
    submitted by /u/Number_5_alive [link] [comments]  ( 40 min )
    Looking to Convert Photos to Video - Anything Out There?
    I want to be able to upload several photos (10 - 100) and have the AI generate a video using those photos. I've seen Genmo, but that only uses one photo at a time. Is there anything that may be able to do this? Thanks submitted by /u/venicerocco [link] [comments]  ( 40 min )
    Freaky A.I concept..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    fully AI made video i found on yt
    https://www.youtube.com/watch?v=Vw-t826JcDQ submitted by /u/Optimal_Studio_2050 [link] [comments]  ( 40 min )
    Is there any AI that is able to decipher the lyrics of a song?
    Specifically this one: https://www.youtube.com/watch?v=MFv7apjatwM&ab_channel=Lux-Topic If there is no current AI that is able to listen to a song and write down the lyrics accurately, then I provide this idea freely. submitted by /u/A_Very_Horny_Zed [link] [comments]  ( 40 min )
    📌[Searchcolab] Voice Cloning + Image Processing + Lip Syncing. Link in comments
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 40 min )
    Pix2Pix AI Model Inside Stable Diffusion Installation Guide!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    What will happen to the internet once ChatGPT disintermediates websites?
    If ChatGPT and AI search engines become the central place for acquiring knowledge, essentially scraping and disintermediating websites; what commercial incentive will many website owners have to generate new content once their ad revenue is gone? What will happen to the internet? submitted by /u/DelPrive235 [link] [comments]  ( 41 min )
    🧬 The Age of A.I., Longevity and Biotech - Is a "Synthetic biology singularity" coming?
    submitted by /u/BackgroundResult [link] [comments]  ( 41 min )
    Outsmarting AI Detection Tools: How to Make Your AI-Generated Content Fly Under the Radar
    submitted by /u/PapaDudu [link] [comments]  ( 43 min )
    Don’t Chat With ChatGPT: Amazon’s Warning To Employees
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    what is the best way to upscale comics? BSRGAN doesn't work well with text.
    submitted by /u/mhczbnoykrqvzazfth [link] [comments]  ( 41 min )
    📌[Searchcolab] Text-To-4D Dynamic Scene Generation.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    Why is AI better at visual art than music generation?
    I've just been wondering why people have been able to develop AI models such as GANs and more recently diffusion models that are so good at creating images but music generation has remained somewhat stagnant. From what I understand, there are two ways AI are generally trained on music: through MIDI files and using raw audio. The MIDI files are probably easier for the AI to work with but there's massive loss of relevant data that an AI would need to really become proficient in music generation. On the other hand, AIs trained on the raw audio tend to produce somewhat fuzzy audio that often includes what sounds like nearly distinguishable lyrics. I'm personally convinced the latter is the method that will be universally used in the future, but it clearly has a long way to go. So does anyone have any insights regarding why it's straggling a bit? Perhaps it's because software like Stable Diffusion are piggybacking off the image recognition advancements that have been in very high demand the past couple decades? Perhaps music is, in fact, a more complicated or more strict artform and our human minds are biased to think that these two tasks should be roughly equivalent in difficulty? It's certainly not due to a lack of data to train on, so maybe we just don't have a model that's really suited to analyzing and imitating music. submitted by /u/No-Phrase1116 [link] [comments]  ( 44 min )
  • Open

    MusicLM: Generating Music From Text
    submitted by /u/nickb [link] [comments]  ( 40 min )
    🚀Rodin: 3D Avatars Using Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
  • Open

    Pygame window does not close and kernel freezes
    I was rendering the 'MountainCar-v0' environment in "human" render_mode. When I call env.close(), my pygame window doesn't close automatically and I need to Force Quit it, and unfortunately my kernel dies. I'm using Jupyter Notebook. My versions are: PC: Mac OS Ventura 13.1 - Python: Python 3.9.9 - Jupyter: jupyter 1.0.0 - gym: 0.26.2 - ipykernel: 6.6.0 - ipython: 7.30.1. Please let me know how I can solve this issue. Thanks! submitted by /u/Character-Yellow-796 [link] [comments]  ( 41 min )
    Tuning hyperparameters in RL.
    Hello everyone. I have a question about hyperparameter tuning for PPO. Usually, when we tune them (using Optuna, for example), each set of hyperparameters is tested for a small number of steps (150000, for example), then we pick the ones that yielded the best reward. Yet, in the long run, the hyperparameters found to be best during tuning might not give the best asymptotic performance; other hyperparameters that performed worse during tuning might converge to a better result. Is there any way to overcome this issue and find the best hyperparameters without having to test for a large number of steps? Note: I'm using a custom environment with a continuous action space and image observations. submitted by /u/Many_Reception_4921 [link] [comments]  ( 42 min )
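    The concern in this post can be made concrete with a toy sketch. The two learning curves below are invented saturating exponentials, not PPO results, and the configuration names and constants are purely hypothetical; the point is only that a ranking taken at a short budget can flip at a long one.

```python
import math


def toy_reward(final_reward, time_constant, steps):
    """Saturating toy learning curve: approaches final_reward as steps grow."""
    return final_reward * (1.0 - math.exp(-steps / time_constant))


# Hypothetical configurations: (asymptotic reward, time constant in steps).
CONFIGS = {
    "A": (0.7, 50_000),    # learns fast, plateaus low
    "B": (1.0, 400_000),   # learns slowly, plateaus high
}


def best_config(steps):
    """Name of the configuration with the highest toy reward at this budget."""
    return max(CONFIGS, key=lambda name: toy_reward(*CONFIGS[name], steps))


for budget in (150_000, 3_000_000):
    print(budget, best_config(budget))
```

    With these made-up constants the short budget favours the fast-plateauing configuration and the long budget favours the other, which is exactly the failure mode the question describes.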
    Model-based hierarchical reinforcement learning
    Hi, do you know of papers that combine model-based and hierarchical reinforcement learning, where the lower level is also a model-based approach? I cannot find enough papers about it. submitted by /u/aika98oe [link] [comments]  ( 42 min )
  • Open

    [D] Monitoring and Retraining Models with Label-Changing Interventions
    When a trained ML model is deployed to predict an adverse event, the user might take steps to avoid that event. The user's actions may succeed or fail. In training with strictly observational data, a typical confusion matrix contains:
    ŷ=0, y=0 -> True Negative
    ŷ=0, y=1 -> False Negative
    ŷ=1, y=1 -> True Positive
    ŷ=1, y=0 -> False Positive
    When using the model, some results get confounded if the user acts based on the predictions:
    ŷ=0, y=0, a=0 -> True Negative, No Intervention
    ŷ=0, y=1, a=0 -> False Negative, No Intervention
    ŷ=1, y=1, a=1 -> True Positive, Failed Intervention
    ŷ=1, y=0, a=1 -> False Positive OR Successful Intervention
    Ignoring the possibility that the intervention caused the adverse event, the involvement of the user may lead to an increase in the number of perceived false positives. Continuous monitoring becomes difficult because the model appears to degrade faster than it really does. Furthermore, retraining the model in the future may be hindered by labels that do not accurately reflect the true values. One approach that has been proposed to me is to make sure there is always a hold-out set: allow some random records to get scores, but never act on them. This gives both a monitoring and a retraining dataset. Are there other solutions that people use here? I've found the papers below, but I cannot say that I completely understand how to practically implement them. Monitoring machine learning (ML)-based risk prediction algorithms in the presence of confounding medical interventions (https://arxiv.org/abs/2211.09781) Model updating after interventions paradoxically introduces bias (https://arxiv.org/abs/2010.11530) submitted by /u/waiting4omscs [link] [comments]  ( 43 min )
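    The hold-out idea from this post can be sketched as a small simulation. Everything here is an assumption made for illustration: the base rate, the model's recall and false-positive rate, the intervention success rate, and the function names are invented, and interventions are assumed never to cause the event.

```python
import random


def simulate(n, holdout_rate, base_rate, recall, fpr, success_rate, seed=0):
    """Return (acted, holdout) lists of (y_hat, y_observed) records."""
    rng = random.Random(seed)
    acted, holdout = [], []
    for _ in range(n):
        y_true = rng.random() < base_rate            # outcome if nobody intervenes
        y_hat = rng.random() < (recall if y_true else fpr)
        in_holdout = rng.random() < holdout_rate     # scored, but never acted on
        intervened = y_hat and not in_holdout
        y_obs = y_true
        if intervened and y_true and rng.random() < success_rate:
            y_obs = False                            # successful intervention hides the event
        (holdout if in_holdout else acted).append((y_hat, y_obs))
    return acted, holdout


def precision(records):
    """Fraction of flagged records whose observed label is positive."""
    flagged = [y_obs for y_hat, y_obs in records if y_hat]
    return sum(flagged) / len(flagged) if flagged else float("nan")


acted, holdout = simulate(n=20_000, holdout_rate=0.2, base_rate=0.2,
                          recall=0.8, fpr=0.1, success_rate=0.7)
print("precision where users acted:", round(precision(acted), 3))
print("precision on hold-out set:  ", round(precision(holdout), 3))
```

    Under these assumptions the records that were acted on understate precision, because successful interventions are logged as false positives, while the hold-out records keep labels that reflect the untreated outcome.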
    [P] Using algorithms or models from papers for commercial use
    Hey! I am reading the GET3D paper by Nvidia. The paper is listed with the Nvidia license which states: 3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use non-commercially. The Work or derivative works thereof may be used or intended for use by Nvidia or its affiliates commercially or non-commercially. As used herein, "non-commercially" means for research or evaluation purposes only and not for any direct or indirect monetary gain. Does it mean there is no commercial way of using the ideas in the paper? Is it possible to use the ideas from that paper or any other paper by Nvidia in some product? As the idea from the paper is only the tool or a part of the product but is not the product itself. submitted by /u/romantimm25 [link] [comments]  ( 44 min )
    [D] Google Predoctoral Program (India) 2023
    Has anyone got any interview email? Did they start the interview process? submitted by /u/Around-star [link] [comments]  ( 42 min )
    [D] ImageNet2012 Advice
    I'm currently at the point in my PhD career where I've developed some extremely successful components of CNNs (architecture, activation, etc.), outperforming default choices on CIFAR10, CIFAR100, Flowers, Caltech101, and other smaller datasets. Given how successful the results are, we want to publish at a top-tier conference, specifically NeurIPS this spring, with a deadline around May 13th. However, we (my advisor and I) agree that to publish at NeurIPS, our developments need to be backed up by ImageNet. The problem is that we have never trained on ImageNet before (so no experience), and have a limited computational budget. Although our university owns 2 A100 40 GB GPUs that we can use, they are shared across the entire university, so a 2 day job takes about 1 week in the queue (we don't know if we can get the results in time by May). On the other hand, we also don't know if we can get a $2500 grant in time to use cloud resources. For those who have trained on ImageNet, what are some common pitfalls, the best ways to transfer data, download the dataset, etc.? If you trained on the cloud, how did you do so? How long did training take? Expenses? Did you run each model once or three times? Early stopping using the validation or test set? NOTE: We will only be using Tensorflow... submitted by /u/MyActualUserName99 [link] [comments]  ( 44 min )
    [D] Best large language model for Named Entity Extraction?
    I'd like to extract named entities, something like this: "[Text]: Microsoft (the word being a portmanteau of "microcomputer software") was founded by Bill Gates on April 4, 1975, to develop and sell BASIC interpreters for the Altair 8800. Steve Ballmer replaced Gates as CEO in 2000, and later envisioned a "devices and services" strategy. [Name]: Steve Ballmer [Position]: CEO [Company]: Microsoft " Tried it on GPT-Neox with 20b parameters with mixed success, is there anything better out there to try for a few-shot learning (without fine tuning)? submitted by /u/TankAttack [link] [comments]  ( 44 min )
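    The prompt format shown in this post can be assembled and parsed with a few lines of plain Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, no model is called, and `parse_completion` assumes the model continues the text right after the trailing `[Name]:` tag.

```python
import re

FIELDS = ("Name", "Position", "Company")


def format_example(text, labels):
    """Render one solved example in the [Text]/[Name]/[Position]/[Company] layout."""
    lines = [f"[Text]: {text}"]
    lines += [f"[{field}]: {labels[field]}" for field in FIELDS]
    return "\n".join(lines)


def build_prompt(shots, query_text):
    """Join solved examples, then leave the first field of the query open."""
    blocks = [format_example(text, labels) for text, labels in shots]
    blocks.append(f"[Text]: {query_text}\n[{FIELDS[0]}]:")
    return "\n\n".join(blocks)


def parse_completion(completion):
    """Read the first value of each field from the model's continuation."""
    full = f"[{FIELDS[0]}]:" + completion
    found = {}
    for field in FIELDS:
        match = re.search(rf"\[{field}\]:[ \t]*(.*)", full)
        if match:
            found[field] = match.group(1).strip()
    return found


shots = [(
    "Steve Ballmer replaced Gates as CEO in 2000.",
    {"Name": "Steve Ballmer", "Position": "CEO", "Company": "Microsoft"},
)]
print(build_prompt(shots, "Some new sentence to label."))
```

    Ending the prompt on an open `[Name]:` tag nudges a completion-style model to fill in the three fields in order; stopping generation at the next `[Text]:` would be a sensible setting wherever the serving API supports a stop sequence.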
    [R] ETLP: Event-based Three-factor Local Plasticity for online learning with neuromorphic hardware
    Neuromorphic perception with event-based sensors, asynchronous hardware and spiking neurons is showing promising results for real-time and energy-efficient inference in embedded systems. The next promise of brain-inspired computing is to enable adaptation to changes at the edge with online learning. However, the parallel and distributed architectures of neuromorphic hardware based on co-localized compute and memory imposes locality constraints to the on-chip learning rules. We propose in this work the Event-based Three-factor Local Plasticity (ETLP) rule that uses (1) the pre-synaptic spike trace, (2) the post-synaptic membrane voltage and (3) a third factor in the form of projected labels with no error calculation, that also serve as update triggers. We apply ETLP with feedforward and recurrent spiking neural networks on visual and auditory event-based pattern recognition, and compare it to Back-Propagation Through Time (BPTT) and eProp. We show a competitive performance in accuracy with a clear advantage in the computational complexity for ETLP. We also show that when using local plasticity, threshold adaptation in spiking neurons and a recurrent topology are necessary to learn spatio-temporal patterns with a rich temporal structure. Finally, we provide a proof of concept hardware implementation of ETLP on FPGA to highlight the simplicity of its computational primitives and how they can be mapped into neuromorphic hardware for online learning with low-energy consumption and real-time interaction. Full paper: https://arxiv.org/abs/2301.08281 submitted by /u/ferquinve [link] [comments]  ( 43 min )
    [D] MusicLM: Generating Music From Text
    How far do you think this can go? Is it a memorization machine or can it create new songs? https://google-research.github.io/seanet/musiclm/examples/ submitted by /u/carlthome [link] [comments]  ( 44 min )
    [R] SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
    Large Language Models (LLMs) from the Generative Pretrained Transformer (GPT) family have shown remarkable performance on a wide range of tasks, but are difficult to deploy because of their massive size and computational costs. For instance, the top-performing GPT-175B model has 175 billion parameters, which total at least 320GB (counting multiples of 1024) of storage in half-precision (FP16) format, leading it to require at least five A100 GPUs with 80GB of memory each for inference. It is therefore natural that there has been significant interest in reducing these costs via model compression. To date, virtually all existing GPT compression approaches have focused on quantization, that is, reducing the precision of the numerical representation of individual weights. A complementary approa…  ( 47 min )
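    The storage figures quoted in this abstract follow from simple arithmetic, sketched below. The function name is hypothetical, and the calculation counts weights only (no activations or other runtime buffers), which is an assumption of this sketch.

```python
import math


def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GiB (multiples of 1024)."""
    return n_params * bytes_per_param / 1024 ** 3


n_params = 175_000_000_000      # 175 billion parameters
fp16_bytes = 2                  # half precision stores each weight in 2 bytes

gib = weight_memory_gib(n_params, fp16_bytes)
gpus_needed = math.ceil(gib / 80)   # 80 GB of memory per A100

print(f"weights: {gib:.1f} GiB, GPUs needed: {gpus_needed}")
```

    The result is roughly 326 GiB, consistent with the "at least 320GB" figure, and four 80GB cards would fall short, hence five.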
    [D] Meta AI Residency 2023
    Now that apps are closed, did anyone hear back yet? Please follow this thread and update your status below. Tbh I don't really think I have much of a chance, but I'm excited nonetheless. To follow a thread, press the bell on the top right. submitted by /u/BeautyInUgly  ( 42 min )
    [D] Moving away from Unicode for more equal token representation across global languages?
    Edit: as has been explained in the comments, Unicode is not the issue so much as the byte-pair encoding scheme, which artificially limits the vocabulary size of the model and leads to less common languages using more tokens. I'd like to discuss the impacts of increasing the vocabulary size on transformer model computational requirements. Many languages, like Chinese, Japanese Kanji, Korean, Telugu, etc. use complex logograms to represent words and concepts. Unfortunately, these languages are severely "punished" in GPT3 because they are expensive to tokenize due to the way Unicode represents them. Instead of Unicode representing them as a single code point, logograms are typically represented as a sum of multiple graphemes, meaning that multiple Unicode code points underlie their descriptio…  ( 49 min )
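    The point made in the edit note can be illustrated with a small check of code points versus UTF-8 bytes. This is a sketch under assumptions: the sample characters are arbitrary picks, not taken from the post, and the relevance to token counts assumes a tokenizer that starts from UTF-8 bytes before applying byte-pair merges, which the post does not spell out.

```python
# Sample characters (arbitrary picks for illustration, not taken from the post).
samples = {
    "Latin a": "a",
    "CJK U+4E2D": "\u4e2d",
    "Hangul U+D55C": "\ud55c",
    "Telugu U+0C24": "\u0c24",
}

def profile(text):
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for name, ch in samples.items():
    code_points, n_bytes = profile(ch)
    print(f"{name}: {code_points} code point(s), {n_bytes} UTF-8 byte(s)")
```

    Each sample here is a single code point, but the non-Latin ones take three bytes in UTF-8. For a byte-level vocabulary, a character like this costs several tokens unless the training data contained it often enough for the merges to be learned, which matches the explanation that the encoding scheme rather than Unicode itself is what drives the token count.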
    [D] Why are there no End2End Speech Recognition models using the same Encoder-Decoder learning process as BART (no CTC) ?
    I'm new to CTC. After learning about CTC and its application in End2End training for Speech Recognition, I figured that if we want to generate a target sequence (transcript) given source sequence features, we could use the vanilla Encoder-Decoder architecture in Transformer (also used in T5, BART, etc.) alone, without the need for CTC. So why are people only using CTC for End2End Speech Recognition, or a hybrid of CTC and a Decoder in some papers? Thanks. submitted by /u/KarmaCut132  ( 43 min )
  • Open

    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: any nAI activation function is universal. It turns out that most existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework offers several advantages over its contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with a guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with a predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with a favorable non-asymptotic rate.  ( 3 min )
    The Devil is the Classifier: Investigating Long Tail Relation Classification with Decoupling Analysis. (arXiv:2009.07022v1 [cs.LG] CROSS LISTED)
    Long-tailed relation classification is a challenging problem as the head classes may dominate the training phase, thereby leading to the deterioration of the tail performance. Existing solutions usually address this issue via class-balancing strategies, e.g., data re-sampling and loss re-weighting, but all these methods adhere to the schema of entangling learning of the representation and classifier. In this study, we conduct an in-depth empirical investigation into the long-tailed problem and find that pre-trained models with instance-balanced sampling already capture well-learned representations for all classes; moreover, it is possible to achieve better long-tailed classification ability at low cost by only adjusting the classifier. Inspired by this observation, we propose a robust classifier with attentive relation routing, which assigns soft weights by automatically aggregating the relations. Extensive experiments on two datasets demonstrate the effectiveness of our proposed approach. Code and datasets are available at https://github.com/zjunlp/deepke.  ( 2 min )
    Contrastive Triple Extraction with Generative Transformer. (arXiv:2009.06207v8 [cs.CL] CROSS LISTED)
    Triple extraction is an essential task in information extraction for natural language processing and knowledge graph construction. In this paper, we revisit the end-to-end triple extraction task for sequence generation. Since generative triple extraction may struggle to capture long-term dependencies and generate unfaithful triples, we introduce a novel model, contrastive triple extraction with a generative transformer. Specifically, we introduce a single shared transformer module for encoder-decoder-based generation. To generate faithful results, we propose a novel triplet contrastive training objective. Moreover, we introduce two mechanisms to further improve model performance (i.e., batch-wise dynamic attention-masking and triple-wise calibration). Experimental results on three datasets (i.e., NYT, WebNLG, and MIE) show that our approach achieves better performance than the baselines.  ( 2 min )
    Interaction Modeling with Multiplex Attention. (arXiv:2208.10660v2 [cs.LG] UPDATED)
    Modeling multi-agent systems requires understanding how agents interact. Such systems are often difficult to model because they can involve a variety of types of interactions that layer together to drive rich social behavioral dynamics. Here we introduce a method for accurately modeling multi-agent systems. We present Interaction Modeling with Multiplex Attention (IMMA), a forward prediction model that uses a multiplex latent graph to represent multiple independent types of interactions and attention to account for relations of different strengths. We also introduce Progressive Layer Training, a training strategy for this architecture. We show that our approach outperforms state-of-the-art models in trajectory forecasting and relation inference, spanning three multi-agent scenarios: social navigation, cooperative task achievement, and team sports. We further demonstrate that our approach can improve zero-shot generalization and allows us to probe how different interactions impact agent behavior.  ( 2 min )
    Deep learning in a bilateral brain with hemispheric specialization. (arXiv:2209.06862v6 [q-bio.NC] UPDATED)
    The brains of all bilaterally symmetric animals on Earth are divided into left and right hemispheres. The anatomy and functionality of the hemispheres have a large degree of overlap, but there are asymmetries and they specialize to possess different attributes. Several studies have used computational models to mimic hemispheric asymmetries with a focus on reproducing human data on semantic and visual processing tasks. In this study, we aimed to understand how dual hemispheres could interact in a given task. We propose a bilateral artificial neural network that imitates a lateralization observed in nature: that the left hemisphere specializes in specificity and the right in generalities. We used two ResNet-9 convolutional neural networks with different training objectives and tested them on an image classification task. Our analysis found that the hemispheres represent complementary features that are exploited by a network head which implements a type of weighted attention. The bilateral architecture outperformed a range of baselines of similar representational capacity that don't exploit differential specialization. The results demonstrate the efficacy of bilateralism and contribute to an understanding of bilateralism in biological brains, and the architecture serves as an inductive bias when designing new AI systems.  ( 2 min )
    Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. (arXiv:1903.01306v1 [cs.IR] CROSS LISTED)
    We propose a distantly supervised relation extraction approach for long-tailed, imbalanced data, which is prevalent in real-world settings. Here, the challenge is to learn accurate "few-shot" models for classes existing at the tail of the class distribution, for which little data is available. Inspired by the rich semantic correlations between classes at the long tail and those at the head, we take advantage of the knowledge from data-rich classes at the head of the distribution to boost the performance of the data-poor classes at the tail. First, we propose to leverage implicit relational knowledge among class labels from knowledge graph embeddings and learn explicit relational knowledge using graph convolution networks. Second, we integrate that relational knowledge into the relation extraction model via a coarse-to-fine knowledge-aware attention mechanism. We demonstrate our results on a large-scale benchmark dataset, which show that our approach significantly outperforms other baselines, especially for long-tail relations.  ( 2 min )
    Towards Dynamic Stability Assessment of Power Grid Topologies using Graph Neural Networks. (arXiv:2206.06369v3 [cs.LG] UPDATED)
    To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. Since dynamic stability simulations are intractable and exceedingly expensive for large grids, graph neural networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting the highly non-linear targets from topological information only. Furthermore, we use GNNs to demonstrate the accurate identification of particularly vulnerable nodes in power grids, so-called troublemakers. Lastly, we find that GNNs trained on small grids generate accurate predictions on a large synthetic model of the Texan power grid, which illustrates the potential for real-world applications of the presented approach.  ( 2 min )
    Why the pseudo label based semi-supervised learning algorithm is effective? (arXiv:2211.10039v2 [cs.LG] UPDATED)
    Recently, pseudo label based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo label based semi-supervised learning algorithm is to use the model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit the previously generated pseudo labels. In this paper, we give a theoretical analysis of why pseudo label based semi-supervised learning is effective. We mainly compare the generalization error of the model trained under two settings: (1) There are N labeled data. (2) There are N unlabeled data and a suitable initial model. Our analysis shows that, firstly, when the amount of unlabeled data tends to infinity, the pseudo label based semi-supervised learning algorithm can obtain a model that has the same generalization error upper bound as a model obtained by normal training when the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by the pseudo label based semi-supervised learning algorithm converges to the optimal upper bound at a linear convergence rate. We also give a lower bound on the sampling complexity needed to achieve the linear convergence rate. Our analysis contributes to understanding the empirical successes of pseudo label-based semi-supervised learning.  ( 2 min )
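    The core idea described above (train on labeled data, pseudo-label the unlabeled data, then fit a model to the pseudo labels) can be sketched in a few lines. This is a toy illustration, not the paper's setup: the 1-D Gaussian data, the nearest-mean classifier, and all names (`fit_means`, `predict`, `teacher`, `student`) are assumptions chosen to keep the example self-contained.

```python
import random

def fit_means(xs, ys):
    """Fit a 1-D nearest-mean classifier: one mean per class label."""
    means = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        means[label] = sum(members) / len(members)
    return means

def predict(means, xs):
    """Assign each point the label of the closest class mean."""
    return [min(means, key=lambda c: abs(x - means[c])) for x in xs]

rng = random.Random(0)

# Toy data (assumption): two classes centred at -3 and +3 with unit variance.
x_lab = [rng.gauss(-3, 1) for _ in range(5)] + [rng.gauss(3, 1) for _ in range(5)]
y_lab = [0] * 5 + [1] * 5
x_unl = [rng.gauss(-3, 1) for _ in range(200)] + [rng.gauss(3, 1) for _ in range(200)]
y_true = [0] * 200 + [1] * 200   # held back; used only to score the result

teacher = fit_means(x_lab, y_lab)      # step 1: initial model from labeled data
pseudo = predict(teacher, x_unl)       # step 2: pseudo labels for unlabeled data
student = fit_means(x_unl, pseudo)     # step 3: new model fitted to pseudo labels

accuracy = sum(p == t for p, t in zip(predict(student, x_unl), y_true)) / len(y_true)
print(f"student accuracy on the unlabeled pool: {accuracy:.3f}")
```

    The student never sees a true label for the unlabeled pool; how close it can get to a fully supervised model is the question the abstract's generalization bounds address.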
    ZJUKLAB at SemEval-2021 Task 4: Negative Augmentation with Language Model for Reading Comprehension of Abstract Meaning. (arXiv:2102.12828v3 [cs.CL] CROSS LISTED)
    This paper presents our systems for the three Subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM). We explain the algorithms used to learn our models and the process of tuning the algorithms and selecting the best model. Inspired by the similarity of the ReCAM task and language pre-training, we propose a simple yet effective technique, namely, negative augmentation with a language model. Evaluation results demonstrate the effectiveness of our proposed approach. Our models achieve the 4th rank on both official test sets of Subtask 1 and Subtask 2 with an accuracy of 87.9% and an accuracy of 92.8%, respectively. We further conduct comprehensive model analysis and observe interesting error cases, which may promote future research.  ( 2 min )
    Ultra-NeRF: Neural Radiance Fields for Ultrasound Imaging. (arXiv:2301.10520v1 [eess.IV])
    We present a physics-enhanced implicit neural representation (INR) for ultrasound (US) imaging that learns tissue properties from overlapping US sweeps. Our proposed method leverages a ray-tracing-based neural rendering for novel view US synthesis. Recent publications demonstrated that INR models could encode a representation of a three-dimensional scene from a set of two-dimensional US frames. However, these models fail to consider the view-dependent changes in appearance and geometry intrinsic to US imaging. In our work, we discuss direction-dependent changes in the scene and show that a physics-inspired rendering improves the fidelity of US image synthesis. In particular, we demonstrate experimentally that our proposed method generates geometrically accurate B-mode images for regions with ambiguous representation owing to view-dependent differences of the US images. We conduct our experiments using simulated B-mode US sweeps of the liver and acquired US sweeps of a spine phantom tracked with a robotic arm. The experiments corroborate that our method generates US frames that enable consistent volume compounding from previously unseen views. To the best of our knowledge, the presented work is the first to address view-dependent US image synthesis using INR.
    Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. (arXiv:2108.13161v7 [cs.CL] CROSS LISTED)
    Large-scale pre-trained language models have contributed significantly to natural language processing by demonstrating remarkable abilities as few-shot learners. However, their effectiveness depends mainly on scaling the model parameters and prompt design, hindering their implementation in most real-world applications. This study proposes a novel pluggable, extensible, and efficient approach named DifferentiAble pRompT (DART), which can convert small language models into better few-shot learners without any prompt engineering. The main principle behind this approach involves reformulating potential natural language processing tasks into the task of a pre-trained language model and differentially optimizing the prompt template as well as the target label with backpropagation. Furthermore, the proposed approach can be: (i) plugged into any pre-trained language model; (ii) extended to widespread classification tasks. A comprehensive evaluation on standard NLP tasks demonstrates that the proposed approach achieves better few-shot performance. Code is available at https://github.com/zjunlp/DART.
    PromptKG: A Prompt Learning Framework for Knowledge Graph Representation Learning and Application. (arXiv:2210.00305v1 [cs.CL] CROSS LISTED)
    Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information. KG representation models should consider graph structures and text semantics, but no comprehensive open-sourced framework is mainly designed for KG regarding informative text description. In this paper, we present PromptKG, a prompt learning framework for KG representation learning and application that equips the cutting-edge text-based methods, integrates a new prompt learning model and supports various tasks (e.g., knowledge graph completion, question answering, recommendation, and knowledge probing). PromptKG is publicly open-sourced at https://github.com/zjunlp/PromptKG with long-term technical support.
    Conceptualized Representation Learning for Chinese Biomedical Text Mining. (arXiv:2008.10813v1 [cs.CL] CROSS LISTED)
    Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to learn via language models. For Chinese biomedical text, it is more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach could bring significant gain. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.
    Normal vs. Adversarial: Salience-based Analysis of Adversarial Samples for Relation Extraction. (arXiv:2104.00312v4 [cs.CL] CROSS LISTED)
    Recent neural-based relation extraction approaches, though achieving promising improvement on benchmark datasets, have reported their vulnerability towards adversarial attacks. Thus far, efforts have mostly focused on generating adversarial samples or defending against adversarial attacks, but little is known about the difference between normal and adversarial samples. In this work, we take the first step to leverage the salience-based method to analyze those adversarial samples. We observe that salience tokens have a direct correlation with adversarial perturbations. We further find that the adversarial perturbations are either tokens not existing in the training set or superficial cues associated with relation labels. To some extent, our approach unveils the characteristics of adversarial samples. We release an open-source testbed, "DiagnoseAdv", at https://github.com/zjunlp/DiagnoseAdv.
    Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings. (arXiv:2201.05575v2 [cs.CL] CROSS LISTED)
    Previous knowledge graph embedding approaches usually map entities to representations and utilize score functions to predict the target entities, yet they struggle to reason about rare or emerging unseen entities. In this paper, we propose kNN-KGE, a new knowledge graph embedding approach with pre-trained language models, by linearly interpolating its entity distribution with k-nearest neighbors. We compute the nearest neighbors based on the distance in the entity embedding space from the knowledge store. Our approach can allow rare or emerging entities to be memorized explicitly rather than implicitly in model parameters. Experimental results demonstrate that our approach can improve inductive and transductive link prediction results and yield better performance for low-resource settings with only a few triples, which might be easier to reason about via explicit memory. Code is available at https://github.com/zjunlp/KNN-KG.
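    The linear interpolation mentioned above can be sketched as follows. This is a generic illustration rather than the paper's implementation: the softmax over negative squared distances, the mixing weight `lam`, the toy knowledge store, and all function names are assumptions introduced here.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_distribution(query, store, k, n_entities):
    """Distribution over entity ids from the k nearest stored vectors.

    Each neighbour is weighted by exp(-squared distance) (an assumed kernel),
    and weights are normalised and summed per entity id.
    """
    nearest = sorted(store, key=lambda item: sq_dist(query, item[0]))[:k]
    weights = [math.exp(-sq_dist(query, vec)) for vec, _ in nearest]
    total = sum(weights)
    probs = [0.0] * n_entities
    for weight, (_, entity_id) in zip(weights, nearest):
        probs[entity_id] += weight / total
    return probs

def interpolate(p_model, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    return [lam * k + (1.0 - lam) * m for m, k in zip(p_model, p_knn)]

# Hypothetical knowledge store: (embedding, entity id) pairs.
store = [((0.0, 0.0), 0), ((1.0, 0.0), 1), ((5.0, 5.0), 2)]
p_model = [0.6, 0.3, 0.1]          # hypothetical model distribution over 3 entities
p_knn = knn_distribution((0.9, 0.0), store, k=2, n_entities=3)
p_mixed = interpolate(p_model, p_knn, lam=0.5)

print("model pick:", p_model.index(max(p_model)))
print("mixed pick:", p_mixed.index(max(p_mixed)))
```

    In this toy case the model alone prefers entity 0, while the retrieved neighbours shift the mixed distribution toward entity 1, which is the kind of explicit-memory correction the abstract describes for rare entities.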
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Logic-Based Explainability in Machine Learning. (arXiv:2211.00541v2 [cs.AI] UPDATED)
    The last decade witnessed an ever-increasing stream of successes in Machine Learning (ML). These successes offer clear evidence that ML is bound to become pervasive in a wide range of practical uses, including many that directly affect humans. Unfortunately, the operation of the most successful ML models is incomprehensible for human decision makers. As a result, the use of ML models, especially in high-risk and safety-critical settings is not without concern. In recent years, there have been efforts on devising approaches for explaining ML models. Most of these efforts have focused on so-called model-agnostic approaches. However, all model-agnostic and related approaches offer no guarantees of rigor, hence being referred to as non-formal. For example, such non-formal explanations can be consistent with different predictions, which renders them useless in practice. This paper overviews the ongoing research efforts on computing rigorous model-based explanations of ML models; these being referred to as formal explanations. These efforts encompass a variety of topics, that include the actual definitions of explanations, the characterization of the complexity of computing explanations, the currently best logical encodings for reasoning about different ML models, and also how to make explanations interpretable for human decision makers, among others.
    Disentangled Contrastive Learning for Learning Robust Textual Representations. (arXiv:2104.04907v2 [cs.CL] CROSS LISTED)
    Although the self-supervised pre-training of transformer models has resulted in the revolutionizing of natural language processing (NLP) applications and the achievement of state-of-the-art results with regard to various benchmarks, this process is still vulnerable to small and imperceptible permutations originating from legitimate inputs. Intuitively, the representations should be similar in the feature space with subtle input permutations, while large variations occur with different meanings. This motivates us to investigate the learning of robust textual representation in a contrastive manner. However, it is non-trivial to obtain opposing semantic instances for textual samples. In this study, we propose a disentangled contrastive learning method that separately optimizes the uniformity and alignment of representations without negative sampling. Specifically, we introduce the concept of momentum representation consistency to align features and leverage power normalization while conforming the uniformity. Our experimental results for the NLP benchmarks demonstrate that our approach can obtain better results compared with the baselines, as well as achieve promising improvements with invariance tests and adversarial attacks. The code is available in https://github.com/zxlzr/DCL.
    Multi-Agent Deep Reinforcement Learning for Efficient Passenger Delivery in Urban Air Mobility. (arXiv:2211.06890v2 [cs.MA] UPDATED)
    Urban air mobility (UAM), also known as drone-taxi or electric vertical takeoff and landing (eVTOL), is expected to play a key role in future transportation. Putting UAM into practice can bring several benefits, namely, (i) the total travel time of passengers can be reduced compared to traditional transportation and (ii) there is no environmental pollution and no special labor cost to operate the system because electric batteries will be used in the UAM system. However, there are various dynamic and uncertain factors in the flight environment, e.g., sudden passenger service requests, battery discharge, and collisions among UAMs. Therefore, this paper proposes a novel cooperative MADRL algorithm based on centralized training and distributed execution (CTDE) concepts for reliable and efficient passenger delivery in UAM networks. According to the performance evaluation results, we confirm that the proposed algorithm outperforms other existing algorithms in terms of the increase in the number of serviced passengers (30%) and the decrease in waiting time per serviced passenger (26%).
    Self-Supervised Hierarchical Metrical Structure Modeling. (arXiv:2210.17183v2 [cs.SD] UPDATED)
    We propose a novel method to model hierarchical metrical structures for both symbolic music and audio signals in a self-supervised manner with minimal domain knowledge. The model trains and runs inference on beat-aligned music signals and predicts an 8-layer hierarchical metrical tree from the beat and measure levels up to the section level. The training procedure does not require any hierarchical metrical labeling except for beats, purely relying on the nature of metrical regularity and inter-voice consistency as inductive biases. We show in experiments that the method achieves comparable performance with supervised baselines on multiple metrical structure analysis tasks on both symbolic music and audio signals. All demos, source code and pre-trained models are publicly available on GitHub.
    On Robustness and Bias Analysis of BERT-based Relation Extraction. (arXiv:2009.06206v5 [cs.CL] CROSS LISTED)
    Fine-tuning pre-trained models has achieved impressive performance on standard natural language processing benchmarks. However, the resultant model generalizability remains poorly understood. We do not know, for example, how excellent performance can lead to the perfection of generalization models. In this study, we analyze a fine-tuned BERT model from different perspectives using relation extraction. We also characterize the differences in generalization techniques according to our proposed improvements. From empirical experimentation, we find that BERT suffers a bottleneck in terms of robustness by way of randomizations, adversarial and counterfactual tests, and biases (i.e., selection and semantic). These findings highlight opportunities for future improvements. Our open-sourced testbed DiagnoseRE is available at https://github.com/zjunlp/DiagnoseRE.
    Kformer: Knowledge Injection in Transformer Feed-Forward Layers. (arXiv:2201.05742v2 [cs.CL] CROSS LISTED)
    Recent days have witnessed a diverse set of knowledge injection models for pre-trained language models (PTMs); however, most previous studies neglect the PTMs' own ability with quantities of implicit knowledge stored in parameters. A recent study has observed knowledge neurons in the Feed Forward Network (FFN), which are responsible for expressing factual knowledge. In this work, we propose a simple model, Kformer, which takes advantage of the knowledge stored in PTMs and external knowledge via knowledge injection in Transformer FFN layers. Empirical results on two knowledge-intensive tasks, commonsense reasoning (i.e., SocialIQA) and medical question answering (i.e., MedQA-USMLE), demonstrate that Kformer can yield better performance than other knowledge injection technologies such as concatenation or attention-based injection. We think the proposed simple model and empirical findings may be helpful for the community to develop more powerful knowledge injection methods. Code is available at https://github.com/zjunlp/Kformer.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, Top Two methods have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    On the Probability of Necessity and Sufficiency of Explaining Graph Neural Networks: A Lower Bound Optimization Approach. (arXiv:2212.07056v2 [cs.LG] UPDATED)
    The explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a heuristic trade-off between the two. Theoretically, the Probability of Necessity and Sufficiency (PNS) can be applied to search for the most necessary and sufficient explanation since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limit its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples and leverage continuous masks with a sampling strategy to optimize the lower bound. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
    Learning to Ask for Data-Efficient Event Argument Extraction. (arXiv:2110.00479v1 [cs.CL] CROSS LISTED)
    Event argument extraction (EAE) is an important task for information extraction to discover specific argument roles. In this study, we cast EAE as a question-based cloze task and empirically analyze fixed discrete token template performance. As generating human-annotated question templates is often time-consuming and labor-intensive, we further propose a novel approach called "Learning to Ask," which can learn optimized question templates for EAE without human annotations. Experiments using the ACE-2005 dataset demonstrate that our method based on optimized questions achieves state-of-the-art performance in both the few-shot and supervised settings.
    Communication-Efficient Diffusion Strategy for Performance Improvement of Federated Learning with Non-IID Data. (arXiv:2207.07493v3 [cs.DC] UPDATED)
    Federated learning (FL) is a novel learning paradigm that addresses the privacy leakage challenge of centralized learning. However, in FL, users with non-independent and identically distributed (non-IID) characteristics can deteriorate the performance of the global model. Specifically, the global model suffers from the weight divergence challenge owing to non-IID data. To address the aforementioned challenge, we propose a novel diffusion strategy of the machine learning (ML) model (FedDif) to maximize the FL performance with non-IID data. In FedDif, users spread local models to neighboring users over D2D communications. FedDif enables the local model to experience different distributions before parameter aggregation. Furthermore, we theoretically demonstrate that FedDif can circumvent the weight divergence challenge. On this theoretical basis, we propose the communication-efficient diffusion strategy of the ML model, which can determine the trade-off between the learning performance and communication cost based on auction theory. The performance evaluation results show that FedDif improves the test accuracy of the global model by 10.37% compared to the baseline FL with non-IID settings. Moreover, FedDif improves the number of consumed sub-frames by a factor of 1.28 to 2.85 compared with the latest methods, except for the model compression scheme. FedDif also improves the number of transmitted models by a factor of 1.43 to 2.67 compared with the latest methods.
    Relation Adversarial Network for Low Resource Knowledge Graph Completion. (arXiv:1911.03091v6 [cs.CL] CROSS LISTED)
    Knowledge Graph Completion (KGC) has been proposed to improve Knowledge Graphs by filling in missing connections via link prediction or relation extraction. One of the main difficulties for KGC is a low resource problem. Previous approaches assume sufficient training triples to learn versatile vectors for entities and relations, or a satisfactory number of labeled sentences to train a competent relation extraction model. However, low resource relations are very common in KGs, and those newly added relations often do not have many known samples for training. In this work, we aim at predicting new facts under a challenging setting where only limited training instances are available. We propose a general framework called Weighted Relation Adversarial Network, which utilizes an adversarial procedure to help adapt knowledge/features learned from high resource relations to different but related low resource relations. Specifically, the framework takes advantage of a relation discriminator to distinguish between samples from different relations, and help learn relation-invariant features more transferable from source relations to target relations. Experimental results show that the proposed approach outperforms previous methods regarding low resource settings for both link prediction and relation extraction.
    Schema-aware Reference as Prompt Improves Data-Efficient Relational Triple and Event Extraction. (arXiv:2210.10709v3 [cs.CL] CROSS LISTED)
    Information Extraction, which aims to extract structured relational triples or events from unstructured texts, often suffers from data scarcity issues. With the development of pre-trained language models, many prompt-based approaches to data-efficient information extraction have been proposed and have achieved impressive performance. However, existing prompt learning methods for information extraction are still susceptible to several potential limitations: (i) a semantic gap between natural language and output structure knowledge with a pre-defined schema; (ii) representation learning with locally individual instances, which limits the performance given the insufficient features. In this paper, we propose a novel approach of schema-aware Reference As Prompt (RAP), which dynamically leverages schema and knowledge inherited from global (few-shot) training data for each sample. Specifically, we propose a schema-aware reference store, which unifies symbolic schema and relevant textual instances. Then, we employ a dynamic reference integration module to retrieve pertinent knowledge from the datastore as prompts during training and inference. Experimental results demonstrate that RAP can be plugged into various existing models and outperforms baselines in low-resource settings on four datasets of relational triple extraction and event extraction. In addition, we provide comprehensive empirical ablations and case analysis regarding different types and scales of knowledge in order to better understand the mechanisms of RAP. Code is available at https://github.com/zjunlp/RAP.
    LOGEN: Few-shot Logical Knowledge-Conditioned Text Generation with Self-training. (arXiv:2112.01404v2 [cs.CL] CROSS LISTED)
    Natural language generation from structured data mainly focuses on surface-level descriptions, suffering from uncontrollable content selection and low fidelity. Previous works leverage logical forms to facilitate logical knowledge-conditioned text generation. Though achieving remarkable progress, they are data-hungry, which makes adoption in real-world applications with limited data challenging. To this end, this paper proposes a unified framework for logical knowledge-conditioned text generation in the few-shot setting. With only a few seed logical forms (e.g., 20/100 shot), our approach leverages self-training and samples pseudo logical forms based on content and structure consistency. Experimental results demonstrate that our approach can obtain better few-shot performance than baselines.
    Document-level Relation Extraction as Semantic Segmentation. (arXiv:2106.03618v2 [cs.CL] CROSS LISTED)
    Document-level relation extraction aims to extract relations among multiple entity pairs from a document. Previously proposed graph-based or transformer-based models utilize the entities independently, regardless of global information among relational triples. This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information, parallel to the semantic segmentation task in computer vision. Herein, we propose a Document U-shaped Network for document-level relation extraction. Specifically, we leverage an encoder module to capture the context information of entities and a U-shaped segmentation module over the image-style feature map to capture global interdependency among triples. Experimental results show that our approach can obtain state-of-the-art performance on three benchmark datasets DocRED, CDR, and GDA.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevyrev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
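    As a rough illustration of the kind of feature a signature provides, the sketch below computes the first two signature levels of a piecewise-linear path in pure Python. It is not taken from the survey or its notebooks; the function name `signature_depth2` and the example path are assumptions made for illustration. It uses the standard facts that a straight segment with increment $\Delta$ has levels $\Delta$ and $\Delta \otimes \Delta / 2$, and that segments are combined with Chen's identity.

```python
def signature_depth2(points):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `points` is a list of d-dimensional points (lists of floats).
    Returns (s1, s2): s1 is a length-d list, s2 a d-by-d nested list.
    """
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for p, q in zip(points, points[1:]):
        delta = [q[k] - p[k] for k in range(d)]
        # Chen's identity for appending one straight segment:
        # new level 2 = old level 2 + (old level 1) x delta + delta x delta / 2
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        # Level 1 is just the accumulated increment.
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


# Hypothetical example path: right one unit, then up one unit.
s1, s2 = signature_depth2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
```

    The level-2 terms record the order in which the coordinates moved: here the $(1,2)$ entry differs from the $(2,1)$ entry because the path moves in the first coordinate before the second. The number of terms at level $n$ is $d^n$, which is the exponential growth the abstract refers to.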
    Variance-Reduced Conservative Policy Iteration. (arXiv:2212.06283v2 [cs.LG] UPDATED)
    We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
    Linear TreeShap. (arXiv:2209.08192v2 [cs.LG] UPDATED)
    Decision trees are well-known due to their ease of interpretability. To improve accuracy, we need to grow deep trees or ensembles of trees. These are hard to interpret, offsetting their original benefits. Shapley values have recently become a popular way to explain the predictions of tree-based machine learning models. They provide a linear weighting to features independent of the tree structure. The rise in popularity is mainly due to TreeShap, which solves a general exponential complexity problem in polynomial time. Following extensive adoption in the industry, more efficient algorithms are required. This paper presents a more efficient and straightforward algorithm: Linear TreeShap. Like TreeShap, Linear TreeShap is exact and requires the same amount of memory.
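    To make the "general exponential complexity problem" concrete, here is a minimal brute-force sketch of exact Shapley values that enumerates every coalition of features. It is neither TreeShap nor Linear TreeShap, only the naive definition that those algorithms avoid; the toy model, the zero baseline, and the names `shapley_values` and `value_fn` are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model f(x) = 2*x0 + x1*x2; features outside the
# coalition are replaced by an assumed baseline value of 0.
x = [1.0, 2.0, 3.0]


def value_fn(coalition):
    z = [x[j] if j in coalition else 0.0 for j in range(len(x))]
    return 2.0 * z[0] + z[1] * z[2]


phi = shapley_values(value_fn, len(x))
```

    In this toy case the additive term is credited entirely to the first feature and the interaction term is split equally between the other two, and the attributions sum to the difference between the prediction and the baseline output.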
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantees; we instead focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    A Sequential Deep Learning Algorithm for Sampled Mixed-integer Optimisation Problems. (arXiv:2301.10703v1 [math.OC])
    Mixed-integer optimisation problems can be computationally challenging. Here, we introduce and analyse two efficient algorithms with a specific sequential design that are aimed at dealing with sampled problems within this class. At each iteration step of both algorithms, we first test the feasibility of a given test solution for each and every constraint associated with the sampled optimisation at hand, while also identifying those constraints that are violated. Subsequently, an optimisation problem is constructed with a constraint set consisting of the current basis -- namely the smallest set of constraints that fully specifies the current test solution -- as well as constraints related to a limited number of the identified violating samples. We show that both algorithms exhibit finite-time convergence towards the optimal solution. Algorithm 2 features a neural network classifier that notably improves the computational performance compared to Algorithm 1. We establish quantitatively the efficacy of these algorithms by means of three numerical tests: robust optimal power flow, robust unit commitment, and robust random mixed-integer linear program.
    Obstacle Identification and Ellipsoidal Decomposition for Fast Motion Planning in Unknown Dynamic Environments. (arXiv:2209.14233v2 [cs.RO] UPDATED)
    Collision avoidance in the presence of dynamic obstacles in unknown environments is one of the most critical challenges for unmanned systems. In this paper, we present a method that identifies obstacles in terms of ellipsoids to estimate linear and angular obstacle velocities. Our proposed method is based on the idea that any object can be approximately expressed by ellipsoids. To achieve this, we propose a method based on variational Bayesian estimation of a Gaussian mixture model, the Khachiyan algorithm, and a refinement algorithm. Our proposed method does not require knowledge of the number of clusters and can operate in real-time, unlike existing optimization-based methods. In addition, we define an ellipsoid-based feature vector to match obstacles given two temporally close point frames. Our method can be applied to any environment with static and dynamic obstacles, including the ones with rotating obstacles. We compare our algorithm with other clustering methods and show that when coupled with a trajectory planner, the overall system can efficiently traverse unknown environments in the presence of dynamic obstacles.
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v3 [math.OC] UPDATED)
    Convergence (virtual) bidding is an important part of two-settlement electric power markets as it can effectively reduce discrepancies between the day-ahead and real-time markets. Consequently, there is extensive research into the bidding strategies of virtual participants aiming to obtain optimal bids to submit to the day-ahead market. In this paper, we introduce a price-based general stochastic optimization framework to obtain optimal convergence bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to state-of-the-art convergence bidding approaches, such as self-scheduling and opportunistic approaches. Our general framework also provides a straightforward way to compare the performance of these models, which is demonstrated by numerical experiments on the California (CAISO) market.
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v3 [cs.LG] UPDATED)
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Context-aware Deep Model for Entity Recommendation in Search Engine at Alibaba. (arXiv:1909.04493v1 [cs.IR] CROSS LISTED)
    Entity recommendation, providing search users with an improved experience via assisting them in finding related entities for a given query, has become an indispensable feature of today's search engines. Existing studies typically only consider queries with explicit entities. They usually fail to handle complex queries without entities, such as "what food is good for cold weather", because their models cannot infer the underlying meaning of the input text. In this work, we believe that contexts convey valuable evidence that could facilitate the semantic modeling of queries, and take them into consideration for entity recommendation. In order to better model the semantics of queries and entities, we learn the representation of queries and entities jointly with attentive deep neural networks. We evaluate our approach using large-scale, real-world search logs from a widely used commercial Chinese search engine. Our system has been deployed in ShenMa Search Engine and you can fetch it in UC Browser of Alibaba. Results from an online A/B test suggest that the impression efficiency of click-through rate increased by 5.1% and page view increased by 5.5%.
    Convergence of Random Reshuffling Under The Kurdyka-Łojasiewicz Inequality. (arXiv:2110.04926v4 [math.OC] UPDATED)
    We study the random reshuffling (RR) method for smooth nonconvex optimization problems with a finite-sum structure. Though this method is widely utilized in practice such as the training of neural networks, its convergence behavior is only understood in several limited settings. In this paper, under the well-known Kurdyka-Lojasiewicz (KL) inequality, we establish strong limit-point convergence results for RR with appropriate diminishing step sizes, namely, the whole sequence of iterates generated by RR is convergent and converges to a single stationary point in an almost sure sense. In addition, we derive the corresponding rate of convergence, depending on the KL exponent and the suitably selected diminishing step sizes. When the KL exponent lies in $[0,\frac12]$, the convergence is at a rate of $\mathcal{O}(t^{-1})$ with $t$ counting the iteration number. When the KL exponent belongs to $(\frac12,1)$, our derived convergence rate is of the form $\mathcal{O}(t^{-q})$ with $q\in (0,1)$ depending on the KL exponent. The standard KL inequality-based convergence analysis framework only applies to algorithms with a certain descent property. We conduct a novel convergence analysis for the non-descent RR method with diminishing step sizes based on the KL inequality, which generalizes the standard KL framework. We summarize our main steps and core ideas in an informal analysis framework, which is of independent interest. As a direct application of this framework, we also establish similar strong limit-point convergence results for the reshuffled proximal point method.
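    For readers unfamiliar with the method, the following is a minimal sketch of random reshuffling with a diminishing step size on a one-dimensional least-squares finite sum. It is not the paper's analysis or experimental setup; the data, the step-size rule `step0 / t`, and the function name `random_reshuffling` are assumptions chosen only to show the sampling scheme: each epoch visits every component exactly once, in a freshly shuffled order.

```python
import random


def random_reshuffling(a, b, epochs=200, step0=0.05, seed=0):
    """Minimise (1/n) * sum_i 0.5 * (a[i] * w - b[i])**2 by random reshuffling."""
    rng = random.Random(seed)
    indices = list(range(len(a)))
    w = 0.0
    for t in range(1, epochs + 1):
        alpha = step0 / t        # diminishing step size, constant within an epoch
        rng.shuffle(indices)     # new permutation: sampling without replacement
        for i in indices:
            grad_i = a[i] * (a[i] * w - b[i])   # gradient of the i-th component
            w -= alpha * grad_i
    return w


# Hypothetical toy data; the exact minimiser is sum(a*b) / sum(a*a).
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 3.9, 6.2, 7.8]
w_rr = random_reshuffling(a, b)
w_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)
```

    Individual steps need not decrease the objective, which is why the abstract describes RR as a non-descent method; in this convex toy problem the iterate nevertheless settles near `w_star` as the step size shrinks.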
    A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data. (arXiv:2207.00812v3 [q-bio.QM] UPDATED)
    There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arc of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretability used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
    Asymptotic Analysis of Deep Residual Networks. (arXiv:2212.08199v2 [cs.LG] UPDATED)
    We investigate the asymptotic properties of deep Residual networks (ResNets) as the number of layers increases. We first show the existence of scaling regimes for trained weights markedly different from those implicitly assumed in the neural ODE literature. We study the convergence of the hidden state dynamics in these scaling regimes, showing that one may obtain an ODE, a stochastic differential equation (SDE) or neither of these. In particular, our findings point to the existence of a diffusive regime in which the deep network limit is described by a class of stochastic differential equations (SDEs). Finally, we derive the corresponding scaling limits for the backpropagation dynamics.
    To be or not to be stable, that is the question: understanding neural networks for inverse problems. (arXiv:2211.13692v2 [math.NA] UPDATED)
    The solution of linear inverse problems arising, for example, in signal and image processing is a challenging problem since the ill-conditioning amplifies the noise on the data. Recently introduced algorithms based on deep learning outperform the more traditional model-based approaches, but they typically suffer from instability with respect to data perturbation. In this paper, we theoretically analyze the trade-off between neural network stability and accuracy in the solution of linear inverse problems. Moreover, we propose different supervised and unsupervised solutions that increase network stability while maintaining good accuracy by inheriting, in the network training, regularization from a model-based iterative scheme. Extensive numerical experiments on image deblurring confirm the theoretical results and the effectiveness of the proposed deep learning-based solutions to stably solve noisy inverse problems.
    Connecting metrics for shape-texture knowledge in computer vision. (arXiv:2301.10608v1 [cs.CV])
    Modern artificial neural networks, including convolutional neural networks and vision transformers, have mastered several computer vision tasks, including object recognition. However, there are many significant differences between the behavior and robustness of these systems and of the human visual system. Deep neural networks remain brittle and susceptible to many changes in the image that do not cause humans to misclassify images. Part of this different behavior may be explained by the type of features humans and deep neural networks use in vision tasks. Humans tend to classify objects according to their shape while deep neural networks seem to rely mostly on texture. Exploring this question is relevant, since it may lead to better performing neural network architectures and to a better understanding of the workings of the vision system of primates. In this work, we advance the state of the art in our understanding of this phenomenon, by extending previous analyses to a much larger set of deep neural network architectures. We found that the performance of models in image classification tasks is highly correlated with their shape bias measured at the output and penultimate layer. Furthermore, our results showed that the number of neurons that represent shape and texture are strongly anti-correlated, thus providing evidence that there is competition between these two types of features. Finally, we observed that while in general there is a correlation between performance and shape bias, there are significant variations between architecture families.
    FewShotTextGCN: K-hop neighborhood regularization for few-shot learning on graphs. (arXiv:2301.10481v1 [cs.CL])
    We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is $\sim$7 times less dense and yields better performance in low-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as few as 20 samples for training, we outperform a strong TextGCN baseline by 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.
    User-Interactive Offline Reinforcement Learning. (arXiv:2205.10629v2 [cs.LG] UPDATED)
    Offline reinforcement learning algorithms still lack trust in practice due to the risk that the learned policy performs worse than the original policy that generated the dataset or behaves in an unexpected way that is unfamiliar to the user. At the same time, offline RL algorithms are not able to tune their most important hyperparameter: the proximity of the learned policy to the original policy. We propose an algorithm that allows the user to tune this hyperparameter at runtime, thereby addressing both of the above-mentioned issues simultaneously. This allows users to start with the original behavior and grant successively greater deviation, and to stop at any time when the policy deteriorates or the behavior is too far from the familiar one.
    Near-Optimal No-Regret Learning for Correlated Equilibria in Multi-Player General-Sum Games. (arXiv:2111.06008v3 [cs.LG] UPDATED)
    Recently, Daskalakis, Fishelson, and Golowich (DFG) (NeurIPS`21) showed that if all agents in a multi-player general-sum normal-form game employ Optimistic Multiplicative Weights Update (OMWU), the external regret of every player is $O(\textrm{polylog}(T))$ after $T$ repetitions of the game. We extend their result from external regret to internal regret and swap regret, thereby establishing uncoupled learning dynamics that converge to an approximate correlated equilibrium at the rate of $\tilde{O}(T^{-1})$. This substantially improves over the prior best rate of convergence for correlated equilibria of $O(T^{-3/4})$ due to Chen and Peng (NeurIPS`20), and it is optimal -- within the no-regret framework -- up to polylogarithmic factors in $T$. To obtain these results, we develop new techniques for establishing higher-order smoothness for learning dynamics involving fixed point operations. Specifically, we establish that the no-internal-regret learning dynamics of Stoltz and Lugosi (Mach Learn`05) are equivalently simulated by no-external-regret dynamics on a combinatorial space. This allows us to trade the computation of the stationary distribution on a polynomial-sized Markov chain for a (much more well-behaved) linear transformation on an exponential-sized set, enabling us to leverage similar techniques as DFG to near-optimally bound the internal regret. Moreover, we establish an $O(\textrm{polylog}(T))$ no-swap-regret bound for the classic algorithm of Blum and Mansour (BM) (JMLR`07). We do so by introducing a technique based on the Cauchy Integral Formula that circumvents the more limited combinatorial arguments of DFG. In addition to shedding clarity on the near-optimal regret guarantees of BM, our arguments provide insights into the various ways in which the techniques by DFG can be extended and leveraged in the analysis of more involved learning algorithms.
    Distributed Control of Partial Differential Equations Using Convolutional Reinforcement Learning. (arXiv:2301.10737v1 [cs.LG])
    We present a convolutional framework which significantly reduces the complexity and thus, the computational effort for distributed reinforcement learning control of dynamical systems governed by partial differential equations (PDEs). Exploiting translational invariances, the high-dimensional distributed control problem can be transformed into a multi-agent control problem with many identical, uncoupled agents. Furthermore, using the fact that information is transported with finite velocity in many cases, the dimension of the agents' environment can be drastically reduced using a convolution operation over the state space of the PDE. In this setting, the complexity can be flexibly adjusted via the kernel width or by using a stride greater than one. Moreover, scaling from smaller to larger systems -- or the transfer between different domains -- becomes a straightforward task requiring little effort. We demonstrate the performance of the proposed framework using several PDE examples with increasing complexity, where stabilization is achieved by training a low-dimensional deep deterministic policy gradient agent using minimal computing resources.
    Tighter Bounds on the Expressivity of Transformer Encoders. (arXiv:2301.10743v1 [cs.LG])
    Characterizing neural networks in terms of better-understood formal systems has the potential to yield new insights into the power and limitations of these networks. Doing so for transformers remains an active area of research. Bhattamishra and others have shown that transformer encoders are at least as expressive as a certain kind of counter machine, while Merrill and Sabharwal have shown that fixed-precision transformer encoders recognize only languages in uniform $TC^0$. We connect and strengthen these results by identifying a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders. This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
    Spatio-Temporal Graph Neural Networks: A Survey. (arXiv:2301.10569v1 [cs.LG])
    Graph Neural Networks have gained huge interest in the past few years. These powerful algorithms expanded deep learning models to non-Euclidean space and were able to achieve state-of-the-art performance in various applications including recommender systems and social networks. However, this performance is based on the assumption of static graph structures, which limits the performance of Graph Neural Networks when the data varies with time. Temporal Graph Neural Networks are an extension of Graph Neural Networks that take the time factor into account. Recently, various Temporal Graph Neural Network algorithms were proposed and achieved superior performance compared to other deep learning algorithms in several time-dependent applications. This survey discusses interesting topics related to Spatio-Temporal Graph Neural Networks, including algorithms, applications, and open challenges.
    Evaluation of the syllables pronunciation quality in speech rehabilitation through the solution of the classification problem. (arXiv:2301.10585v1 [cs.LG])
    The solution of the problem of assessing the quality of the pronunciation of syllables during speech rehabilitation after surgical treatment of oncological diseases of the organs of the speech-forming tract is considered in the work. The assessment is carried out by solving the problem of classifying syllables into two classes: before and immediately after surgical treatment. A classifier is built on the basis of the LSTM neural network and trained on the records before the operation and immediately after it, before the start of speech rehabilitation. The measure of assessing the quality of syllables pronunciation in the process of rehabilitation is the metric of belonging to the class before the operation. A study is being made of the influence of taking into account problematic phonemes, the gender of the patient, his individual characteristics on the resulting estimates of the quality of pronunciation. A comparison with existing types of syllable pronunciation quality assessments is carried out, recommendations are given for the practical application of the resulting new class of pronunciation quality assessments.
    Prediction of COVID-19 by Its Variants using Multivariate Data-driven Deep Learning Models. (arXiv:2301.10616v1 [cs.CE])
    The Coronavirus Disease 2019 or the COVID-19 pandemic has swept almost all parts of the world since the first case was found in Wuhan, China, in December 2019. With the increasing number of COVID-19 cases in the world, SARS-CoV-2 has mutated into various variants. Given the increasingly dangerous conditions of the pandemic, it is crucial to know when the pandemic will stop by predicting confirmed cases of COVID-19. Therefore, many studies have raised COVID-19 as a case study to overcome the ongoing pandemic using the Deep Learning method, namely LSTM, with reasonably accurate results and small error values. LSTM training is used to predict confirmed cases of COVID-19 based on variants that have been identified using ECDC's COVID-19 dataset containing confirmed cases of COVID-19 that have been identified from 30 countries in Europe. Tests were conducted using the LSTM and BiLSTM models with the addition of RNN as comparisons on hidden size and layer size. The obtained result showed that in testing hidden sizes 25, 50, 75 to 100, the RNN model provided better results, with the minimum MSE value of 0.01 and the RMSE value of 0.012 for B.1.427/B.1.429 variant with hidden size 100. In further testing of layer sizes 2, 3, 4, and 5, the result shows that the BiLSTM model provided better results, with minimum MSE value of 0.01 and the RMSE of 0.01 for the B.1.427/B.1.429 variant with hidden size 100 and layer size 2.
    Learning Policies with Zero or Bounded Constraint Violation for Constrained MDPs. (arXiv:2106.02684v3 [cs.LG] UPDATED)
    We address the issue of safety in reinforcement learning. We pose the problem in an episodic framework of a constrained Markov decision process. Existing results have shown that it is possible to achieve a reward regret of $\tilde{\mathcal{O}}(\sqrt{K})$ while allowing an $\tilde{\mathcal{O}}(\sqrt{K})$ constraint violation in $K$ episodes. A critical question that arises is whether it is possible to keep the constraint violation even smaller. We show that when a strictly safe policy is known, then one can confine the system to zero constraint violation with arbitrarily high probability while keeping the reward regret of order $\tilde{\mathcal{O}}(\sqrt{K})$. The algorithm which does so employs the principle of optimistic pessimism in the face of uncertainty to achieve safe exploration. When no strictly safe policy is known, though one is known to exist, then it is possible to restrict the system to bounded constraint violation with arbitrarily high probability. This is shown to be realized by a primal-dual algorithm with an optimistic primal estimate and a pessimistic dual update.
    Trainable Loss Weights in Super-Resolution. (arXiv:2301.10575v1 [cs.CV])
    In recent years, research on super-resolution has primarily focused on the development of unsupervised models, blind networks, and the use of optimization methods in non-blind models. However, limited research has discussed the loss function in the super-resolution process, and the majority of those studies have only used perceptual similarity in a conventional way, even though the development of an appropriate loss can improve the quality of other methods as well. In this article, a new weighting method for pixel-wise loss is proposed. With the help of this method, it is possible to use trainable weights based on the general structure of the image and its perceptual features while maintaining the advantages of pixel-wise loss. Also, a criterion for comparing weights of loss is introduced so that the weights can be estimated directly by a convolutional neural network using this criterion. In addition, the expectation-maximization method is used for the simultaneous estimation of the super-resolution network and the weighting network. Furthermore, a new activation function, called "FixedSum", is introduced which can keep the sum of all components of a vector constant while keeping the output components between zero and one. As shown in the experimental results section, weighted loss by the proposed method leads to better results than the unweighted loss in both signal-to-noise and perceptual similarity senses.
    Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training. (arXiv:2206.11215v3 [cs.CV] UPDATED)
    We consider a certifiable object pose estimation problem, where -- given a partial point cloud of an object -- the goal is to not only estimate the object pose, but also to provide a certificate of correctness for the resulting estimate. Our first contribution is a general theory of certification for end-to-end perception models. In particular, we introduce the notion of $\zeta$-correctness, which bounds the distance between an estimate and the ground truth. We show that $\zeta$-correctness can be assessed by implementing two certificates: (i) a certificate of observable correctness, that asserts if the model output is consistent with the input data and prior information, (ii) a certificate of non-degeneracy, that asserts whether the input data is sufficient to compute a unique estimate. Our second contribution is to apply this theory and design a new learning-based certifiable pose estimator. We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with the two certificates, to solve the certifiable pose estimation problem. C-3PO also includes a keypoint corrector, implemented as a differentiable optimization layer, that can correct large detection errors (e.g. due to the sim-to-real gap). Our third contribution is a novel self-supervised training approach that uses our certificate of observable correctness to provide the supervisory signal to C-3PO during training. In it, the model trains only on the observably correct input-output pairs, in each training iteration. As training progresses, we see that the observably correct input-output pairs grow, eventually reaching near 100% in many cases. Our experiments show that (i) standard semantic-keypoint-based methods outperform more recent alternatives, (ii) C-3PO further improves performance and significantly outperforms all the baselines, and (iii) C-3PO's certificates are able to discern correct pose estimates.
    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for newly added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lambda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda grows. On the other hand, compared to the original BLS, both the proposed BLS algorithms theoretically require lower complexity, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Bridging Text and Knowledge with Multi-Prototype Embedding for Few-Shot Relational Triple Extraction. (arXiv:2010.16059v1 [cs.CL] CROSS LISTED)
    Current supervised relational triple extraction approaches require huge amounts of labeled data and thus suffer from poor performance in few-shot settings. However, people can grasp new knowledge by learning a few instances. To this end, we take the first step to study few-shot relational triple extraction, which has not been well understood. Unlike previous single-task few-shot problems, relational triple extraction is more challenging as the entities and relations have implicit correlations. In this paper, we propose a novel multi-prototype embedding network model to jointly extract the composition of relational triples, namely, entity pairs and corresponding relations. To be specific, we design a hybrid prototypical learning mechanism that bridges text and knowledge concerning both entities and relations. Thus, implicit correlations between entities and relations are injected. Additionally, we propose a prototype-aware regularization to learn more representative prototypes. Experimental results demonstrate that the proposed method can improve the performance of few-shot triple extraction.
    LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting. (arXiv:2109.00720v5 [cs.CL] CROSS LISTED)
    Most NER methods rely on extensive labeled data for model training and therefore struggle in low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be summarized as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct a unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module by incorporating learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt pre-trained weights. Note that we only tune the inserted module, with all parameters of the pre-trained language model fixed, which makes our approach lightweight and flexible for low-resource scenarios and better able to transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot.
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v3 [cs.CV] UPDATED)
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    PULL: Reactive Log Anomaly Detection Based On Iterative PU Learning. (arXiv:2301.10681v1 [cs.LG])
    Due to the complexity of modern IT services, failures can be manifold, occur at any stage, and are hard to detect. For this reason, anomaly detection applied to monitoring data such as logs allows gaining relevant insights to improve IT services steadily and eradicate failures. However, existing anomaly detection methods that provide high accuracy often rely on labeled training data, which are time-consuming to obtain in practice. Therefore, we propose PULL, an iterative log analysis method for reactive anomaly detection based on estimated failure time windows provided by monitoring systems instead of labeled data. Our attention-based model uses a novel objective function for weak supervision deep learning that accounts for imbalanced data and applies an iterative learning strategy for positive and unknown samples (PU learning) to identify anomalous logs. Our evaluation shows that PULL consistently outperforms ten benchmark baselines across three different datasets and detects anomalous log messages with an F1-score of more than 0.99 even within imprecise failure time windows.
    Probing Taxonomic and Thematic Embeddings for Taxonomic Information. (arXiv:2301.10656v1 [cs.CL])
    Modelling taxonomic and thematic relatedness is important for building AI with comprehensive natural language understanding. The goal of this paper is to learn more about how taxonomic information is structurally encoded in embeddings. To do this, we design a new hypernym-hyponym probing task and perform a comparative probing study of taxonomic and thematic SGNS and GloVe embeddings. Our experiments indicate that both types of embeddings encode some taxonomic information, but the amount, as well as the geometric properties of the encodings, are independently related to both the encoder architecture, as well as the embedding training data. Specifically, we find that only taxonomic embeddings carry taxonomic information in their norm, which is determined by the underlying distribution in the data.
    XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. (arXiv:2301.10472v1 [cs.CL])
    Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilingual models like XLM-R. In this paper, we introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap and assigning vocabulary capacity to achieve sufficient coverage for each individual language. Tokenizations using our vocabulary are typically more semantically meaningful and shorter compared to XLM-R. Leveraging this improved vocabulary, we train XLM-V, a multilingual language model with a one million token vocabulary. XLM-V outperforms XLM-R on every task we tested on ranging from natural language inference (XNLI), question answering (MLQA, XQuAD, TyDiQA), and named entity recognition (WikiAnn) to low-resource tasks (Americas NLI, MasakhaNER).
    A Boosting Approach to Reinforcement Learning. (arXiv:2108.09767v2 [cs.LG] UPDATED)
    Reducing reinforcement learning to supervised learning is a well-studied and effective approach that leverages the benefits of compact function approximation to deal with large-scale Markov decision processes. Independently, the boosting methodology (e.g. AdaBoost) has proven to be indispensable in designing efficient and accurate classification algorithms by combining inaccurate rules-of-thumb. In this paper, we take a further step: we reduce reinforcement learning to a sequence of weak learning problems. Since weak learners perform only marginally better than random guesses, such subroutines constitute a weaker assumption than the availability of an accurate supervised learning oracle. We prove that the sample complexity and running time bounds of the proposed method do not explicitly depend on the number of states. While existing results on boosting operate on convex losses, the value function over policies is non-convex. We show how to use a non-convex variant of the Frank-Wolfe method for boosting, that additionally improves upon the known sample complexity and running time even for reductions to supervised learning.
    Transfer Learning in Deep Learning Models for Building Load Forecasting: Case of Limited Data. (arXiv:2301.10663v1 [cs.LG])
    Precise load forecasting in buildings could increase the bill savings potential and facilitate optimized strategies for power generation planning. With the rapid evolution of computer science, data-driven techniques, in particular Deep Learning models, have become a promising solution for the load forecasting problem. These models have shown accurate forecasting results; however, they need an abundant amount of historical data to maintain their performance. For new buildings and buildings with low-resolution measuring equipment, it is difficult to get enough historical data, leading to poor forecasting performance. In order to adapt Deep Learning models for buildings with limited and scarce data, this paper proposes a Building-to-Building Transfer Learning framework to overcome the problem and enhance the performance of Deep Learning models. The transfer learning approach was applied to the Transformer model due to its efficacy in capturing data trends. The performance of the algorithm was tested on a large commercial building with limited data. The result showed that the proposed approach improved the forecasting accuracy by 56.8% compared to the case of conventional deep learning where training from scratch is used. The paper also compared the proposed Transformer model to other sequential deep learning models such as Long Short-Term Memory (LSTM) and Recurrent Neural Network (RNN). The Transformer model outperformed the other models, reducing the root mean square error to 0.009, compared to LSTM with 0.011 and RNN with 0.051.
    Dimensionality Expansion of Load Monitoring Time Series and Transfer Learning for EMS. (arXiv:2204.02802v3 [cs.LG] UPDATED)
    Energy management systems (EMS) rely on (non)-intrusive load monitoring (N)ILM to monitor and manage appliances and help residents be more energy efficient and thus more frugal. The robustness as well as the transfer potential of the most promising machine learning solutions for (N)ILM is not yet fully understood as they are trained and evaluated on relatively limited data. In this paper, we propose a new approach for load monitoring in building EMS based on dimensionality expansion of time series and transfer learning. We perform an extensive evaluation on 5 different low-frequency datasets. The proposed feature dimensionality expansion using video-like transformation and resource-aware deep learning architecture achieves an average weighted F1 score of 0.88 across the datasets with 29 appliances and is computationally more efficient compared to the state-of-the-art imaging methods. Investigating the proposed method for cross-dataset intra-domain transfer learning, we find that 1) our method performs with an average weighted F1 score of 0.80 while requiring 3-times fewer epochs for model training compared to the non-transfer approach, 2) can achieve an F1 score of 0.75 with only 230 data samples, and 3) our transfer approach outperforms the state-of-the-art in precision drop by up to 12 percentage points for unseen appliances.
    What are the Machine Learning best practices reported by practitioners on Stack Exchange?. (arXiv:2301.10516v1 [cs.SE])
    Machine Learning (ML) is being used in multiple disciplines due to its powerful capability to infer relationships within data. In particular, Software Engineering (SE) is one of those disciplines in which ML has been used for multiple tasks, like software categorization, bugs prediction, and testing. In addition to the multiple ML applications, some studies have been conducted to detect and understand possible pitfalls and issues when using ML. However, to the best of our knowledge, only a few studies have focused on presenting ML best practices or guidelines for the application of ML in different domains. In addition, the practices and literature presented in previous literature (i) are domain-specific (e.g., concrete practices in biomechanics), (ii) describe few practices, or (iii) the practices lack rigorous validation and are presented in gray literature. In this paper, we present a study listing 127 ML best practices systematically mining 242 posts of 14 different Stack Exchange (STE) websites and validated by four independent ML experts. The list of practices is presented in a set of categories related to different stages of the implementation process of an ML-enabled system; for each practice, we include explanations and examples. In all the practices, the provided examples focus on SE tasks. We expect this list of practices could help practitioners to understand better the practices and use ML in a more informed way, in particular newcomers to this new area that sits at the intersection of software engineering and machine learning.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is, when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in an infinite-dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Backward Compatibility During Data Updates by Weight Interpolation. (arXiv:2301.10546v1 [cs.LG])
    Backward compatibility of model predictions is a desired property when updating a machine learning driven application. It allows the underlying model to be improved seamlessly without introducing regression bugs. In classification tasks these bugs occur in the form of negative flips: an instance that was correctly classified by the old model is classified incorrectly by the updated model. This has a direct negative impact on the user experience of such systems, e.g., when a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is that new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model, and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straightforward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
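    The interpolation step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `interpolate_weights`, the dictionary-of-lists weight layout, and the convention that `alpha` weights the old model are all assumptions made for this example.

```python
def interpolate_weights(old_weights, new_weights, alpha):
    """Linearly interpolate two models' parameters.

    Hypothetical layout: each argument maps a parameter name to a flat
    list of floats. Assumed convention: alpha = 1 returns the old model,
    alpha = 0 returns the new model.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if old_weights.keys() != new_weights.keys():
        raise ValueError("models must share the same parameter names")

    merged = {}
    for name, old_values in old_weights.items():
        new_values = new_weights[name]
        if len(old_values) != len(new_values):
            raise ValueError("shape mismatch for parameter " + name)
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            alpha * o + (1.0 - alpha) * n
            for o, n in zip(old_values, new_values)
        ]
    return merged
```

    Because the result is a single set of weights with the same shape as either input, serving it costs the same as serving one model, which matches the abstract's remark that inference cost does not increase.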
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which a common model set is used for model averaging, instead of selecting a single final model via a model selection procedure, in order to account for model uncertainty and improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for the meta-learner: analysts use historical similar tasks to extract information about the prior. The other is for the base-learner: a subsampling method is used to process the data step by step. Theoretically, an upper bound on the risk of our algorithm is presented to guarantee worst-case performance. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employed the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we obtained that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we designated an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate the state-of-the-art performance.
    Automated multilingual detection of Pro-Kremlin propaganda in newspapers and Telegram posts. (arXiv:2301.10604v1 [cs.CL])
    The full-scale conflict between the Russian Federation and Ukraine generated an unprecedented amount of news articles and social media data reflecting opposing ideologies and narratives. These polarized campaigns have led to mutual accusations of misinformation and fake news, shaping an atmosphere of confusion and mistrust for readers worldwide. This study analyses how the media affected and mirrored public opinion during the first month of the war using news articles and Telegram news channels in Ukrainian, Russian, Romanian and English. We propose and compare two methods of multilingual automated pro-Kremlin propaganda identification, based on Transformers and linguistic features. We analyse the advantages and disadvantages of both methods, their adaptability to new genres and languages, and ethical considerations of their usage for content moderation. With this work, we aim to lay the foundation for further development of moderation tools tailored to the current conflict.  ( 2 min )
    E(n)-equivariant Graph Neural Cellular Automata. (arXiv:2301.10497v1 [cs.LG])
    Cellular automata (CAs) are computational models exhibiting rich dynamics emerging from the local interaction of cells arranged in a regular lattice. Graph CAs (GCAs) generalise standard CAs by allowing for arbitrary graphs rather than regular lattices, similar to how Graph Neural Networks (GNNs) generalise Convolutional NNs. Recently, Graph Neural CAs (GNCAs) have been proposed as models built on top of standard GNNs that can be trained to approximate the transition rule of any arbitrary GCA. Existing GNCAs are anisotropic in the sense that their transition rules are not equivariant to translation, rotation, and reflection of the nodes' spatial locations. However, it is desirable for instances related by such transformations to be treated identically by the model. By replacing standard graph convolutions with E(n)-equivariant ones, we avoid anisotropy by design and propose a class of isotropic automata that we call E(n)-GNCAs. These models are lightweight, but can nevertheless handle large graphs, capture complex dynamics and exhibit emergent self-organising behaviours. We showcase the broad and successful applicability of E(n)-GNCAs on three different tasks: (i) pattern formation, (ii) graph auto-encoding, and (iii) simulation of E(n)-equivariant dynamical systems.  ( 2 min )
    When to Trust Aggregated Gradients: Addressing Negative Client Sampling in Federated Learning. (arXiv:2301.10400v1 [cs.LG])
    Federated Learning has become a widely-used framework which allows learning a global model on decentralized local datasets under the condition of protecting local data privacy. However, federated learning faces severe optimization difficulty when training samples are not independently and identically distributed (non-i.i.d.). In this paper, we point out that the client sampling practice plays a decisive role in the aforementioned optimization difficulty. We find that the negative client sampling will cause the merged data distribution of currently sampled clients heavily inconsistent with that of all available clients, and further make the aggregated gradient unreliable. To address this issue, we propose a novel learning rate adaptation mechanism to adaptively adjust the server learning rate for the aggregated gradient in each round, according to the consistency between the merged data distribution of currently sampled clients and that of all available clients. Specifically, we make theoretical deductions to find a meaningful and robust indicator that is positively related to the optimal server learning rate and can effectively reflect the merged data distribution of sampled clients, and we utilize it for the server learning rate adaptation. Extensive experiments on multiple image and text classification tasks validate the great effectiveness of our method.  ( 2 min )
    Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes. (arXiv:2206.08852v2 [cs.LG] UPDATED)
    Quantization is widely employed in both cloud and edge systems to reduce the memory occupation, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially with optimized bit-width assignments determined by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27% respectively compared to a layer-wise approach, for the same accuracy.
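To make the layer-wise versus channel-wise distinction concrete, here is a minimal Python sketch of symmetric uniform quantization in which every weight channel gets its own bit-width. It is an illustration under stated assumptions, not the paper's NAS: the function names, the example weights, and the hand-picked bit-widths are hypothetical, whereas in the paper the bit-widths are chosen by the search.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform quantization of one channel (assumes bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1                   # largest positive integer level
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0     # step size for this channel
    return [round(w / scale) * scale for w in weights]


def quantize_channelwise(channels, bits_per_channel):
    """Quantize each channel with its own bit-width."""
    return [quantize_channel(ch, b) for ch, b in zip(channels, bits_per_channel)]


def max_abs_error(original, quantized):
    return max(abs(o - q) for o, q in zip(original, quantized))


# Hypothetical layer with two output channels and hand-picked bit-widths.
channels = [[0.9, -0.31, 0.05, 0.6], [0.02, -0.01, 0.03, 0.0]]
bits = [8, 2]
quantized = quantize_channelwise(channels, bits)

# Same channel at two precisions, to show the accuracy/size trade-off.
err_8bit = max_abs_error(channels[0], quantize_channel(channels[0], 8))
err_2bit = max_abs_error(channels[0], quantize_channel(channels[0], 2))
```

A layer-wise scheme would force both channels to share one bit-width; the per-channel list `bits` is what the wider search space adds.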
    Improved Stock Price Movement Classification Using News Articles Based on Embeddings and Label Smoothing. (arXiv:2301.10458v1 [cs.LG])
    Stock price movement prediction is a challenging and essential problem in finance. While it is well established in modern behavioral finance that the share prices of related stocks often move after the release of news via reactions and overreactions of investors, how to capture the relationships between price movements and news articles via quantitative models is an active area of research; existing models have achieved varying degrees of success. In this paper, we propose to improve stock price movement classification using news articles by incorporating regularization and optimization techniques from deep learning. More specifically, we capture the dependencies between news articles and stocks through embeddings and bidirectional recurrent neural networks as in recent models. We further incorporate weight decay, batch normalization, dropout, and label smoothing to improve the generalization of the trained models. To handle the high fluctuations in validation accuracy caused by batch normalization, we propose dual-phase training to realize the improvements reliably. Our experimental results on a commonly used dataset show significant improvements, achieving average accuracy of 80.7% on the test set, which is more than 10.0% absolute improvement over existing models. Our ablation studies show batch normalization and label smoothing are most effective, leading to 6.0% and 3.4% absolute improvement, respectively on average.  ( 2 min )
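The abstract does not say which label smoothing variant is used. As a minimal sketch, assuming the common uniform formulation, each one-hot target is mixed with the uniform distribution over classes; the function name and the value `epsilon=0.1` are illustrative assumptions.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over its classes."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]


# Binary up/down movement label, with "up" as the true class.
target = smooth_labels([0.0, 1.0], epsilon=0.1)
```

The smoothed target still sums to one, but the true class no longer has probability exactly 1, which discourages over-confident predictions.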
    SGCN: Exploiting Compressed-Sparse Features in Deep Graph Convolutional Network Accelerators. (arXiv:2301.10388v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they overcome the limited applicability of prior neural networks. A GCN takes as input an arbitrarily structured graph and executes a series of layers which exploit the graph's structure to calculate their output features. One recent trend in GCNs is the use of deep network architectures. As opposed to the traditional GCNs which only span around two to five layers deep, modern GCNs now incorporate tens to hundreds of layers with the help of residual connections. In such deep GCNs, we identify an important characteristic: they exhibit very high intermediate feature sparsity. We observe that with deep layers and residual connections, the number of zeros in the intermediate features sharply increases. This reveals a new opportunity for accelerators to exploit in GCN executions that was previously not present. In this paper, we propose SGCN, a fast and energy-efficient GCN accelerator which fully exploits the sparse intermediate features of modern GCNs. SGCN introduces several techniques to achieve significantly higher performance and energy efficiency than the existing accelerators. First, SGCN employs a GCN-friendly feature compression format. We focus on reducing the off-chip memory traffic, which often is the bottleneck for GCN executions. Second, we propose microarchitectures for seamlessly handling the compressed feature format. Third, to better handle locality in the presence of varying sparsity, SGCN employs sparsity-aware cooperation. Sparsity-aware cooperation creates a pattern that exhibits multiple reuse windows, such that the cache can capture diverse sizes of working sets and therefore adapt to the varying level of sparsity. We show that SGCN achieves 1.71x speedup and 43.9% higher energy efficiency compared to the existing accelerators.  ( 2 min )
    Overcoming Prior Misspecification in Online Learning to Rank. (arXiv:2301.10651v1 [cs.LG])
    The recent literature on online learning to rank (LTR) has established the utility of prior knowledge to Bayesian ranking bandit algorithms. However, a major limitation of existing work is the requirement for the prior used by the algorithm to match the true prior. In this paper, we propose and analyze adaptive algorithms that address this issue and additionally extend these results to the linear and generalized linear models. We also consider scalar relevance feedback on top of click feedback. Moreover, we demonstrate the efficacy of our algorithms using both synthetic and real-world experiments.  ( 2 min )
    Banker Online Mirror Descent: A Universal Approach for Delayed Online Bandit Learning. (arXiv:2301.10500v1 [cs.LG])
    We propose `Banker-OMD`, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in the online learning literature. The `Banker-OMD` framework almost completely decouples feedback delay handling and the task-specific OMD algorithm design, thus allowing the easy design of new algorithms capable of easily and robustly handling feedback delays. Specifically, it offers a general methodology for achieving $\tilde{\mathcal O}(\sqrt{T} + \sqrt{D})$-style regret bounds in online bandit learning tasks with delayed feedback, where $T$ is the number of rounds and $D$ is the total feedback delay. We demonstrate the power of `Banker-OMD` by applications to two important bandit learning scenarios with delayed feedback, including delayed scale-free adversarial Multi-Armed Bandits (MAB) and delayed adversarial linear bandits. `Banker-OMD` leads to the first delayed scale-free adversarial MAB algorithm achieving $\tilde{\mathcal O}(\sqrt{K(D+T)}L)$ regret and the first delayed adversarial linear bandit algorithm achieving $\tilde{\mathcal O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret. As a corollary, the first application also implies $\tilde{\mathcal O}(\sqrt{KT}L)$ regret for non-delayed scale-free adversarial MABs, which is the first to match the $\Omega(\sqrt{KT}L)$ lower bound up to logarithmic factors and can be of independent interest.  ( 2 min )
    Integrating Local Real Data with Global Gradient Prototypes for Classifier Re-Balancing in Federated Long-Tailed Learning. (arXiv:2301.10394v1 [cs.LG])
    Federated Learning (FL) has become a popular distributed learning paradigm that involves multiple clients training a global model collaboratively in a data privacy-preserving manner. However, the data samples usually follow a long-tailed distribution in the real world, and FL on the decentralized and long-tailed data yields a poorly-behaved global model severely biased to the head classes with the majority of the training samples. To alleviate this issue, decoupled training has recently been introduced to FL, considering it has achieved promising results in centralized long-tailed learning by re-balancing the biased classifier after the instance-balanced training. However, the current study restricts the capacity of decoupled training in federated long-tailed learning with a sub-optimal classifier re-trained on a set of pseudo features, due to the unavailability of a global balanced dataset in FL. In this work, in order to re-balance the classifier more effectively, we integrate the local real data with the global gradient prototypes to form the local balanced datasets, and thus re-balance the classifier during the local training. Furthermore, we introduce an extra classifier in the training phase to help model the global data distribution, which addresses the problem of contradictory optimization goals caused by performing classifier re-balancing locally. Extensive experiments show that our method consistently outperforms the existing state-of-the-art methods in various settings.  ( 2 min )
    ViDeBERTa: A powerful pre-trained language model for Vietnamese. (arXiv:2301.10439v1 [cs.CL])
    This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using the DeBERTa architecture. Although many successful pre-trained language models based on the Transformer have been proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that achieve good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks: Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs as well as or better than the previous state-of-the-art model. Our ViDeBERTa models are available at: https://github.com/HySonLab/ViDeBERTa.  ( 2 min )
    Capacity Analysis of Vector Symbolic Architectures. (arXiv:2301.10352v1 [cs.LG])
    Hyperdimensional computing (HDC) is a biologically-inspired framework that uses high-dimensional vectors and various vector operations to represent and manipulate symbols. The ensemble of a particular vector space and two vector operations (one addition-like for "bundling" and one outer-product-like for "binding") forms what is called a "vector symbolic architecture" (VSA). While VSAs have been employed in numerous applications and studied empirically, many theoretical questions about VSAs remain open. We provide theoretical analyses for the *representation capacities* of three popular VSAs: MAP-I, MAP-B, and Binary Sparse. Representation capacity here refers to upper bounds on the dimensions of the VSA vectors required to perform certain symbolic tasks (such as testing for set membership $i \in S$ and estimating set intersection sizes $|S \cap T|$) to a given degree of accuracy. We also describe a relationship between the MAP-I VSA and Hopfield networks, which are simple models of associative memory, and analyze the ability of Hopfield networks to perform some of the same tasks that are typically asked of VSAs. Our analysis of MAP-I casts the VSA vectors as the outputs of *sketching* (dimensionality reduction) algorithms such as the Johnson-Lindenstrauss transform; this provides a clean, simple framework for obtaining bounds on MAP-I's representation capacity. We also provide, to our knowledge, the first analysis of testing set membership in a bundle of general pairwise bindings from MAP-I. Binary sparse VSAs are well-known to be related to Bloom filters; we give analyses of set intersection for Bloom and Counting Bloom filters. Our analysis of MAP-B and Binary Sparse bundling includes new applications of several concentration inequalities.  ( 2 min )
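The set-membership task mentioned above can be sketched in a few lines of Python. This is a toy illustration, assuming MAP-I atoms are random vectors with entries in {-1, +1} and that bundling is an elementwise integer sum; the dimension, the threshold of 0.5, and the symbol names are hypothetical choices, not values from the paper.

```python
import random


def random_atom(dim, rng):
    """Random bipolar vector standing for one symbol."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bundle(vectors):
    """Addition-like bundling: elementwise integer sum of the atoms."""
    return [sum(column) for column in zip(*vectors)]


def is_member(atom, bundled, threshold=0.5):
    """Membership test: normalized dot product is near 1 for members, near 0 otherwise."""
    similarity = sum(a * b for a, b in zip(atom, bundled)) / len(atom)
    return similarity > threshold


rng = random.Random(0)
dim = 2000
codebook = {name: random_atom(dim, rng) for name in "abcdefghij"}
bundled_set = bundle([codebook[name] for name in "abcde"])   # S = {a, b, c, d, e}
results = {name: is_member(codebook[name], bundled_set) for name in codebook}
```

The test works because distinct random atoms are nearly orthogonal in high dimension; shrinking `dim` or growing the set makes errors more likely, which is the trade-off that a capacity bound quantifies.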
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments: designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.  ( 2 min )
    HealthEdge: A Machine Learning-Based Smart Healthcare Framework for Prediction of Type 2 Diabetes in an Integrated IoT, Edge, and Cloud Computing System. (arXiv:2301.10450v1 [cs.LG])
    Diabetes Mellitus has no permanent cure to date and is one of the leading causes of death globally. The alarming increase in diabetes calls for the need to take precautionary measures to avoid/predict the occurrence of diabetes. This paper proposes HealthEdge, a machine learning-based smart healthcare framework for type 2 diabetes prediction in an integrated IoT-edge-cloud computing system. Numerical experiments and comparative analysis were carried out between the two most used machine learning algorithms in the literature, Random Forest (RF) and Logistic Regression (LR), using two real-life diabetes datasets. The results show that RF predicts diabetes with 6% more accuracy on average compared to LR.  ( 2 min )
    AutoCost: Evolving Intrinsic Cost for Zero-violation Reinforcement Learning. (arXiv:2301.10339v1 [cs.LG])
    Safety is a critical hurdle that limits the application of deep reinforcement learning (RL) to real-world control tasks. To this end, constrained reinforcement learning leverages cost functions to improve safety in constrained Markov decision processes. However, such constrained RL methods fail to achieve zero violation even when the cost limit is zero. This paper analyzes the reason for such failure, which suggests that a proper cost function plays an important role in constrained RL. Inspired by the analysis, we propose AutoCost, a simple yet effective framework that automatically searches for cost functions that help constrained RL to achieve zero-violation performance. We validate the proposed method and the searched cost function on the safe RL benchmark Safety Gym. We compare the performance of augmented agents that use our cost function to provide additive intrinsic costs with baseline agents that use the same policy learners but with only extrinsic costs. Results show that the converged policies with intrinsic costs in all environments achieve zero constraint violation and comparable performance with baselines.  ( 2 min )
    A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection. (arXiv:2301.10454v1 [cs.LG])
    Current machine learning models achieve super-human performance in many real-world applications. Still, they are susceptible to imperceptible adversarial perturbations. The most effective solution for this problem is adversarial training, which trains the model with adversarially perturbed samples instead of the original ones. Various methods have been developed over recent years to improve adversarial training, such as data augmentation or modifying training attacks. In this work, we examine the same problem from a new data-centric perspective. For this purpose, we first demonstrate that the existing model-based methods can be equivalent to applying smaller perturbation or optimization weights to the hard training examples. Using this finding, we propose detecting and removing these hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects. For detection, we use maximum softmax probability, an effective method in out-of-distribution detection, since we can consider the hard samples as out-of-distribution samples for the whole data distribution. Our results on SVHN and CIFAR-10 datasets show the effectiveness of this method in improving the adversarial training without adding too much computational cost.  ( 2 min )
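The detection step can be sketched as follows: compute the maximum softmax probability (MSP) for each sample and keep only samples whose score clears a cutoff. The abstract does not say how the cutoff is chosen, so the fixed `threshold=0.6` and the function names below are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_probability(logits):
    """Confidence score used in out-of-distribution detection."""
    return max(softmax(logits))


def keep_confident(batch_logits, threshold=0.6):
    """Indices of samples whose MSP is at least the (assumed) threshold."""
    return [i for i, logits in enumerate(batch_logits)
            if max_softmax_probability(logits) >= threshold]


# Toy logits: samples 0 and 2 are confident, sample 1 is nearly uniform ("hard").
batch = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.05], [0.0, 3.0, 0.0]]
kept = keep_confident(batch)
```

Samples dropped by the filter are the low-confidence ones, which this approach treats as out-of-distribution with respect to the training data.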
    Near-Optimal No-Regret Learning in General Games. (arXiv:2108.06924v2 [cs.LG] UPDATED)
    We show that Optimistic Hedge -- a common variant of multiplicative-weights-updates with recency bias -- attains ${\rm poly}(\log T)$ regret in multi-player general-sum games. In particular, when every player of the game uses Optimistic Hedge to iteratively update her strategy in response to the history of play so far, then after $T$ rounds of interaction, each player experiences total regret that is ${\rm poly}(\log T)$. Our bound improves, exponentially, the $O({T}^{1/2})$ regret attainable by standard no-regret learners in games, the $O(T^{1/4})$ regret attainable by no-regret learners with recency bias (Syrgkanis et al., 2015), and the ${O}(T^{1/6})$ bound that was recently shown for Optimistic Hedge in the special case of two-player games (Chen & Peng, 2020). A corollary of our bound is that Optimistic Hedge converges to coarse correlated equilibrium in general games at a rate of $\tilde{O}\left(\frac 1T\right)$.
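For readers unfamiliar with the update, here is a minimal sketch of one Optimistic Hedge step in its usual textbook form: a multiplicative-weights update in which the most recent loss vector is counted twice and the previous one is subtracted (the recency bias). The step size `eta=0.1` and the toy loss sequence are assumptions; the paper's analysis concerns specific step-size choices not reproduced here.

```python
import math


def optimistic_hedge_step(strategy, loss, prev_loss, eta=0.1):
    """One update: weights scale by exp(-eta * (2 * loss - prev_loss)), then renormalize."""
    scaled = [x * math.exp(-eta * (2.0 * l - p))
              for x, l, p in zip(strategy, loss, prev_loss)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Toy single-player run: action 0 always incurs loss 1, action 1 incurs loss 0.
strategy = [0.5, 0.5]
prev_loss = [0.0, 0.0]
for loss in ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]):
    strategy = optimistic_hedge_step(strategy, loss, prev_loss)
    prev_loss = loss
```

Dropping the `prev_loss` term and using `loss` once recovers plain Hedge, so the only difference is the one-step prediction that the next loss will resemble the last.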
    A Provable Splitting Approach for Symmetric Nonnegative Matrix Factorization. (arXiv:2301.10499v1 [cs.LG])
    The symmetric Nonnegative Matrix Factorization (NMF), a special but important class of the general NMF, has found numerous applications in data analysis such as various clustering tasks. Unfortunately, designing fast algorithms for the symmetric NMF is not as easy as for its nonsymmetric counterpart, since the latter admits the splitting property that allows state-of-the-art alternating-type algorithms. To overcome this issue, we first split the decision variable and transform the symmetric NMF to a penalized nonsymmetric one, paving the way for designing efficient alternating-type algorithms. We then show that solving the penalized nonsymmetric reformulation returns a solution to the original symmetric NMF. Moreover, we design a family of alternating-type algorithms and show that they all admit a strong convergence guarantee: the generated sequence of iterates is convergent and converges at least sublinearly to a critical point of the original symmetric NMF. Finally, we conduct experiments on both synthetic data and real image clustering to support our theoretical results and demonstrate the performance of the alternating-type algorithms.  ( 2 min )
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. Imputation aims to fill the missing values with valid values. Most imputation methods train the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set a threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on time series imputation, we apply RDIS to various imputation models and achieve competitive results on two real-world datasets.
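The random-drop idea can be sketched as follows: hide a fraction of the observed values, so that their true values serve as ground truth for the imputation loss. This is only an illustration of that one step; the drop rate, the use of `None` to mark missing entries, and the function name are assumptions, and the self-training and entropy filtering stages are not shown.

```python
import random


def random_drop(series, drop_rate=0.2, seed=0):
    """Hide a random subset of the observed values.

    `series` uses None for originally missing entries. Returns the masked
    series and the sorted indices that were dropped (whose true values are known).
    """
    rng = random.Random(seed)
    observed = [i for i, value in enumerate(series) if value is not None]
    n_drop = int(len(observed) * drop_rate)
    dropped = set(rng.sample(observed, n_drop))
    masked = [None if i in dropped else value for i, value in enumerate(series)]
    return masked, sorted(dropped)


# Toy series with two originally missing values (indices 1 and 4).
series = [1.0, None, 2.0, 3.0, None, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
masked, dropped = random_drop(series, drop_rate=0.2)

# Supervised targets for the imputation model: the values that were hidden.
targets = {i: series[i] for i in dropped}
```

An imputation model would then be trained to predict `targets` from `masked`, which gives an explicit loss even though the originally missing entries have no ground truth.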
    ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System. (arXiv:2301.10577v1 [cs.CL])
    In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
    Multimodal Analogical Reasoning over Knowledge Graphs. (arXiv:2210.00312v3 [cs.CL] UPDATED)
    Analogical reasoning is fundamental to human cognition and holds an important place in various fields. However, previous studies mainly focus on single-modal analogical reasoning and do not take advantage of structure knowledge. Notably, research in cognitive psychology has demonstrated that information from multimodal sources always brings more powerful cognitive transfer than single modality sources. To this end, we introduce the new task of multimodal analogical reasoning over knowledge graphs, which requires multimodal reasoning ability with the help of background knowledge. Specifically, we construct a Multimodal Analogical Reasoning dataSet (MARS) and a multimodal knowledge graph MarKG. We evaluate with multimodal knowledge graph embedding and pre-trained Transformer baselines, illustrating the potential challenges of the proposed task. We further propose a novel model-agnostic Multimodal analogical reasoning framework with Transformer (MarT) motivated by the structure mapping theory, which can obtain better performance. Code and datasets are available at https://github.com/zjunlp/MKG_Analogy.
    OCD: Learning to Overfit with Conditional Diffusion Models. (arXiv:2210.00471v4 [cs.LG] UPDATED)
    We present a dynamic model in which the weights are conditioned on an input sample x and are learned to match those that would be obtained by finetuning a base model on x and its label y. This mapping between an input sample and network weights is approximated by a denoising diffusion model. The diffusion model we employ focuses on modifying a single layer of the base model and is conditioned on the input, activations, and output of this layer. Since the diffusion model is stochastic in nature, multiple initializations generate different networks, forming an ensemble, which leads to further improvements. Our experiments demonstrate the wide applicability of the method for image classification, 3D reconstruction, tabular data, speech separation, and natural language processing. Our code is available at https://github.com/ShaharLutatiPersonal/OCD
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    A blob method for inhomogeneous diffusion with applications to multi-agent control and sampling. (arXiv:2202.12927v3 [math.AP] UPDATED)
    As a counterpoint to classical stochastic particle methods for linear diffusion equations, we develop a deterministic particle method for the weighted porous medium equation (WPME) and prove its convergence on bounded time intervals. This generalizes related work on blob methods for unweighted porous medium equations. From a numerical analysis perspective, our method has several advantages: it is meshfree, preserves the gradient flow structure of the underlying PDE, converges in arbitrary dimension, and captures the correct asymptotic behavior in simulations. That our method succeeds in capturing the long time behavior of WPME is significant from the perspective of related problems in quantization. Just as the Fokker-Planck equation provides a way to quantize a probability measure $\bar{\rho}$ by evolving an empirical measure according to stochastic Langevin dynamics so that the empirical measure flows toward $\bar{\rho}$, our particle method provides a way to quantize $\bar{\rho}$ according to deterministic particle dynamics approximating WPME. In this way, our method has natural applications to multi-agent coverage algorithms and sampling probability measures. A specific case of our method corresponds exactly to confined mean-field dynamics of training a two-layer neural network for a radial basis function activation function. From this perspective, our convergence result shows that, in the overparametrized regime and as the variance of the radial basis functions goes to zero, the continuum limit is given by WPME. This generalizes previous results, which considered the case of a uniform data distribution, to the more general inhomogeneous setting. As a consequence of our convergence result, we identify conditions on the target function and data distribution for which convexity of the energy landscape emerges in the continuum limit.
    Learning to Rank Normalized Entropy Curves with Differentiable Window Transformation. (arXiv:2301.10443v1 [cs.LG])
    Recent automated machine learning systems often use learning curve ranking models to inform decisions about when to stop unpromising trials and identify better model configurations. In this paper, we present a novel learning curve ranking model specifically tailored for ranking normalized entropy (NE) learning curves, which are commonly used in online advertising and recommendation systems. Our proposed model, self-Adaptive Curve Transformation augmented Relative curve Ranking (ACTR2), features an adaptive curve transformation layer that transforms raw lifetime NE curves into composite window NE curves with the window sizes adaptively optimized based on both the position on the learning curve and the curve's dynamics. We also introduce a novel differentiable indexing method for the proposed adaptive curve transformation, which allows gradients with respect to the discrete indices to flow freely through the curve transformation layer, enabling the learned window sizes to be updated flexibly during training. Additionally, we propose a pairwise curve ranking architecture that directly models the difference between the two learning curves and is better at capturing subtle changes in relative performance that may not be evident when modeling each curve individually, as existing approaches do. Our extensive experiments on a real-world NE curve dataset demonstrate the effectiveness of each key component of ACTR2 and its improved performance over the state-of-the-art.
    MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately). (arXiv:2301.10638v1 [cs.LG])
    MLPGradientFlow is a software package to solve numerically the gradient flow differential equation $\dot \theta = -\nabla \mathcal L(\theta; \mathcal D)$, where $\theta$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes have better accuracy and convergence speed than gradient descent with the Adam optimizer. However, we find Newton's method and approximations like BFGS preferable to find fixed points (local and global minima of $\mathcal L$) efficiently and accurately. For small networks and data sets, gradients are usually computed faster than in pytorch and Hessian are computed at least $5\times$ faster. Additionally, the package features an integrator for a teacher-student setup with bias-free, two-layer networks trained with standard Gaussian input in the limit of infinite data. The code is accessible at https://github.com/jbrea/MLPGradientFlow.jl.  ( 2 min )
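As a small illustration of why higher-order integrators help on the gradient flow equation, the sketch below integrates $\dot \theta = -\nabla \mathcal L(\theta)$ for a one-dimensional quadratic loss $\mathcal L(\theta) = \frac{1}{2} a \theta^2$, whose exact solution is $\theta(t) = \theta_0 e^{-a t}$. It does not use the MLPGradientFlow package (which is written in Julia) and uses fixed step sizes rather than the adaptive schemes the abstract describes; the loss, step size, and function names are assumptions. Forward Euler here coincides with plain gradient descent at learning rate `h`.

```python
import math

A = 1.0  # curvature of the toy loss L(theta) = 0.5 * A * theta**2


def flow(theta):
    """Right-hand side of the gradient flow: minus the gradient of the loss."""
    return -A * theta


def euler_step(theta, h):
    """Forward Euler, identical to one gradient descent step with learning rate h."""
    return theta + h * flow(theta)


def rk4_step(theta, h):
    """Classical fourth-order Runge-Kutta step."""
    k1 = flow(theta)
    k2 = flow(theta + 0.5 * h * k1)
    k3 = flow(theta + 0.5 * h * k2)
    k4 = flow(theta + h * k3)
    return theta + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)


def integrate(step, theta0, h, n_steps):
    theta = theta0
    for _ in range(n_steps):
        theta = step(theta, h)
    return theta


theta0, h, n_steps = 1.0, 0.1, 10
exact = theta0 * math.exp(-A * h * n_steps)
euler_error = abs(integrate(euler_step, theta0, h, n_steps) - exact)
rk4_error = abs(integrate(rk4_step, theta0, h, n_steps) - exact)
```

With the same step size, the Runge-Kutta trajectory tracks the true flow far more closely than the Euler one, at the cost of four gradient evaluations per step.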
    Understanding and Improving Deep Graph Neural Networks: A Probabilistic Graphical Model Perspective. (arXiv:2301.10536v1 [cs.LG])
    Recently, graph-based models designed for downstream tasks have significantly advanced research on graph neural networks (GNNs). GNN baselines based on neural message-passing mechanisms such as GCN and GAT perform worse as the network deepens. Therefore, numerous GNN variants have been proposed to tackle this performance degradation problem, including many deep GNNs. However, a unified framework is still lacking to connect these existing models and interpret their effectiveness at a high level. In this work, we focus on deep GNNs and propose a novel view for understanding them. We establish a theoretical framework via inference on a probabilistic graphical model. Given the fixed point equation (FPE) derived from the variational inference on the Markov random fields, the deep GNNs, including JKNet, GCNII, DGCN, and the classical GNNs, such as GCN, GAT, and APPNP, can be regarded as different approximations of the FPE. Moreover, this framework yields more accurate approximations of the FPE, guiding us to design a more powerful GNN: coupling graph neural network (CoGNet). Extensive experiments are carried out on citation networks and natural language processing downstream tasks. The results demonstrate that CoGNet outperforms the SOTA models.  ( 2 min )
    DEJA VU: Continual Model Generalization For Unseen Domains. (arXiv:2301.10418v1 [cs.LG])
    In real-world applications, deep learning models often run in non-stationary environments where the target data distribution continually shifts over time. There have been numerous domain adaptation (DA) methods in both online and offline modes to improve cross-domain adaptation ability. However, these DA methods typically only provide good performance after a long period of adaptation, and perform poorly on new domains before and during adaptation - in what we call the "Unfamiliar Period", especially when domain shifts happen suddenly and significantly. On the other hand, domain generalization (DG) methods have been proposed to improve the model generalization ability on unadapted domains. However, existing DG works are ineffective for continually changing domains due to severe catastrophic forgetting of learned knowledge. To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains. RaTP includes a training-free data augmentation module to prepare data for TDG, a novel pseudo-labeling mechanism to provide reliable supervision for TDA, and a prototype contrastive alignment algorithm to align different domains for achieving TDG, TDA and FA. Extensive experiments on Digits, PACS, and DomainNet demonstrate that RaTP significantly outperforms state-of-the-art works from Continual DA, Source-Free DA, Test-Time/Online DA, Single DG, Multiple DG and Unified DA&DG in TDG, and achieves comparable TDA and FA capabilities.  ( 2 min )
    Off-Policy Evaluation for Action-Dependent Non-Stationary Environments. (arXiv:2301.10330v1 [cs.LG])
    Methods for sequential decision-making are often built upon a foundational assumption that the underlying decision process is stationary. This limits the application of such methods because real-world problems are often subject to changes due to external factors (passive non-stationarity), changes induced by interactions with the system itself (active non-stationarity), or both (hybrid non-stationarity). In this work, we take the first steps towards the fundamental challenge of on-policy and off-policy evaluation amidst structured changes due to active, passive, or hybrid non-stationarity. Towards this goal, we make a higher-order stationarity assumption such that non-stationarity results in changes over time, but the way changes happen is fixed. We propose OPEN, an algorithm that uses a double application of counterfactual reasoning and a novel importance-weighted instrument-variable regression to obtain both a lower-bias and a lower-variance estimate of the structure in the changes of a policy's past performances. Finally, we show promising results on how OPEN can be used to predict future performances for several domains inspired by real-world applications that exhibit non-stationarity.  ( 2 min )
    Exact Fractional Inference via Re-Parametrization & Interpolation between Tree-Re-Weighted- and Belief Propagation- Algorithms. (arXiv:2301.10369v1 [cs.LG])
    Inference efforts -- required to compute the partition function, $Z$, of an Ising model over a graph of $N$ "spins" -- are most likely exponential in $N$. Efficient variational methods, such as Belief Propagation (BP) and Tree Re-Weighted (TRW) algorithms, compute $Z$ approximately by minimizing the respective (BP- or TRW-) free energy. We generalize the variational scheme by building a $\lambda$-fractional-homotopy, $Z^{(\lambda)}$, where $\lambda=0$ and $\lambda=1$ correspond to TRW- and BP-approximations, respectively, and $Z^{(\lambda)}$ decreases with $\lambda$ monotonically. Moreover, this fractional scheme guarantees that in the attractive (ferromagnetic) case $Z^{(TRW)}\geq Z^{(\lambda)}\geq Z^{(BP)}$, and there exists a unique ("exact") $\lambda_*$ such that $Z=Z^{(\lambda_*)}$. Generalizing the re-parametrization approach of \cite{wainwright_tree-based_2002} and the loop series approach of \cite{chertkov_loop_2006}, we show how to express $Z$ as a product, $\forall \lambda:\ Z=Z^{(\lambda)}{\cal Z}^{(\lambda)}$, where the multiplicative correction, ${\cal Z}^{(\lambda)}$, is an expectation over a node-independent probability distribution built from node-wise fractional marginals. Our theoretical analysis is complemented by extensive experiments with models from Ising ensembles over planar and random graphs of medium and large sizes. The empirical study yields a number of interesting observations, such as (a) the ability to estimate ${\cal Z}^{(\lambda)}$ with $O(N^4)$ fractional samples; (b) suppression of $\lambda_*$ fluctuations with increase in $N$ for instances from a particular random Ising ensemble.
    On Batching Variable Size Inputs for Training End-to-End Speech Enhancement Systems. (arXiv:2301.10587v1 [cs.SD])
    The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strike a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size for a more consistent amount of data in each batch. However, the effect of these practices on resource utilization and, more importantly, network performance is not well documented. This paper is an empirical study of the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size allows for reduced training time and GPU memory usage while achieving similar performance compared to random batching with a fixed batch size.
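    The trade-off between zero-padding and randomization described above can be illustrated with a minimal sketch of bucket batching with a dynamic batch size. This is not the paper's implementation: the function name `bucket_batches`, the parameters `num_buckets` and `max_total`, and the rule of capping each batch's padded size (number of items times longest item) are all assumptions made for illustration.

```python
import random


def bucket_batches(lengths, num_buckets=4, max_total=16000, seed=0):
    """Group item indices into batches of similar length (hypothetical helper).

    lengths     -- duration of each training item, e.g. in samples
    num_buckets -- number of length-sorted buckets to shuffle within
    max_total   -- cap on padded batch size: len(batch) * longest item
    """
    if not lengths:
        return []
    rng = random.Random(seed)
    # Sort indices by length, then cut the sorted list into contiguous buckets.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    size = -(-len(order) // num_buckets)  # ceiling division
    buckets = [order[k:k + size] for k in range(0, len(order), size)]

    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)  # randomize only among items of similar length
        batch, longest = [], 0
        for i in bucket:
            new_longest = max(longest, lengths[i])
            # Dynamic batch size: start a new batch once padding would exceed the cap.
            if batch and new_longest * (len(batch) + 1) > max_total:
                batches.append(batch)
                batch, new_longest = [], lengths[i]
            batch.append(i)
            longest = new_longest
        if batch:
            batches.append(batch)
    rng.shuffle(batches)  # randomize the order in which batches are seen
    return batches


lengths = [100, 900, 120, 880, 110, 910, 130, 870]
batches = bucket_batches(lengths, num_buckets=2, max_total=2000)
```

    With one bucket the sketch behaves like random batching under a padding cap; with as many buckets as items it approaches fully sorted batching, which minimizes padding but removes within-batch randomization.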
    Audience-Centric Natural Language Generation via Style Infusion. (arXiv:2301.10283v1 [cs.CL])
    Adopting contextually appropriate, audience-tailored linguistic styles is critical to the success of user-centric language generation systems (e.g., chatbots, computer-aided writing, dialog systems). While existing approaches demonstrate textual style transfer with large volumes of parallel or non-parallel data, we argue that grounding style on audience-independent external factors is innately limiting for two reasons. First, it is difficult to collect large volumes of audience-specific stylistic data. Second, some stylistic objectives (e.g., persuasiveness, memorability, empathy) are hard to define without audience feedback. In this paper, we propose the novel task of style infusion - infusing the stylistic preferences of audiences in pretrained language generation models. Since humans are better at pairwise comparisons than direct scoring - i.e., is Sample-A more persuasive/polite/empathic than Sample-B - we leverage limited pairwise human judgments to bootstrap a style analysis model and augment our seed set of judgments. We then infuse the learned textual style in a GPT-2 based text generator while balancing fluency and style adoption. With quantitative and qualitative assessments, we show that our infusion approach can generate compelling stylized examples with generic text prompts. The code and data are accessible at https://github.com/CrowdDynamicsLab/StyleInfusion.  ( 2 min )
    NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification. (arXiv:2108.06158v4 [cs.LG] UPDATED)
    Gene-disease associations are fundamental for understanding disease etiology and developing effective interventions and treatments. Identifying genes not yet associated with a disease due to a lack of studies is a challenging task in which prioritization based on prior knowledge is an important element. The computational search for new candidate disease genes may be eased by positive-unlabeled learning, the machine learning setting in which only a subset of instances are labeled as positive while the rest of the data set is unlabeled. In this work, we propose a set of effective network-based features to be used in a novel Markov diffusion-based multi-class labeling strategy for putative disease gene discovery. The performances of the new labeling algorithm and the effectiveness of the proposed features have been tested on ten different disease data sets using three machine learning algorithms. The new features have been compared against classical topological and functional/ontological features and a set of network- and biological-derived features already used in gene discovery tasks. The predictive power of the integrated methodology in searching for new disease genes has been found to be competitive against state-of-the-art algorithms.
    Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding. (arXiv:2111.15363v2 [cs.CV] UPDATED)
    Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks (ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).
    Lightweight Neural Architecture Search for Temporal Convolutional Networks at the Edge. (arXiv:2301.10281v1 [cs.LG])
    Neural Architecture Search (NAS) is quickly becoming the go-to approach to optimize the structure of Deep Learning (DL) models for complex tasks such as Image Classification or Object Detection. However, many other relevant applications of DL, especially at the edge, are based on time-series processing and require models with unique features, for which NAS is less explored. This work focuses in particular on Temporal Convolutional Networks (TCNs), a convolutional model for time-series processing that has recently emerged as a promising alternative to more complex recurrent architectures. We propose the first NAS tool that explicitly targets the optimization of the most peculiar architectural parameters of TCNs, namely dilation, receptive field and number of features in each layer. The proposed approach searches for networks that offer good trade-offs between accuracy and number of parameters/operations, enabling an efficient deployment on embedded platforms. We test the proposed NAS on four real-world, edge-relevant tasks, involving audio and bio-signals. Results show that, starting from a single seed network, our method is capable of obtaining a rich collection of Pareto optimal architectures, among which we obtain models with the same accuracy as the seed, and 15.9-152x fewer parameters. Compared to three state-of-the-art NAS tools, ProxylessNAS, MorphNet and FBNetV2, our method explores a larger search space for TCNs (up to 10^12x) and obtains superior solutions, while requiring low GPU memory and search time. We deploy our NAS outputs on two distinct edge devices, the multicore GreenWaves Technology GAP8 IoT processor and the single-core STMicroelectronics STM32H7 microcontroller. With respect to the state-of-the-art hand-tuned models, we reduce latency and energy by up to 5.5x and 3.8x on the two targets respectively, without any accuracy loss.  ( 3 min )
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy when only half of the data is used. The objective function could be a short-term prediction or some other objective (for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of Sparse Kernel Flows in order to learn the "best" kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.  ( 2 min )
    Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute. (arXiv:2301.10448v1 [cs.CL])
    Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.  ( 2 min )
    One Model for All Domains: Collaborative Domain-Prefix Tuning for Cross-Domain NER. (arXiv:2301.10410v1 [cs.CL])
    Cross-domain NER is a challenging task to address the low-resource problem in practical scenarios. Previous typical solutions mainly obtain a NER model by training pre-trained language models (PLMs) with data from a rich-resource domain and adapting it to the target domain. Owing to the mismatch issue among entity types in different domains, previous approaches normally tune all parameters of PLMs, ending up with an entirely new NER model for each domain. Moreover, current models only focus on leveraging knowledge in one general source domain while failing to successfully transfer knowledge from multiple sources to the target. To address these issues, we introduce Collaborative Domain-Prefix Tuning for cross-domain NER (CP-NER) based on text-to-text generative PLMs. Specifically, we present text-to-text generation grounding domain-related instructors to transfer knowledge to new domain NER tasks without structural modifications. We utilize frozen PLMs and conduct collaborative domain-prefix tuning to stimulate the potential of PLMs to handle NER tasks across various domains. Experimental results on the Cross-NER benchmark show that the proposed approach has flexible transfer ability and performs better on both one-source and multiple-source cross-domain NER tasks. Code will be available at https://github.com/zjunlp/DeepKE/tree/main/example/ner/cross.  ( 2 min )
    Editing Language Model-based Knowledge Graph Embeddings. (arXiv:2301.10405v1 [cs.CL])
    Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, which are challenging to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings in this paper. The proposed task aims to enable data-efficient and fast updates to KG embeddings without damaging the performance of the rest. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle the proposed challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of the hyper network to edit/add facts. Comprehensive experimental results demonstrate that KGEditor can perform better when updating specific facts while not affecting the rest with low training resources. Code and datasets will be available at https://github.com/zjunlp/PromptKG/tree/main/deltaKG.  ( 2 min )
    Designing Data: Proactive Data Collection and Iteration for Machine Learning. (arXiv:2301.10319v1 [cs.HC])
    Lack of diversity in data collection has caused significant failures in machine learning (ML) applications. While ML developers perform post-collection interventions, these are time intensive and rarely comprehensive. Thus, new methods to track and manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability. We present designing data, an iterative, bias mitigating approach to data collection connecting HCI concepts with ML techniques. Our process includes (1) Pre-Collection Planning, to reflexively prompt and document expected data distributions; (2) Collection Monitoring, to systematically encourage sampling diversity; and (3) Data Familiarity, to identify samples that are unfamiliar to a model through Out-of-Distribution (OOD) methods. We instantiate designing data through our own data collection and applied ML case study. We find models trained on "designed" datasets generalize better across intersectional groups than those trained on similarly sized but less targeted datasets, and that data familiarity is effective for debugging datasets.  ( 2 min )
    Weakly Supervised Headline Dependency Parsing. (arXiv:2301.10371v1 [cs.CL])
    English news headlines form a register with unique syntactic properties that have been documented in linguistics literature since the 1930s. However, headlines have received surprisingly little attention from the NLP syntactic parsing community. We aim to bridge this gap by providing the first news headline corpus of Universal Dependencies annotated syntactic dependency trees, which enables us to evaluate existing state-of-the-art dependency parsers on news headlines. To improve English news headline parsing accuracies, we develop a projection method to bootstrap silver training data from unlabeled news headline-article lead sentence pairs. Models trained on silver headline parses demonstrate significant improvements in performance over models trained solely on gold-annotated long-form texts. Ultimately, we find that, although projected silver training data improves parser performance across different news outlets, the improvement is moderated by constructions idiosyncratic to outlet.  ( 2 min )
    Exact and rapid linear clustering of networks with dynamic programming. (arXiv:2301.10403v1 [cs.SI])
    We study the problem of clustering networks whose nodes have imputed or physical positions in a single dimension, such as prestige hierarchies or the similarity dimension of hyperbolic embeddings. Existing algorithms, such as the critical gap method and other greedy strategies, only offer approximate solutions. Here, we introduce a dynamic programming approach that returns provably optimal solutions in polynomial time -- O(n^2) steps -- for a broad class of clustering objectives. We demonstrate the algorithm through applications to synthetic and empirical networks, and show that it outperforms existing heuristics by a significant margin, with a similar execution time.  ( 2 min )
    Predicting mental health using social media: A roadmap for future development. (arXiv:2301.10453v1 [cs.IR])
    Mental disorders such as depression and suicidal ideation are hazardous, affecting more than 300 million people around the world. However, on social media, mental disorder symptoms can be observed, and automated approaches are increasingly capable of detecting them. The considerable number of social media users and the tremendous quantity of user-generated data on social platforms provide a unique opportunity for researchers to distinguish patterns that correlate with mental status. This research offers a roadmap for analysis, where mental state detection can be based on machine learning techniques. We describe the common approaches for predicting and identifying the disorder using user-generated content. This research is organized according to the data collection, feature extraction, and prediction algorithms. Furthermore, we review several recent studies conducted to explore different features of candidate profiles and their analytical methods. Next, we discuss various aspects of the development of experimental auto-detection frameworks for identifying users who suffer from disorders, and we conclude with a discussion of future trends. The introduced methods can help complement screening procedures, identify at-risk people through social media monitoring on a large scale, and make disorders easier to treat in the future.  ( 2 min )
    Parameterizing the cost function of Dynamic Time Warping with application to time series classification. (arXiv:2301.10350v1 [cs.LG])
    Dynamic Time Warping (DTW) is a popular time series distance measure that aligns the points in two series with one another. These alignments support warping of the time dimension to allow for processes that unfold at differing rates. The distance is the minimum sum of costs of the resulting alignments over any allowable warping of the time dimension. The cost of an alignment of two points is a function of the difference in the values of those points. The original cost function was the absolute value of this difference. Other cost functions have been proposed. A popular alternative is the square of the difference. However, to our knowledge, this is the first investigation of both the relative impacts of using different cost functions and the potential to tune cost functions to different tasks. We do so in this paper by using a tunable cost function $\lambda_\gamma$ with parameter $\gamma$. We show that higher values of $\gamma$ place greater weight on larger pairwise differences, while lower values place greater weight on smaller pairwise differences. We demonstrate that training $\gamma$ significantly improves the accuracy of both the DTW nearest neighbor and Proximity Forest classifiers.  ( 2 min )
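    The role of the pairwise cost in DTW can be illustrated with a minimal sketch. It assumes the tunable cost has the form |a - b|^gamma, which reduces to the absolute difference at gamma = 1 and the squared difference at gamma = 2; the abstract does not state the exact form, and the function name `dtw` and the absence of a warping window are choices made here for illustration.

```python
import math


def dtw(a, b, gamma=1.0):
    """DTW distance between sequences a and b with pairwise cost |x - y| ** gamma.

    Assumed cost form; no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1]) ** gamma
            # Extend the cheapest of the three allowed predecessor alignments.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# The same pair of series compared under two cost functions.
abs_cost = dtw([0, 0], [1, 3], gamma=1.0)
sq_cost = dtw([0, 0], [1, 3], gamma=2.0)
```

    Under this assumed form, raising gamma makes a single large pointwise difference dominate the total, while lowering it lets many small differences carry more of the weight, matching the qualitative behavior the abstract describes.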
    Data Consistent Deep Rigid MRI Motion Correction. (arXiv:2301.10365v1 [eess.IV])
    Motion artifacts are a pervasive problem in MRI, leading to misdiagnosis or mischaracterization in population-level imaging studies. Current retrospective rigid intra-slice motion correction techniques jointly optimize estimates of the image and the motion parameters. In this paper, we use a deep network to reduce the joint image-motion parameter search to a search over rigid motion parameters alone. Our network produces a reconstruction as a function of two inputs: corrupted k-space data and motion parameters. We train the network using simulated, motion-corrupted k-space data generated from known motion parameters. At test-time, we estimate unknown motion parameters by minimizing a data consistency loss between the motion parameters, the network-based image reconstruction given those parameters, and the acquired measurements. Intra-slice motion correction experiments on simulated and realistic 2D fast spin echo brain MRI achieve high reconstruction fidelity while retaining the benefits of explicit data consistency-based optimization. Our code is publicly available at https://www.github.com/nalinimsingh/neuroMoCo.  ( 2 min )
    Multilingual Multiaccented Multispeaker TTS with RADTTS. (arXiv:2301.10335v1 [cs.SD])
    We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfer capabilities. To overcome this, we present a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS with explicit control over accent, language, speaker and fine-grained $F_0$ and energy features. Our proposed model does not rely on bilingual training data. We demonstrate an ability to control synthesized accent for any speaker in an open-source dataset comprising 7 accents. Human subjective evaluation demonstrates that our model can better retain a speaker's voice and accent quality than controlled baselines while synthesizing fluent speech in all target languages and accents in our dataset.  ( 2 min )
    Generating Multidimensional Clusters With Support Lines. (arXiv:2301.10327v1 [cs.LG])
    Synthetic data is essential for assessing clustering techniques, complementing and extending real data, and allowing for a more complete coverage of a given problem's space. In turn, synthetic data generators have the potential of creating vast amounts of data -- a crucial activity when real-world data is at a premium -- while providing a well-understood generation procedure and an interpretable instrument for methodically investigating cluster analysis algorithms. Here, we present Clugen, a modular procedure for synthetic data generation, capable of creating multidimensional clusters supported by line segments using arbitrary distributions. Clugen is open source, 100% unit tested and fully documented, and is available for the Python, R, Julia and MATLAB/Octave ecosystems. We demonstrate that our proposal is able to produce rich and varied results in various dimensions, is fit for use in the assessment of clustering algorithms, and has the potential to be a widely used framework in diverse clustering-related research tasks.  ( 2 min )
    Score Matching via Differentiable Physics. (arXiv:2301.10250v1 [cs.LG])
    Diffusion models based on stochastic differential equations (SDEs) gradually perturb a data distribution $p(\mathbf{x})$ over time by adding noise to it. A neural network is trained to approximate the score $\nabla_\mathbf{x} \log p_t(\mathbf{x})$ at time $t$, which can be used to reverse the corruption process. In this paper, we focus on learning the score field that is associated with the time evolution according to a physics operator in the presence of natural non-deterministic physical processes like diffusion. A decisive difference to previous methods is that the SDE underlying our approach transforms the state of a physical system to another state at a later time. For that purpose, we replace the drift of the underlying SDE formulation with a differentiable simulator or a neural network approximation of the physics. We propose different training strategies based on the so-called probability flow ODE to fit a training set of simulation trajectories and discuss their relation to the score matching objective. For inference, we sample plausible trajectories that evolve towards a given end state using the reverse-time SDE and demonstrate the competitiveness of our approach for different challenging inverse problems.  ( 2 min )
    Interactive-Chain-Prompting: Ambiguity Resolution for Crosslingual Conditional Generation with Interaction. (arXiv:2301.10309v1 [cs.LG])
    Crosslingual conditional generation (e.g., machine translation) has long enjoyed the benefits of scaling. Nonetheless, there are still issues that scale alone may not overcome. A source query in one language, for instance, may yield several translation options in another language without any extra context. However, only one translation may be acceptable, depending on the translator's preferences and goals. Choosing the incorrect option might significantly affect translation usefulness and quality. We propose a novel method, interactive-chain prompting -- a series of question, answering and generation intermediate steps between a Translator model and a User model -- that reduces translations into a list of subproblems addressing ambiguities and then resolves such subproblems before producing the final text to be translated. To check ambiguity resolution capabilities and evaluate translation quality, we create a dataset exhibiting different linguistic phenomena which lead to ambiguities at inference for four languages. To encourage further exploration in this direction, we release all datasets. We note that interactive-chain prompting, using eight interactions as exemplars, consistently surpasses prompt-based methods with direct access to background information to resolve ambiguities.  ( 2 min )
    Towards Robust Metrics for Concept Representation Evaluation. (arXiv:2301.10367v1 [cs.LG])
    Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with plenty of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for pure concept representations.  ( 2 min )
    Language Model Detoxification in Dialogue with Contextualized Stance Control. (arXiv:2301.10368v1 [cs.CL])
    To reduce the toxic degeneration in a pretrained Language Model (LM), previous work on Language Model detoxification has focused on reducing the toxicity of the generation itself (self-toxicity) without consideration of the context. As a result, a type of implicit offensive language where the generations support the offensive language in the context is ignored. Different from the LM controlling tasks in previous work, where the desired attributes are fixed for generation, the desired stance of the generation depends on the offensiveness of the context. Therefore, we propose a novel control method to do context-dependent detoxification with the stance taken into consideration. We introduce meta prefixes to learn the contextualized stance control strategy and to generate the stance control prefix according to the input context. The generated stance prefix is then combined with the toxicity control prefix to guide the response generation. Experimental results show that our proposed method can effectively learn the context-dependent stance control strategies while keeping a low self-toxicity of the underlying LM.  ( 2 min )
    ClimaX: A foundation model for weather and climate. (arXiv:2301.10343v1 [cs.LG])
    Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling the atmospheric phenomenon at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a data-driven functional mapping using deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models. We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility. ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining. Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets.  ( 2 min )
    Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics For Advection-Dominated Systems. (arXiv:2301.10391v1 [cs.LG])
    We present a data-driven, space-time continuous framework to learn surrogate models for complex physical systems described by advection-dominated partial differential equations. Those systems have slow-decaying Kolmogorov n-width that hinders standard methods, including reduced order modeling, from producing high-fidelity simulations at low cost. In this work, we construct hypernetwork-based latent dynamical models directly on the parameter space of a compact representation network. We leverage the expressive power of the network and a specially designed consistency-inducing regularization to obtain latent trajectories that are both low-dimensional and smooth. These properties render our surrogate models highly efficient at inference time. We show the efficacy of our framework by learning models that generate accurate multi-step rollout predictions at much faster inference speed compared to competitors, for several challenging examples.  ( 2 min )
    Learned Interferometric Imaging for the SPIDER Instrument. (arXiv:2301.10260v1 [astro-ph.IM])
    The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER) is an optical interferometric imaging device that aims to offer an alternative to the large space telescope designs of today with reduced size, weight and power consumption. This is achieved through interferometric imaging. State-of-the-art methods for reconstructing images from interferometric measurements adopt proximal optimization techniques, which are computationally expensive and require handcrafted priors. In this work we present two data-driven approaches for reconstructing images from measurements made by the SPIDER instrument. These approaches use deep learning to learn prior information from training data, increasing the reconstruction quality, and significantly reducing the computation time required to recover images by orders of magnitude. Reconstruction time is reduced to ${\sim} 10$ milliseconds, opening up the possibility of real-time imaging with SPIDER for the first time. Furthermore, we show that these methods can also be applied in domains where training data is scarce, such as astronomical imaging, by leveraging transfer learning from domains where plenty of training data are available.  ( 2 min )
  • Open

    Two Efficient Ridge Solutions for the Incremental Broad Learning System on Added Inputs. (arXiv:1911.07292v5 [cs.LG] UPDATED)
    This paper proposes the recursive and square-root BLS algorithms to improve the original BLS for newly added inputs, which utilize the inverse and inverse Cholesky factor of the Hermitian matrix in the ridge inverse, respectively, to update the ridge solution. The recursive BLS updates the inverse by the matrix inversion lemma, while the square-root BLS updates the upper-triangular inverse Cholesky factor by multiplying it with an upper-triangular intermediate matrix. When the added p training samples are more than the total k nodes in the network, i.e., p>k, the inverse of a sum of matrices is applied to take a smaller matrix inversion or inverse Cholesky factorization. For the distributed BLS with data-parallelism, we introduce the parallel implementation of the square-root BLS, which is deduced from the parallel implementation of the inverse Cholesky factorization. The original BLS based on the generalized inverse with the ridge regression assumes the ridge parameter lambda->0 in the ridge inverse. When lambda->0 is not satisfied, the numerical experiments on the MNIST and NORB datasets show that both the proposed ridge solutions improve the testing accuracy of the original BLS, and the improvement becomes more significant as lambda grows. On the other hand, compared to the original BLS, both the proposed BLS algorithms have lower theoretical complexity, and are significantly faster in the simulations on the MNIST dataset. The speedups in total training time of the recursive and square-root BLS algorithms over the original BLS are 4.41 and 6.92 respectively when p > k, and are 2.80 and 1.59 respectively when p < k.
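    The recursive update via the matrix inversion lemma can be illustrated with a small sketch. This is not the paper's algorithm (which adds p samples at once and also covers a square-root variant); it is only the rank-one special case (Sherman-Morrison) for a plain ridge regression, and the names add_sample and ridge_weights are hypothetical.

```python
def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def add_sample(P, b, x, y):
    """Update P = (lam*I + X^T X)^{-1} and b = X^T y for one added sample (x, y).

    Uses the rank-one matrix inversion lemma (Sherman-Morrison);
    P is assumed symmetric, which this update preserves.
    """
    n = len(x)
    Px = mat_vec(P, x)
    denom = 1.0 + sum(xi * pxi for xi, pxi in zip(x, Px))
    P_new = [[P[i][j] - Px[i] * Px[j] / denom for j in range(n)] for i in range(n)]
    b_new = [bi + xi * y for bi, xi in zip(b, x)]
    return P_new, b_new

def ridge_weights(P, b):
    """Ridge solution w = (lam*I + X^T X)^{-1} X^T y."""
    return mat_vec(P, b)

# Start from an empty data set: P = (lam*I)^{-1}, b = 0 (illustrative values).
lam = 0.5
P = [[1.0 / lam, 0.0], [0.0, 1.0 / lam]]
b = [0.0, 0.0]
for x, y in [([1.0, 2.0], 1.0), ([3.0, 1.0], 2.0)]:
    P, b = add_sample(P, b, x, y)
w = ridge_weights(P, b)
```

    Each added sample costs O(n^2) here, instead of refactoring the full regularized matrix from scratch.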
    Certifying Neural Network Robustness to Random Input Noise from Samples. (arXiv:2010.07532v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial input uncertainty, but researchers have recently shown a need for methods that consider random uncertainty. In this paper, we propose a novel robustness certification method that upper bounds the probability of misclassification when the input noise follows an arbitrary probability distribution. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to replace the optimization constraints. The resulting optimization reduces to a linear program with an analytical solution. Furthermore, we develop a sufficient condition on the number of samples needed to make the misclassification bound hold with overwhelming probability. Our case studies on MNIST classifiers show that this method is able to certify a uniform infinity-norm uncertainty region with a radius of nearly 50 times larger than what the current state-of-the-art method can certify.
    Infinitesimal gradient boosting. (arXiv:2104.13208v2 [stat.ML] UPDATED)
    We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
    Batch Bayesian Optimization on Permutations using the Acquisition Weighted Kernel. (arXiv:2102.13382v2 [stat.ML] UPDATED)
    In this work we propose a batch Bayesian optimization method for combinatorial problems on permutations, which is well suited for expensive-to-evaluate objectives. We first introduce LAW, an efficient batch acquisition method based on determinantal point processes using the acquisition weighted kernel. Relying on multiple parallel evaluations, LAW enables accelerated search on combinatorial spaces. We then apply the framework to permutation problems, which have so far received little attention in the Bayesian Optimization literature, despite their practical importance. We call this method LAW2ORDER. On the theoretical front, we prove that LAW2ORDER has vanishing simple regret by showing that the batch cumulative regret is sublinear. Empirically, we assess the method on several standard combinatorial problems involving permutations such as quadratic assignment, flowshop scheduling and the traveling salesman, as well as on a structure learning task.
    Data-Driven Certification of Neural Networks with Random Input Noise. (arXiv:2010.01171v2 [cs.LG] UPDATED)
    Methods to certify the robustness of neural networks in the presence of input uncertainty are vital in safety-critical settings. Most certification methods in the literature are designed for adversarial or worst-case inputs, but researchers have recently shown a need for methods that consider random input noise. In this paper, we examine the setting where inputs are subject to random noise coming from an arbitrary probability distribution. We propose a robustness certification method that lower-bounds the probability that network outputs are safe. This bound is cast as a chance-constrained optimization problem, which is then reformulated using input-output samples to make the optimization constraints tractable. We develop sufficient conditions for the resulting optimization to be convex, as well as on the number of samples needed to make the robustness bound hold with overwhelming probability. We show for a special case that the proposed optimization reduces to an intuitive closed-form solution. Case studies on synthetic, MNIST, and CIFAR-10 networks experimentally demonstrate that this method is able to certify robustness against various input noise regimes over larger uncertainty regions than prior state-of-the-art techniques.
    Non-Asymptotic Analysis of a UCB-based Top Two Algorithm. (arXiv:2210.05431v2 [stat.ML] UPDATED)
    A Top Two sampling rule for bandit identification is a method which selects the next arm to sample from among two candidate arms, a leader and a challenger. Due to their simplicity and good empirical performance, they have received increased attention in recent years. However, for fixed-confidence best arm identification, theoretical guarantees for Top Two methods have only been obtained in the asymptotic regime, when the error level vanishes. In this paper, we derive the first non-asymptotic upper bound on the expected sample complexity of a Top Two algorithm, which holds for any error level. Our analysis highlights sufficient properties for a regret minimization algorithm to be used as leader. These properties are satisfied by the UCB algorithm, and our proposed UCB-based Top Two algorithm simultaneously enjoys non-asymptotic guarantees and competitive empirical performance.
    RDIS: Random Drop Imputation with Self-Training for Incomplete Time Series Data. (arXiv:2010.10075v2 [cs.LG] UPDATED)
    Time-series data with missing values are commonly encountered in many fields, such as healthcare, meteorology, and robotics. Imputation aims to fill the missing values with valid values. Most imputation methods train the models implicitly because missing values have no ground truth. In this paper, we propose Random Drop Imputation with Self-training (RDIS), a novel training method for time-series data imputation models. In RDIS, we generate extra missing values by applying a random drop on the observed values in incomplete data. We can explicitly train the imputation models by filling in the randomly dropped values. In addition, we adopt self-training with pseudo values to exploit the original missing values. To improve the quality of pseudo values, we set a threshold and filter them by calculating the entropy. To verify the effectiveness of RDIS on time series imputation, we apply RDIS to various imputation models and achieve competitive results on two real-world datasets.
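    The random-drop step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name random_drop, the drop_rate parameter and the list-based mask layout are all hypothetical.

```python
import random

def random_drop(values, observed_mask, drop_rate, seed=0):
    """Hide a random fraction of the observed entries.

    Returns a new mask with the dropped entries marked as missing, plus a
    dict {index: true value} that can serve as explicit training targets.
    """
    rng = random.Random(seed)
    observed_idx = [i for i, seen in enumerate(observed_mask) if seen]
    n_drop = int(len(observed_idx) * drop_rate)
    dropped = set(rng.sample(observed_idx, n_drop))
    new_mask = [seen and i not in dropped for i, seen in enumerate(observed_mask)]
    targets = {i: values[i] for i in dropped}
    return new_mask, targets
```

    An imputation model would then be fed the data under new_mask and penalized on its error at the indices in targets, which have known ground truth.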
    Posterior Covariance Information Criterion for Weighted Inference. (arXiv:2106.13694v4 [stat.ME] UPDATED)
    For predictive evaluation based on quasi-posterior distributions, we develop a new information criterion, the posterior covariance information criterion (PCIC). PCIC generalises the widely applicable information criterion WAIC so as to effectively handle predictive scenarios where likelihoods for the estimation and the evaluation of the model may be different. A typical example of such scenarios is the weighted likelihood inference, including prediction under covariate shift and counterfactual prediction. The proposed criterion utilises a posterior covariance form and is computed by using only one Markov chain Monte Carlo run. Through numerical examples, we demonstrate how PCIC can apply in practice. Further, we show that PCIC is asymptotically unbiased to the quasi-Bayesian generalization error under mild conditions in weighted inference with both regular and singular statistical models.
    Meta-Learning PAC-Bayes Priors in Model Averaging. (arXiv:1912.11252v3 [cs.LG] UPDATED)
    Nowadays model uncertainty has become one of the most important problems in both academia and industry. In this paper, we mainly consider the scenario in which a common model set is used for model averaging, instead of selecting a single final model via a model selection procedure, to account for model uncertainty and improve the reliability and accuracy of inferences. Here one main challenge is to learn the prior over the model set. To tackle this problem, we propose two data-based algorithms to get proper priors for model averaging. One is for the meta-learner: analysts use historical similar tasks to extract information about the prior. The other is for the base-learner: a subsampling method is used to process the data step by step. Theoretically, an upper bound on the risk of our algorithm is presented to guarantee worst-case performance. In practice, both methods perform well in simulations and real data studies, especially with poor-quality data.
    Truthful Self-Play. (arXiv:2106.03007v4 [stat.ML] UPDATED)
    We present a general optimization framework for emergent belief-state representation without any supervision. We employ the common configuration of multiagent reinforcement learning and communication to improve exploration coverage over an environment by leveraging the knowledge of each agent. In this paper, we find that recurrent neural nets (RNNs) with shared weights are highly biased in partially observable environments because of their noncooperativity. To address this, we design an unbiased version of self-play via mechanism design, also known as reverse game theory, to clarify unbiased knowledge at the Bayesian Nash equilibrium. The key idea is to add imaginary rewards using the peer prediction mechanism, i.e., a mechanism for mutually criticizing information in a decentralized environment. Numerical analyses, including StarCraft exploration tasks with up to 20 agents and off-the-shelf RNNs, demonstrate state-of-the-art performance.
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v2 [cs.LG] UPDATED)
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case to learn the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantees; we instead focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data) is initialized within the neighborhood of global convergence for the population EM that consists solely of samples that have not been labeled. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    Imitating Human Behaviour with Diffusion Models. (arXiv:2301.10677v1 [cs.AI])
    Diffusion models have emerged as powerful generative models in the text-to-image domain. This paper studies their application as observation-to-action models for imitating human behaviour in sequential environments. Human behaviour is stochastic and multimodal, with structured correlations between action dimensions. Meanwhile, standard modelling choices in behaviour cloning are limited in their expressiveness and may introduce bias into the cloned policy. We begin by pointing out the limitations of these choices. We then propose that diffusion models are an excellent fit for imitating human behaviour, since they learn an expressive distribution over the joint action space. We introduce several innovations to make diffusion models suitable for sequential environments; designing suitable architectures, investigating the role of guidance, and developing reliable sampling strategies. Experimentally, diffusion models closely match human demonstrations in a simulated robotic control task and a modern 3D gaming environment.
    Signature Methods in Machine Learning. (arXiv:2206.14674v3 [stat.ML] UPDATED)
    Signature-based techniques give mathematical insight into the interactions between complex streams of evolving data. These insights can be quite naturally translated into numerical approaches to understanding streamed data, and perhaps because of their mathematical precision, have proved useful in analysing streamed data in situations where the data is irregular, and not stationary, and the dimension of the data and the sample sizes are both moderate. Understanding streamed multi-modal data is exponential: a word in $n$ letters from an alphabet of size $d$ can be any one of $d^n$ messages. Signatures remove the exponential amount of noise that arises from sampling irregularity, but an exponential amount of information still remains. This survey aims to stay in the domain where that exponential scaling can be managed directly. Scalability issues are an important challenge in many problems but would require another survey article and further ideas. This survey describes a range of contexts where the data sets are small enough to remove the possibility of massive machine learning, and the existence of small sets of context free and principled features can be used effectively. The mathematical nature of the tools can make their use intimidating to non-mathematicians. The examples presented in this article are intended to bridge this communication gap and provide tractable working examples drawn from the machine learning context. Notebooks are available online for several of these examples. This survey builds on the earlier paper of Ilya Chevryev and Andrey Kormilitzin which had broadly similar aims at an earlier point in the development of this machinery. This article illustrates how the theoretical insights offered by signatures are simply realised in the analysis of application data in a way that is largely agnostic to the data type.
    A Unified and Constructive Framework for the Universality of Neural Networks. (arXiv:2112.14877v3 [cs.LG] UPDATED)
    One of the reasons why many neural networks are capable of replicating complicated tasks or functions is their universal property. Though the past few decades have seen tremendous advances in theories of neural networks, a single constructive framework for neural network universality remains unavailable. This paper is the first effort to provide a unified and constructive framework for the universality of a large class of activation functions including most existing ones. At the heart of the framework is the concept of neural network approximate identity (nAI). The main result is: any nAI activation function is universal. It turns out that most existing activation functions are nAI, and thus universal in the space of continuous functions on compacta. The framework offers several advantages over its contemporary counterparts. First, it is constructive with elementary means from functional analysis, probability theory, and numerical analysis. Second, it is the first unified attempt that is valid for most existing activation functions. Third, as a by-product, the framework provides the first universality proof for some of the existing activation functions including Mish, SiLU, ELU, GELU, etc. Fourth, it provides new proofs for most activation functions. Fifth, it discovers new activation functions with guaranteed universality property. Sixth, for a given activation and error tolerance, the framework provides precisely the architecture of the corresponding one-hidden-layer neural network with predetermined number of neurons, and the values of weights/biases. Seventh, the framework allows us to abstractly present the first universal approximation with favorable non-asymptotic rate.
    Semiparametric discrete data regression with Monte Carlo inference and prediction. (arXiv:2110.12316v5 [stat.ME] UPDATED)
    Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over- or under-dispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximations such as MCMC typically are needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data with Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. This novel approach is applied to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.
    Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part V: Sparse Kernel Flows for 132 Chaotic Dynamical Systems. (arXiv:2301.10321v1 [stat.ML])
    Regressing the vector field of a dynamical system from a finite number of observed states is a natural way to learn surrogate models for such systems. A simple and interpretable way to learn a dynamical system from data is to interpolate its vector-field with a data-adapted kernel which can be learned by using Kernel Flows. The method of Kernel Flows is a trainable machine learning method that learns the optimal parameters of a kernel based on the premise that a kernel is good if there is no significant loss in accuracy if half of the data is used. The objective function could be a short-term prediction (or some other objective for other variants of Kernel Flows). However, this method is limited by the choice of the base kernel. In this paper, we introduce the method of Sparse Kernel Flows in order to learn the "best" kernel by starting from a large dictionary of kernels. It is based on sparsifying a kernel that is a linear combination of elemental kernels. We apply this approach to a library of 132 chaotic systems.
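    The "half of the data" premise is often written as a loss rho = 1 - ||u_half||^2 / ||u_full||^2, where ||u||^2 = y^T K^{-1} y is the squared kernel norm of the interpolant. The sketch below assumes that form, a one-dimensional Gaussian kernel and a small nugget for numerical stability; none of these choices is confirmed by the abstract, and all function names are hypothetical.

```python
import math

def gaussian_kernel(a, b, lengthscale):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * lengthscale ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def interpolant_norm_sq(xs, ys, lengthscale, nugget=1e-8):
    """Squared kernel norm y^T (K + nugget*I)^{-1} y of the interpolant."""
    K = [[gaussian_kernel(a, b, lengthscale) + (nugget if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    alpha = solve(K, ys)
    return sum(a * y for a, y in zip(alpha, ys))

def kernel_flow_loss(xs, ys, half_idx, lengthscale):
    """rho near 0 means dropping the other half of the data changes little."""
    full = interpolant_norm_sq(xs, ys, lengthscale)
    half = interpolant_norm_sq([xs[i] for i in half_idx],
                               [ys[i] for i in half_idx], lengthscale)
    return 1.0 - half / full
```

    In a training loop one would resample half_idx at random and adjust the kernel parameters (here only lengthscale) to decrease rho.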

  • Open

    Insane face rendering A.I technology..
    submitted by /u/KTMark [link] [comments]  ( 40 min )
    AI Music - Eminem - 'Slim shady is alive'
    submitted by /u/DANGERD0OM [link] [comments]  ( 40 min )
    Finding the right AI for a specific task
    Hi all, We're developing an internal application that groups customers together based on attributes that adhere to a ruleset on how they should be grouped. It does this fine. However, some nuance is then applied via human effort to modify groupings based on some customer notes (a text string) that sometimes dictate that two customers need to be in different groups for x reason, even if the original grouping adheres to the ruleset. The application itself has a UI that sorts customers into columns, which are manipulated by staff via dragging and dropping a customer/customers between one column and another. I had a thought to employ an AI model that compares the original generated grouping config that our code produces against the modified groupings that staff adjust based on that nuance. The idea is that we could analyze the why of a modification and use that insight to generate better default groupings. Is there a model out there that would be ideally suited for this kind of learning? Keen to dive into it further on my own but any recommendations as a starting point would be great. submitted by /u/premiumnougat [link] [comments]  ( 41 min )
    Create Your Chat GPT-3 Web App with Streamlit in Python
    submitted by /u/pasticciociccio [link] [comments]  ( 40 min )
    What do employers and job seekers need to know about artificial intelligence's role in hiring?
    University of Florida - Warrington College of Business's Mo Wang offers advice for the future of work. Full Story: https://explore.research.ufl.edu/the-future-of-work.html#ai-hiring submitted by /u/ufexplore [link] [comments]  ( 40 min )
    BuzzFeed to Use ChatGPT Creator OpenAI to Help Create Quizzes and Other Content
    submitted by /u/trueslicky [link] [comments]  ( 40 min )
    Member of Congress Reads AI-Generated Speech on House Floor
    submitted by /u/dahmedahe [link] [comments]  ( 6 min )
    Synthesizing the Businessmen-Smile
    submitted by /u/walt74 [link] [comments]  ( 45 min )
    AI Dream 150 - ENTERING DREAMWORLD Part2 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    NIST Risk Management Framework Aims to Improve Trustworthiness of Artificial Intelligence
    submitted by /u/Harley109 [link] [comments]  ( 40 min )
    📌[Searchcolab] It's impressive to see how far generative AI has come in the past 5 years. What should we expect the trajectory of this field to be in the next 5 years? Btw the pictures attached are also generated using AI.
    submitted by /u/Maleficent_Suit1591 [link] [comments]  ( 41 min )
    "Father" from Equilibrium movie
    Imagine we could create an AGI/ASI that will protect our values. Something like "Father" from Equilibrium (but that's a dystopian version). Let's call it modern God. What values should it protect? submitted by /u/chuguruk [link] [comments]  ( 40 min )
    InstructPix2Pix lets you edit images using only text prompts
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    What are your favorite AI subreddits?
    submitted by /u/Chaserivx [link] [comments]  ( 40 min )
    AI "Upscale" With Only 1000 Training Examples(All examples were dogs)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 43 min )
    Proud Pollution A movie script Written by AI
    Movie script: Proud Pollution The website used- (https://www.plot-generator.org.uk/) Feel free to share your thoughts on it! Proud Pollution A Screenplay by Mr. Pseudonym EXT. VASQUEZ ROCKS, CALIFORNIA - AFTERNOON Misunderstood pilot FLAMOUS JACK THORNTON is arguing with mean scout MISS HELEN FISH. JACK tries to hug HELEN but she shakes him off. JACK Please, Helen, don't leave me. HELEN I'm sorry Jack, but I'm looking for somebody a bit braver. Somebody who faces his fears head on, instead of running away. JACK I am such a person! HELEN frowns. HELEN I'm sorry, Jack. I just don't feel excited by this relationship anymore. HELEN leaves. JACK sits down, looking defeated. Moments later, noble navigator MASTER CUTHBERT MACDONALD barges in looking flustered. JACK Go…  ( 47 min )
    Meta's chief AI scientist says "ChatGPT is not innovative".
    What happened? So, there has been a lot of excitement around OpenAI's ChatGPT which generates natural-language responses to human prompts. But what if... it's not as amazing as we all think it is? Yann LeCun, Meta's chief AI scientist, argues that the program is not innovative. He also states that similar technology has been developed by many companies and research labs, and that ChatGPT is composed of multiple pieces of technology developed over many years by many parties. (sounds salty to me... and I like my cookies sweet!) But maybe Yann has a point! What's happening now? ChatGPT is perceived by many as a unique and innovative program. People are using it every day to make their lives easier. So, no matter what, ChatGPT is still awesome. What's happening next? It's unclear what will happen next in terms of the development and perception of ChatGPT. However, it can be expected that as AI technology continues to evolve, there will be further advancements in what is possible with this tech. It's likely that ChatGPT will face many spin-offs and competitors this year. If you enjoyed this and want 500+ AI tools, I write a daily AI newsletter: https://chriscookies.beehiiv.com/p/metas-chief-ai-scientist-says-chatgpt-not-innovative-7581 submitted by /u/ZaKodiak [link] [comments]  ( 41 min )
    Looking for some ideas to research about making money using AI!
    Hey yall, I'm looking for ideas of how to make money using language models like ChatGPT. I want to go in depth and research a bit as well as begin designing tutorials based on my experiences of what's the best way to make money. I am open to any suggestions or things that have worked for you all! Thanks! submitted by /u/Chadcash [link] [comments]  ( 41 min )
    What is Atomic AI? - Is AI going to have a Drug development breakthrough soon?
    submitted by /u/BackgroundResult [link] [comments]  ( 40 min )
    AI Video to Fill Missing Frames/Smooth Animation?
    Hey all, Was wondering if you know some kind of AI tool that exists to fill missing frames and therefore smooth animation in animated videos? Trying to get something cleaned up and really nailed in. Thanks! submitted by /u/miseryleech [link] [comments]  ( 40 min )
    ChatGPT: OpenAI’s Last Resort Turns Out To Be A Winner
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    41 AI Written Articles Out Of 77 On CNET Have Plagiarism And Errors
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 6 min )
    Chrome Extension that uses AI to write emails.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I told an AI to freak out on camera. it was ALL made by AI.
    submitted by /u/25dopren [link] [comments]  ( 40 min )
  • Open

    [D] score based vs. Diffusion models
    I know there is a mathematical way to show that the two approaches, score-matching models and diffusion models, are equivalent. I wonder whether they are also the same in practice, i.e. in code. I already tried to find some PyTorch implementations of score-based models but haven't found any yet, only ones for diffusion models. submitted by /u/Individual-Cause-616 [link] [comments]  ( 43 min )
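    For context on the equivalence mentioned in the post: in the usual DDPM parameterization, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, the score of q(x_t | x_0) is -eps / sqrt(1 - alpha_bar_t), so a noise-prediction network can be read as a rescaled score model. A minimal sketch, assuming that parameterization (the function name score_from_eps is hypothetical and plain lists stand in for tensors):

```python
import math

def score_from_eps(eps_pred, alpha_bar_t):
    """Convert a predicted noise vector into a score estimate.

    Assumes x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    for which grad_x log q(x_t | x_0) = -eps / sqrt(1 - alpha_bar_t).
    """
    sigma_t = math.sqrt(1.0 - alpha_bar_t)
    return [-e / sigma_t for e in eps_pred]
```

    Under this assumption the two model families differ in code mainly by this scaling and by how the noise schedule and loss weighting are specified.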
    [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation?
    I guess what I'm trying to figure out is, what are the main reasons that DMs are outperforming GANs in text2img generation? Thanks! submitted by /u/TheCockatoo [link] [comments]  ( 46 min )
    [P] A python module to generate optimized prompts & solve different NLP problems using GPT-n based models and return structured python object for easy parsing
    Hi folks, I was working on a personal experimental project related to GPT-3, which I decided to make open source. It saves a lot of time when working with LLMs. If you are an industrial researcher or application developer, you have probably worked with the GPT-3 APIs. A common challenge when utilizing LLMs such as GPT-3 and BLOOM is their tendency to produce uncontrollable & unstructured outputs, making it difficult to use them for various NLP tasks and applications. To address this, we developed Promptify, a library that allows for the use of LLMs to solve NLP problems including Named Entity Recognition, Binary Classification, Multi-Label Classification, and Question-Answering and return a python object for easy parsing to construct additional applications on top of GPT-n based models. Features 🚀 🧙‍♀️ NLP Tasks (NER, Binary Text Classification, Multi-Label Classification etc) in 2 lines of code with no training data required 🔨 Easily add one shot, two shot, or few shot examples to the prompt ✌ Output always provided as a Python object (e.g. list, dictionary) for easy parsing and filtering 💥 Custom examples and samples can be easily added to the prompt 💰 Optimized prompts to reduce OpenAI token costs GITHUB: https://github.com/promptslab/Promptify Examples: https://github.com/promptslab/Promptify/tree/main/examples For quick demo -> Colab I hope it will be helpful in your research. Thanks :) NER example https://preview.redd.it/vnz4mf0i6gea1.png?width=1398&format=png&auto=webp&s=74c70bd9d518423f913c1fb9c68cf2565cf8cffc submitted by /u/aadityaura [link] [comments]  ( 43 min )
    A Watermark for Large Language Models
    submitted by /u/lookinsidemybutthole [link] [comments]  ( 42 min )
    [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
    Dec 2022 paper from Microsoft research: https://arxiv.org/abs/2212.10559v2 Large pretrained language models have shown surprising In-Context Learning (ICL) ability. With a few demonstration input-label pairs, they can predict the label for an unseen input without additional parameter updates. Despite the great success in performance, the working mechanism of ICL still remains an open problem. In order to better understand how ICL works, this paper explains language models as meta-optimizers and understands ICL as a kind of implicit finetuning. submitted by /u/currentscurrents [link] [comments]  ( 43 min )
    [Discussion] Github like alternative for ML?
    Versioning and collaboration on code for software engineers is a reasonably solved problem through GitHub, since the task at hand predominantly involves maintaining different copies of simple vanilla code in different folders. On the other hand, ML engineers face the humongous task of maintaining different versions of not just code, but also hyperparameters, data, models, data lineage, and labels, and storing these on GitHub currently does not allow you to track the changes to each variable well. What are the software/open source tools currently used for this? Is there space for a new company to be built here? submitted by /u/angkhandelwal749 [link] [comments]  ( 44 min )
    Are there any projects working at an open source version of Constitutional AI? [D]
    I'm looking into projects which augment the RLHF training approach of chatGPT with explicit rules, such as in https://paperswithcode.com/paper/constitutional-ai-harmlessness-from-ai. Ideally there would be both rules and priority levels between the rules, similarly to the Asimov laws of robotics. The Open-Assistant project (https://github.com/LAION-AI/Open-Assistant) captures the spirit, but it is looking to replicate chatGPT at the moment. submitted by /u/lorepieri [link] [comments]  ( 42 min )
    [D] Quantitative measure for smoothness of NLP autoencoder latent space
    I would like to measure the smoothness of an NLP-autoencoder's latent space. The idea is to sample two Gaussian vectors v1 and v2 in the latent space of the AE, and generate N-1 points between them like so: vi = v1 + (v2 - v1) / (N * i) My idea is to then decode these vectors and measure the BLEU score between d(vi) and d(vi+1) for all N-2 comparisons. Is this idea reasonable, do you have a better one? Is there a technique from AEs with images that can be useful here? submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
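The interpolation idea in the post above can be sketched as follows. This is a minimal illustration, not the poster's code: it assumes standard linear interpolation, v_i = v1 + (i / N) * (v2 - v1), rather than the formula exactly as written in the post, and `token_overlap` is a hypothetical stand-in for BLEU so the example stays dependency-free. The `decode` argument is whatever decoder the autoencoder provides.

```python
from typing import Callable, List, Sequence


def interpolate(v1: Sequence[float], v2: Sequence[float], n: int) -> List[List[float]]:
    """Return the n-1 interior points on the straight line from v1 to v2."""
    return [
        [a + (b - a) * i / n for a, b in zip(v1, v2)]
        for i in range(1, n)
    ]


def token_overlap(a: List[str], b: List[str]) -> float:
    """Hypothetical stand-in for BLEU: Jaccard overlap of the token sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def smoothness(
    points: List[List[float]],
    decode: Callable[[List[float]], List[str]],
    similarity: Callable[[List[str], List[str]], float] = token_overlap,
) -> float:
    """Mean similarity between decodings of consecutive latent points.

    With n-1 points there are n-2 comparisons, so at least two points are needed.
    """
    decoded = [decode(p) for p in points]
    scores = [similarity(x, y) for x, y in zip(decoded, decoded[1:])]
    return sum(scores) / len(scores)
```

A higher mean score suggests that neighbouring latent points decode to similar text. Swapping `token_overlap` for a real BLEU implementation only requires passing a different `similarity` callable.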
    [D] What are some of your favorite ML research posters?
    And what are your own best practices when creating one (e.g. adding a QR code that links to the GitHub project or paper PDF)? submitted by /u/epistoteles [link] [comments]  ( 42 min )
    [D] Fastest and most accurate model for casing
    What is the state of the art regarding freely available casing models, i.e. DNNs, that try to restore the original casing of a text with uniform (either lowercase or capital letters) casing? I value both speed and accuracy, as I have to process a large corpus of text. submitted by /u/Blutorangensaft [link] [comments]  ( 43 min )
    Few questions about scalability of chatGPT [D]
    I have two questions about chatGPT. I don't come from a machine learning background. I am just a programmer. So bear with me if they sound a bit dumb. I was checking about chatGPT a bit the last week. I went through their papers and also tried out a fine tuning by myself by creating some fictional world and giving it some examples. The first thing I wondered is what is very special about the model than the large data and parameter set it has, that other competitors can't do. I ask this because I have seen a lot of "google killer" discussions in some places. From what I understood from their papers I thought it is something another company with the computing power and the filtered data can have up and running in few months. I see their advantage in rolling out to the public because with feedbacks from actual users all over the world it can potentially be retrained. The second thing I wondered is its scalability. It feels to me that it is a very big challenge to keep it scalable in the future. Currently getting a long text out of it is kind of painful because it has to continuously generate. I think it is continuously calculating with the huge parameter set it has. I wonder also about new trends, if it needs to be retrained. I also used it for a fine tuning, where I created a fictional world with its own law and rules and the fine tuning took hours in the queue - so is it creating separate parameters for my case? that would be a lot considering how much parameter set they have. submitted by /u/besabestin [link] [comments]  ( 50 min )
    [P] EvoTorch 0.4.0 dropped with GPU-accelerated implementations of CMA-ES, MAP-Elites and NSGA-II.
    Find the release notes here: https://github.com/nnaisense/evotorch/releases/tag/v0.4.0 A big highlight is how fast these implementations are! I genuinely believe GPU-acceleration is the future of Evolutionary algorithms, and EvoTorch and its integration into the PyTorch ecosystem is a fantastic enabler for this. To demonstrate the raw speed provided by the new release, I compared EvoTorch's CMA-ES implementation to that provided by the popular pycma package on the 80-dimensional Rastrigin problem and tracked the run-time. [Figure: performance measured over 50 runs on the 80-dimensional Rastrigin problem.] The crazy thing to note is that when we switch to GPU (Tesla V100), we can efficiently run CMA-ES with population sizes going into 100k+! submitted by /u/NaturalGradient [link] [comments]  ( 45 min )
    Machine learning and black box numerical solver[D]
    Anybody know some methods and techniques for integrating a numerical solver with the neural network .. how do you calculate the gradients of the solver when you don’t know the details of such solver- black box solver. submitted by /u/Due-Wall-915 [link] [comments]  ( 43 min )
    [P] Diffusion models best practices
    I'm about to start an experimental project that involves training a denoising diffusion model on the medical data (small dataset). Could you please share useful resources, tips, tricks and heuristics for dealing with diffusion models? submitted by /u/debrises [link] [comments]  ( 42 min )
  • Open

    "Cheaters Hacked an AI Bot—and Beat the 'Rocket League' Elite"
    submitted by /u/gwern [link] [comments]  ( 40 min )
    Insights and learnings
    Hey all, I am part of an incubator and interested in building developer tooling for reinforcement learning. I would love to understand, from the RL community, what some of the biggest pain points are in developing and productionising RL agents. Would love to hear about your implementations too, if you are happy to share! submitted by /u/paramkumar1992 [link] [comments]  ( 41 min )
    Are there papers that do an empirical investigation on DRL hyperparameters?
    Could someone please help with this - https://ai.stackexchange.com/questions/38894/are-there-papers-that-do-an-empirical-investigation-on-drl-hyperparameters submitted by /u/Academic-Rent7800 [link] [comments]  ( 41 min )
    DQN application
    I want to train a DQN model in an off-policy fashion, where my behavior policy is an older agent. I have a big memory of a lot of episodes of this agent. Now I want to find a better policy using DQN. Now I am just wondering, in the "normal" DQN case you would use the experience replay buffer and would update behavior and target policy online (behavior not really online but with the time lag introduced after which these parameters are also updated). In my case, I already have all the experience and would like to learn from it. Do you think it makes sense to use the exact same procedure in this context, so sampling one new action, state and immediate reward, follow up action or could it be better here to use the fact that all experience is already stored to exchange the immediate reward + gamma*maxQ(s',a') with some more future information about the rewards (up to the point of Monte Carlo where you take G_t so the discounted cumulative reward seen during the episode from point t onwards)? submitted by /u/PatrickSVM [link] [comments]  ( 42 min )
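The targets contrasted in the question above can be sketched side by side. This is an illustrative sketch only: the function names are hypothetical, and `bootstrap_q` is assumed to be max over actions of Q(s_{t+n}, a) as supplied by the caller's target network. Note that multi-step and Monte Carlo returns computed from stored episodes reflect the actions the older agent actually took afterwards, so they are not, without correction, an unbiased estimate of the value of the greedy policy being learned.

```python
from typing import List


def one_step_target(reward: float, next_q_values: List[float], gamma: float, done: bool) -> float:
    """Standard DQN target: r + gamma * max_a' Q(s', a'), or just r at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t for every step t of one stored episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def n_step_target(rewards: List[float], t: int, n: int, gamma: float, bootstrap_q: float) -> float:
    """Sum up to n discounted rewards from step t, then bootstrap if the episode continues."""
    end = min(t + n, len(rewards))
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < len(rewards):
        # Episode has not ended within n steps: add the discounted bootstrap value.
        g += gamma ** (end - t) * bootstrap_q
    return g
```

Setting `n = 1` recovers the one-step target, while an `n` at least as large as the remaining episode length recovers the Monte Carlo return, so `n` interpolates between the two options described in the post.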
    "Imitating Human Behaviour with Diffusion Models", Pearce et al 2023 {MS}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning", Wang et al 2022 {Twitter}
    submitted by /u/gwern [link] [comments]  ( 40 min )
    "Learning with Queried Hints" [on "Online Learning and Bandits with Queried Hints", Bhaskara et al 2022 {G}]
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    Why can't my neural network recognize my own digits, but it has 97% accuracy on mnist test samples?
    I've been learning about neural networks for the past few days and I decided to write my own in Python. To keep it simple, I didn't write code for convolution layers and such, only fully connected layers with logistic activation function. I first trained it to do XOR with a 2 -> 2 -> 1 layers layout and it worked. Then I tried to train a 28*28 -> 100 -> 10 network on the MNIST digit dataset to recognize digits. When running it on the test samples, the accuracy was 97%, but when running on my own samples it barely ever manages to get it right. Does anyone have any idea on why this would happen? submitted by /u/FidgetSpinzz [link] [comments]  ( 42 min )
    Why add bias instead of subtracting bias?
    Pretty much the title, why do we add the bias instead of subtracting it? When I watched 3blue1brown's video about neural networks, he said that you subtract the bias, but other sources explain that you simply add the bias to the dot product instead of subtracting it. //Newbie submitted by /u/ArthurLCTTheCool [link] [comments]  ( 41 min )
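A small sketch may help with the question above: adding a bias and subtracting a threshold describe the same family of functions, because the bias is a learned parameter and can simply take the opposite sign. The function names here are illustrative, not taken from any library.

```python
from typing import Sequence


def neuron_add(weights: Sequence[float], inputs: Sequence[float], bias: float) -> float:
    """Pre-activation with an added bias: w . x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def neuron_sub(weights: Sequence[float], inputs: Sequence[float], threshold: float) -> float:
    """Pre-activation with a subtracted threshold: w . x - t."""
    return sum(w * x for w, x in zip(weights, inputs)) - threshold


# The two conventions agree whenever bias == -threshold.
w, x = [0.5, -1.0], [2.0, 3.0]
same = neuron_add(w, x, bias=-4.0) == neuron_sub(w, x, threshold=4.0)
```

Since training is free to learn either `b` or `-b`, the choice between the two is only a notational convention.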
    Machine Learning Framework with Neural Networks for Java
    Hey guys, a buddy of mine and me created a Framework for Machine Learning in Java. It provides the possibilities to train the neural Networks via backpropagation but we also implemented a Genetic Algorithm which can also be used for the training. The project is intended to be used by people who only learn Java in school and want to try out ML without the need of learning Python or complex Java Libraries. It's designed to be easy to use and to be played around with. Qualifications needed to use: basic Java understanding A brief understanding of what Machine Learning is Here is a step-by-step tutorial how to predict diabetes with this framework: https://easy-ml.gitbook.io/easy-ml-for-java/fundamentals/implement-your-first-ai (This example is using the genetic Algorithm, there is already one example in the source code published using the Backpropagation approach but the tutorial for it is gonna follow in the next few days) Please also look at the GitHub repository and leave some feedback about code and design. (Especially considering the ReadMe) https://github.com/tomLamprecht/Easy-ML-For-Java https://easy-ml.gitbook.io/easy-ml-for-java/ (Doc) Right now, I'm working on adding Convolutional Neural Networks as well. Feel free to also check open issues on our GitHub if you want to contribute! :) Thanks so much, and also especially for those who contributed to our project with pull requests. PS: we earn no cent with this project, and we just do it for the experience. So feedback is basically our payment :D (and ofc stars on GitHub hehe) submitted by /u/Lampard557 [link] [comments]  ( 42 min )
  • Open

    Best Egg achieved three times faster ML model training with Amazon SageMaker Automatic Model Tuning
    This post is co-authored by Tristan Miller from Best Egg. Best Egg is a leading financial confidence platform that provides lending products and resources focused on helping people feel more confident as they manage their everyday finances. Since March 2014, Best Egg has delivered $22 billion in consumer personal loans with strong credit performance, welcomed […]  ( 8 min )
  • Open

    What Are Large Language Models Used For?
    AI applications are summarizing articles, writing stories and engaging in long conversations — and large language models are doing the heavy lifting. A large language model, or LLM, is a deep learning algorithm that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. Large language models Read article >  ( 7 min )
    DLSS 3 Delivers Ultimate Boost in Latest Game Updates on GeForce NOW
    GeForce NOW RTX 4080 SuperPODs are rolling out now, bringing RTX 4080-class performance and features to Ultimate members — including support for NVIDIA Ada Lovelace GPU architecture technologies like NVIDIA DLSS 3.  This GFN Thursday brings updates to some of GeForce NOW’s hottest games that take advantage of these amazing technologies, all from the cloud. Read article >  ( 6 min )
  • Open

    Remove algorithmic filters from what you read
    I typically announce new blog posts from my most relevant twitter account: data science from @DataSciFact, algebra and miscellaneous math from @AlgebraFact, TeX and typography from @TeXtip, etc. If you’d like to be sure that you’re notified of each post, regardless of what algorithms Twitter applies to your feed, you can subscribe to this blog […] Remove algorithmic filters from what you read first appeared on John D. Cook.  ( 5 min )
  • Open

    Robustness through Data Augmentation Loss Consistency. (arXiv:2110.11205v3 [cs.LG] UPDATED)
    While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Synthetic data augmentation followed by empirical risk minimization (DA-ERM) is a simple and widely used solution to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. For example, dialog state covaries with named entity when we augment data with a new named entity. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks involving covariant data augmentation. Our code of all experiments is available at: https://github.com/optimization-for-data-driven-science/DAIR.git  ( 3 min )
    Neuronal architecture extracts statistical temporal patterns. (arXiv:2301.10203v1 [q-bio.NC])
    Neuronal systems need to process temporal signals. We here show how higher-order temporal (co-)fluctuations can be employed to represent and process information. Concretely, we demonstrate that a simple biologically inspired feedforward neuronal model is able to extract information from up to the third order cumulant to perform time series classification. This model relies on a weighted linear summation of synaptic inputs followed by a nonlinear gain function. Training both - the synaptic weights and the nonlinear gain function - exposes how the non-linearity allows for the transfer of higher order correlations to the mean, which in turn enables the synergistic use of information encoded in multiple cumulants to maximize the classification accuracy. The approach is demonstrated both on a synthetic and on real world datasets of multivariate time series. Moreover, we show that the biologically inspired architecture makes better use of the number of trainable parameters as compared to a classical machine-learning scheme. Our findings emphasize the benefit of biological neuronal architectures, paired with dedicated learning algorithms, for the processing of information embedded in higher-order statistical cumulants of temporal (co-)fluctuations.  ( 2 min )
    Gradient-adjusted Incremental Target Propagation Provides Effective Credit Assignment in Deep Neural Networks. (arXiv:2102.11598v3 [cs.LG] UPDATED)
    Many of the recent advances in the field of artificial intelligence have been fueled by the highly successful backpropagation of error (BP) algorithm, which efficiently solves the credit assignment problem in artificial neural networks. However, it is unlikely that BP is implemented in its usual form within biological neural networks, because of its reliance on non-local information in propagating error gradients. Since biological neural networks are capable of highly efficient learning and responses from BP trained models can be related to neural responses, it seems reasonable that a biologically viable approximation of BP underlies synaptic plasticity in the brain. Gradient-adjusted incremental target propagation (GAIT-prop or GP for short) has recently been derived directly from BP and has been shown to successfully train networks in a more biologically plausible manner. However, so far, GP has only been shown to work on relatively low-dimensional problems, such as handwritten-digit recognition. This work addresses some of the scaling issues in GP and shows it to perform effective multi-layer credit assignment in deeper networks and on the much more challenging ImageNet dataset.  ( 2 min )
    Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets. (arXiv:2301.10227v1 [eess.IV])
    Denoising diffusion probabilistic models have shown great potential in generating realistic image data. We show how those models can be used to generate realistic microscopy image data in 2D and 3D based on simulated sketches of cellular structures. Multiple data sets are used as an inspiration to simulate sketches of different cellular structures, allowing the generation of fully-annotated image data sets without requiring human interaction. Those data sets are used to train segmentation approaches and demonstrate that annotation-free segmentation of cellular structures in fluorescence microscopy image data can be achieved, thereby leaping towards the ultimate goal of eliminating the necessity of human annotation efforts.  ( 2 min )
    CovidRhythm: A Deep Learning Model for Passive Prediction of Covid-19 using Biobehavioral Rhythms Derived from Wearable Physiological Data. (arXiv:2301.10168v1 [eess.SP])
    We investigate whether a deep learning model can detect Covid-19 from disruptions in the human body's physiological (heart rate) and rest-activity rhythms (rhythmic dysregulation) caused by the SARS-CoV-2 virus. We propose CovidRhythm, a novel Gated Recurrent Unit (GRU) Network with Multi-Head Self-Attention (MHSA) that combines sensor and rhythmic features extracted from heart rate and activity (steps) data gathered passively using consumer-grade smart wearables to predict Covid-19. A total of 39 features were extracted (standard deviation, mean, min/max/avg length of sedentary and active bouts) from wearable sensor data. Biobehavioral rhythms were modeled using nine parameters (mesor, amplitude, acrophase, and intra-daily variability). These features were then input to CovidRhythm for predicting Covid-19 in the incubation phase (one day before biological symptoms manifest). A combination of sensor and biobehavioral rhythm features achieved the highest AUC-ROC of 0.79 [Sensitivity = 0.69, Specificity = 0.89, F$_{0.1}$ = 0.76], outperforming prior approaches in discriminating Covid-positive patients from healthy controls using 24 hours of historical wearable physiological data. Rhythmic features were the most predictive of Covid-19 infection when utilized either alone or in conjunction with sensor features. Sensor features predicted healthy subjects best. Circadian rest-activity rhythms that combine 24h activity and sleep information were the most disrupted. CovidRhythm demonstrates that biobehavioral rhythms derived from consumer-grade wearable data can facilitate timely Covid-19 detection. To the best of our knowledge, our work is the first to detect Covid-19 using deep learning and biobehavioral rhythm features derived from consumer-grade wearable data.  ( 2 min )
    Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis. (arXiv:2301.10183v1 [cs.SD])
    Computer musicians refer to mesostructures as the intermediate levels of articulation between the microstructure of waveshapes and the macrostructure of musical forms. Examples of mesostructures include melody, arpeggios, syncopation, polyphonic grouping, and textural contrast. Despite their central role in musical expression, they have received limited attention in deep learning. Currently, autoencoders and neural audio synthesizers are only trained and evaluated at the scale of microstructure: i.e., local amplitude variations up to 100 milliseconds or so. In this paper, we formulate and address the problem of mesostructural audio modeling via a composition of a differentiable arpeggiator and time-frequency scattering. We empirically demonstrate that time-frequency scattering serves as a differentiable model of similarity between synthesis parameters that govern mesostructure. By exposing the sensitivity of short-time spectral distances to time alignment, we motivate the need for a time-invariant and multiscale differentiable time-frequency model of similarity at the level of both local spectra and spectrotemporal modulations.  ( 2 min )
    On the Tradeoff between Energy, Precision, and Accuracy in Federated Quantized Neural Networks. (arXiv:2111.07911v3 [cs.LG] UPDATED)
    Deploying federated learning (FL) over wireless networks with resource-constrained devices requires balancing between accuracy, energy efficiency, and precision. Prior art on FL often requires devices to train deep neural networks (DNNs) using a 32-bit precision level for data representation to improve accuracy. However, such algorithms are impractical for resource-constrained devices since DNNs could require execution of millions of operations. Thus, training DNNs with a high precision level incurs a high energy cost for FL. In this paper, a quantized FL framework, that represents data with a finite level of precision in both local training and uplink transmission, is proposed. Here, the finite level of precision is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with the quantization are rigorously derived. An energy minimization problem is formulated with respect to the level of precision while ensuring convergence. To solve the problem, we first analytically derive the FL convergence rate and use a line search method. Simulation results show that our FL framework can reduce energy consumption by up to 53% compared to a standard FL model. The results also shed light on the tradeoff between precision, energy, and accuracy in FL over wireless networks.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Read the Signs: Towards Invariance to Gradient Descent's Hyperparameter Initialization. (arXiv:2301.10133v1 [cs.LG])
    We propose ActiveLR, an optimization meta algorithm that localizes the learning rate, $\alpha$, and adapts it at each epoch according to whether the gradient at each epoch changes sign or not. This sign-conscious algorithm is aware of whether, from the previous step to the current one, the update of each parameter has been too large or too small, and adjusts $\alpha$ accordingly. We implement the Active version (ours) of widely used and recently published gradient descent optimizers, namely SGD with momentum, AdamW, RAdam, and AdaBelief. Our experiments on ImageNet, CIFAR-10, WikiText-103, WikiText-2, and PASCAL VOC using different model architectures, such as ResNet and Transformers, show an increase in generalizability and training set fit, and a decrease in training time for the Active variants of the tested optimizers. The results also show robustness of the Active variant of these optimizers to different values of the initial learning rate. Furthermore, the detrimental effects of using large mini-batch sizes are mitigated. ActiveLR thus alleviates the need for hyper-parameter search for two of the most commonly tuned hyper-parameters, which require heavy time and computational costs to pick. We encourage AI researchers and practitioners to use the Active variant of their optimizer of choice for faster training, better generalizability, and a reduced carbon footprint when training deep neural networks.  ( 2 min )
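To make the sign-based idea in the abstract above concrete, here is a generic sketch of a per-parameter learning-rate rule driven by gradient sign changes. It is not the paper's ActiveLR algorithm: the function name, the multiplicative update, and the factors `up=1.2` and `down=0.5` are all assumptions chosen for illustration, in the spirit of classic sign-based schemes.

```python
from typing import List


def adapt_learning_rates(
    lrs: List[float],
    prev_grads: List[float],
    grads: List[float],
    up: float = 1.2,    # assumed growth factor when the sign is unchanged
    down: float = 0.5,  # assumed shrink factor when the sign flips
) -> List[float]:
    """Return new per-parameter learning rates based on gradient sign agreement."""
    new_lrs = []
    for lr, g_prev, g in zip(lrs, prev_grads, grads):
        if g_prev * g > 0:
            # Same sign: the previous step was likely too small, so grow the rate.
            new_lrs.append(lr * up)
        elif g_prev * g < 0:
            # Sign flipped: the previous step likely overshot, so shrink the rate.
            new_lrs.append(lr * down)
        else:
            # One of the gradients is zero: leave the rate unchanged.
            new_lrs.append(lr)
    return new_lrs
```

Each parameter keeps its own rate, which is one way to read "localizes the learning rate" in the abstract. The paper should be consulted for the actual update rule.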
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e. reasonably satisfactory performances for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with the bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performances of all clients and of the worst 10% clients.  ( 2 min )
    Analysis of Arrhythmia Classification on ECG Dataset. (arXiv:2301.10174v1 [cs.LG])
    The heart is one of the most vital organs in the human body. It supplies blood and nutrients to other parts of the body. Therefore, maintaining a healthy heart is essential. As a heart disorder, arrhythmia is a condition in which the heart's pumping mechanism becomes aberrant. The electrocardiogram (ECG) is used to analyze arrhythmia because it is simple and inexpensive. The peaks shown in the ECG graph are used to detect heart diseases, and the R peak is used to analyze arrhythmia. For detection, arrhythmia is grouped into two categories: tachycardia and bradycardia. In this paper, we discuss many different techniques, such as deep CNNs, LSTM, SVM, NN classifiers, wavelets, TQWT, etc., that have been used for detecting arrhythmia on various datasets over the previous decade. This work presents an analysis of arrhythmia classification on ECG datasets. In most of the surveyed work, data preprocessing, feature extraction, and classification were applied and achieved better performance in classifying ECG signals to detect arrhythmia. Automatic arrhythmia detection can help cardiologists make the right decisions immediately to save human life. In addition, this research presents various limitations of previous research, along with some challenges in detecting arrhythmia, that will help future research.  ( 2 min )
    Sleep Activity Recognition and Characterization from Multi-Source Passively Sensed Data. (arXiv:2301.10156v1 [eess.SP])
    Sleep constitutes a key indicator of human health, performance, and quality of life. Sleep deprivation has long been related to the onset, development, and worsening of several mental and metabolic disorders, constituting an essential marker for preventing, evaluating, and treating different health conditions. Sleep Activity Recognition methods can provide indicators to assess, monitor, and characterize subjects' sleep-wake cycles and detect behavioral changes. In this work, we propose a general method that continuously operates on passively sensed data from smartphones to characterize sleep and identify significant sleep episodes. Thanks to their ubiquity, these devices constitute an excellent alternative data source to profile subjects' biorhythms in a continuous, objective, and non-invasive manner, in contrast to traditional sleep assessment methods that usually rely on intrusive and subjective procedures. A Heterogeneous Hidden Markov Model is used to model a discrete latent variable process associated with the Sleep Activity Recognition task in a self-supervised way. We validate our results against sleep metrics reported by tested wearables, proving the effectiveness of the proposed approach and advocating its use to assess sleep without more reliable sources.  ( 2 min )
    EEG Opto-processor: epileptic seizure detection using diffractive photonic computing units. (arXiv:2301.10167v1 [eess.SP])
    Electroencephalography (EEG) analysis extracts critical information from brain signals, which has provided fundamental support for various applications, including brain-disease diagnosis and brain-computer interfaces. However, the real-time processing of large-scale EEG signals at high energy efficiency poses great challenges for electronic processors on edge computing devices. Here, we propose the EEG opto-processor based on diffractive photonic computing units (DPUs) to effectively process the extracranial and intracranial EEG signals and perform epileptic seizure detection. The signals of EEG channels within a second-time window are optically encoded as inputs to the constructed diffractive neural networks for classification, which monitor the brain state to determine whether or not it shows the symptoms of an epileptic seizure. We developed both the free-space and integrated DPUs as edge computing systems and demonstrated their applications for real-time epileptic seizure detection with the benchmark datasets, i.e., the CHB-MIT extracranial EEG dataset and Epilepsy-iEEG-Multicenter intracranial EEG dataset, at high computing performance. Along with the channel selection mechanism, both the numerical evaluations and experimental results validated the sufficiently high classification accuracies of the proposed opto-processors for supervising the clinical diagnosis. Our work opens up a new research direction of utilizing photonic computing techniques for processing large-scale EEG signals, promoting its broader applications.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over a large number of edge devices while keeping private user data strictly on device. In this work, motivated by the need to ensure fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm, FedMGDA+, that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against the state of the art.  ( 2 min )
    VaiPhy: a Variational Inference Based Algorithm for Phylogeny. (arXiv:2203.01121v3 [q-bio.PE] UPDATED)
    Phylogenetics is a classical methodology in computational biology that today has become highly relevant for medical investigation of single-cell data, e.g., in the context of cancer development. The exponential size of the tree space is, unfortunately, a substantial obstacle for Bayesian phylogenetic inference using Markov chain Monte Carlo based methods since these rely on local operations. And although more recent variational inference (VI) based methods offer speed improvements, they rely on expensive auto-differentiation operations for learning the variational parameters. We propose VaiPhy, a remarkably fast VI based algorithm for approximate posterior inference in an augmented tree space. VaiPhy produces marginal log-likelihood estimates on par with the state-of-the-art methods on real data and is considerably faster since it does not require auto-differentiation. Instead, VaiPhy combines coordinate ascent update equations with two novel sampling schemes: (i) SLANTIS, a proposal distribution for tree topologies in the augmented tree space, and (ii) the JC sampler, to the best of our knowledge, the first-ever scheme for sampling branch lengths directly from the popular Jukes-Cantor model. We compare VaiPhy in terms of density estimation and runtime. Additionally, we evaluate the reproducibility of the baselines. We provide our code on GitHub: https://github.com/Lagergren-Lab/VaiPhy.  ( 2 min )
    Lowering Detection in Sport Climbing Based on Orientation of the Sensor Enhanced Quickdraw. (arXiv:2301.10164v1 [eess.SP])
    Tracking climbers' activity to improve services and make the best use of their infrastructure is a concern for climbing gyms. Each climbing session must be analyzed from the beginning until the lowering of the climber. Therefore, spotting climbers descending is crucial since it indicates when the ascent has come to an end. This problem must be addressed while preserving the privacy and convenience of the climbers and keeping costs low for the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called a quickdraw, that connects the climbing rope to the bolt anchors. The sensors are configured to be energy-efficient, and hence practical in terms of expense and replacement time when used in large quantities in a climbing gym. This paper describes the hardware specifications, studies data measured by the sensors in ultra-low power mode, detects the sensors' orientation patterns during lowering on different routes, and develops a supervised approach to identify lowering.  ( 2 min )
    Sequential Graph Attention Learning for Predicting Dynamic Stock Trends (Student Abstract). (arXiv:2301.10153v1 [q-fin.ST])
    The stock market is characterized by a complex relationship between companies and the market. This study combines a sequential graph structure with attention mechanisms to learn global and local information over time. Specifically, our proposed "GAT-AGNN" module compares model performance across multiple industries as well as within single industries. The results show that the proposed framework outperforms the state-of-the-art methods in predicting stock trends across multiple industries on Taiwan Stock datasets.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    Computational Solar Energy -- Ensemble Learning Methods for Prediction of Solar Power Generation based on Meteorological Parameters in Eastern India. (arXiv:2301.10159v1 [cs.LG])
    The challenges in applications of solar energy lie in its intermittency and its dependency on meteorological parameters such as solar radiation, ambient temperature, rainfall, and wind speed, and on other physical parameters such as dust accumulation. Hence, it is important to estimate the amount of solar photovoltaic (PV) power generation for a specific geographical location. Machine learning (ML) models have gained importance and are widely used for prediction of solar power plant performance. In this paper, the impact of weather parameters on solar PV power generation is estimated by several Ensemble ML (EML) models like Bagging, Boosting, Stacking, and Voting for the first time. The performance of the chosen ML algorithms is validated on a field dataset from a 10kWp solar PV power plant in the Eastern India region. Furthermore, a complete test-bed framework has been designed for data mining as well as to select appropriate learning models. It also supports feature selection and reduction for the dataset to reduce the space and time complexity of the learning models. The results demonstrate greater prediction accuracy of around 96% for the Stacking and Voting EML models. The proposed work is a generalized one and can also be very useful for predicting the performance of large-scale solar PV power plants.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable for a broad range of downstream sequential decision making tasks.  ( 2 min )
    Pex: Memory-efficient Microcontroller Deep Learning through Partial Execution. (arXiv:2211.17246v2 [cs.LG] UPDATED)
    Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
    Multi-Agent Patrolling with Battery Constraints through Deep Reinforcement Learning. (arXiv:2212.08230v2 [cs.AI] UPDATED)
    Autonomous vehicles are suited for continuous area patrolling problems. However, finding an optimal patrolling strategy can be challenging for many reasons. Firstly, patrolling environments are often complex and can include unknown environmental factors. Secondly, autonomous vehicles can have failures or hardware constraints, such as limited battery life. Importantly, patrolling large areas often requires multiple agents that need to collectively coordinate their actions. In this work, we consider these limitations and propose an approach based on model-free, deep multi-agent reinforcement learning. In this approach, the agents are trained to automatically recharge themselves when required, to support continuous collective patrolling. A distributed homogeneous multi-agent architecture is proposed, where all patrolling agents execute identical policies locally based on their local observations and shared information. This architecture provides a fault-tolerant and robust patrolling system that can tolerate agent failures and allow supplementary agents to be added to replace failed agents or to increase the overall patrol performance. The solution is validated through simulation experiments from multiple perspectives, including the overall patrol performance, the efficiency of battery recharging strategies, and the overall fault tolerance and robustness.
    Dirac signal processing of higher-order topological signals. (arXiv:2301.10137v1 [eess.SP])
    We consider topological signals corresponding to variables supported on nodes, links and triangles of higher-order networks and simplicial complexes. So far such signals are typically processed independently of each other, and algorithms that can enforce a consistent processing of topological signals across different levels are largely lacking. Here we propose Dirac signal processing, an adaptive, unsupervised signal processing algorithm that learns to jointly filter topological signals supported on nodes, links and (filled) triangles of simplicial complexes in a consistent way. The proposed Dirac signal processing algorithm is rooted in algebraic topology and formulated in terms of the discrete Dirac operator, which can be interpreted as the "square root" of a higher-order (Hodge) Laplacian matrix acting on nodes, links and triangles of simplicial complexes. We test our algorithms on noisy synthetic data and noisy data of drifters in the ocean and find that the algorithm can learn to efficiently reconstruct the true signals, outperforming algorithms based exclusively on the Hodge Laplacian.
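The "square root" relation mentioned in the abstract above can be checked on a toy example. The following is a minimal sketch, not the authors' code: it assumes a single filled triangle with nodes 0, 1, 2, an arbitrary edge orientation, and hypothetical helper names (`block`, `matmul`, and so on). It builds the Dirac operator from the boundary matrices `B1` (nodes by links) and `B2` (links by triangles) and verifies that its square is block-diagonal, with the Hodge Laplacians of each order on the diagonal.

```python
# Toy check: the discrete Dirac operator squares to the Hodge Laplacians.
# Assumed complex: one filled triangle (0, 1, 2) with edges (0,1), (0,2), (1,2).

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]

def block(grid):
    """Assemble a matrix from a 2D grid of equally tall/wide blocks."""
    out = []
    for block_row in grid:
        for i in range(len(block_row[0])):
            out.append([x for blk in block_row for x in blk[i]])
    return out

# Boundary matrices: each edge (i, j) has -1 at node i and +1 at node j;
# the triangle boundary is (1,2) - (0,2) + (0,1).
B1 = [[-1, -1, 0], [1, 0, -1], [0, 1, 1]]
B2 = [[1], [-1], [1]]

# Dirac operator acting on the stacked (node, link, triangle) signal.
D = block([
    [zeros(3, 3), B1, zeros(3, 1)],
    [transpose(B1), zeros(3, 3), B2],
    [zeros(1, 3), transpose(B2), zeros(1, 1)],
])

# Hodge Laplacians of order 0, 1 and 2.
L0 = matmul(B1, transpose(B1))
L1 = add(matmul(transpose(B1), B1), matmul(B2, transpose(B2)))
L2 = matmul(transpose(B2), B2)

hodge = block([
    [L0, zeros(3, 3), zeros(3, 1)],
    [zeros(3, 3), L1, zeros(3, 1)],
    [zeros(1, 3), zeros(1, 3), L2],
])

assert matmul(D, D) == hodge  # D^2 is block-diagonal with L0, L1, L2
```

The off-diagonal blocks of the square vanish because the boundary of a boundary is zero, i.e. `matmul(B1, B2)` is the zero matrix.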
    Towards Asteroid Detection in Microlensing Surveys with Deep Learning. (arXiv:2211.02239v2 [astro-ph.EP] UPDATED)
    Asteroids are an indelible part of most astronomical surveys though only a few surveys are dedicated to their detection. Over the years, high cadence microlensing surveys have amassed several terabytes of data while scanning primarily the Galactic Bulge and Magellanic Clouds for microlensing events and thus provide a treasure trove of opportunities for scientific data mining. In particular, numerous asteroids have been observed by visual inspection of selected images. This paper presents novel deep learning-based solutions for the recovery and discovery of asteroids in the microlensing data gathered by the MOA project. Asteroid tracklets can be clearly seen by combining all the observations on a given night and these tracklets inform the structure of the dataset. Known asteroids were identified within these composite images and used for creating the labelled datasets required for supervised learning. Several custom CNN models were developed to identify images with asteroid tracklets. Model ensembling was then employed to reduce the variance in the predictions as well as to improve the generalisation error, achieving a recall of 97.67%. Furthermore, the YOLOv4 object detector was trained to localize asteroid tracklets, achieving a mean Average Precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids that have been observed by the survey over the years. The methodologies developed can be adapted for use by other surveys for asteroid recovery and discovery.
    Neural Implicit k-Space for Binning-free Non-Cartesian Cardiac MR Imaging. (arXiv:2212.08479v2 [eess.IV] UPDATED)
    In this work, we propose a novel image reconstruction framework that directly learns a neural implicit representation in k-space for ECG-triggered non-Cartesian Cardiac Magnetic Resonance Imaging (CMR). While existing methods bin acquired data from neighboring time points to reconstruct one phase of the cardiac motion, our framework allows for a continuous, binning-free, and subject-specific k-space representation. We assign a unique coordinate that consists of time, coil index, and frequency domain location to each sampled k-space point. We then learn the subject-specific mapping from these unique coordinates to k-space intensities using a multi-layer perceptron with frequency domain regularization. During inference, we obtain a complete k-space for Cartesian coordinates and an arbitrary temporal resolution. A simple inverse Fourier transform recovers the image, eliminating the need for density compensation and costly non-uniform Fourier transforms for non-Cartesian data. This novel imaging framework was tested on 42 radially sampled datasets from 6 subjects. The proposed method outperforms other techniques qualitatively and quantitatively using data from four and one heartbeat(s) and 30 cardiac phases. Our results for one heartbeat reconstruction of 50 cardiac phases show improved artifact removal and spatio-temporal resolution, leveraging the potential for real-time CMR.
    A Learning Based Hypothesis Test for Harmful Covariate Shift. (arXiv:2212.02742v3 [cs.LG] UPDATED)
    The ability to quickly and accurately identify covariate shift at test time is a critical and often overlooked component of safe machine learning systems deployed in high-risk domains. While methods exist for detecting when predictions should not be made on out-of-distribution test examples, identifying distributional level differences between training and test time can help determine when a model should be removed from the deployment setting and retrained. In this work, we define harmful covariate shift (HCS) as a change in distribution that may weaken the generalization of a predictive model. To detect HCS, we use the discordance between an ensemble of classifiers trained to agree on training data and disagree on test data. We derive a loss function for training this ensemble and show that the disagreement rate and entropy represent powerful discriminative statistics for HCS. Empirically, we demonstrate the ability of our method to detect harmful covariate shift with statistical certainty on a variety of high-dimensional datasets. Across numerous domains and modalities, we show state-of-the-art performance compared to existing methods, particularly when the number of observed test samples is small.
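The two statistics named in the abstract above, disagreement rate and entropy, can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's procedure: the function names are hypothetical, the disagreement rate is taken to be the fraction of test samples on which an ensemble member's label differs from the base classifier's, and the entropy is computed over the ensemble's hard votes for a single sample.

```python
import math
from collections import Counter

def disagreement_rate(base_preds, member_preds):
    """Fraction of samples where the member's label differs from the base label."""
    differing = sum(b != m for b, m in zip(base_preds, member_preds))
    return differing / len(base_preds)

def vote_entropy(votes):
    """Shannon entropy (in nats) of the label votes cast for one sample."""
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in Counter(votes).values())

# Hypothetical predictions on four test samples.
base = [0, 1, 1, 0]
member = [0, 1, 0, 1]
rate = disagreement_rate(base, member)

# Entropy is zero when all members agree and maximal for an even split.
unanimous = vote_entropy([1, 1, 1])
split = vote_entropy([0, 0, 1, 1])
```

Under these assumptions, a high disagreement rate or high vote entropy on the test set would be the signal that the test distribution differs from the training distribution in a way the ensemble can exploit.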
    Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. (arXiv:2210.10246v2 [cs.LG] UPDATED)
    Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, have a similar problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
    Quadruple-star systems are not always nested triples: a machine learning approach to dynamical stability. (arXiv:2301.09930v1 [cs.LG])
    The dynamical stability of quadruple-star systems has traditionally been treated as a problem involving two 'nested' triples which constitute a quadruple. In this novel study, we employed a machine learning algorithm, the multi-layer perceptron (MLP), to directly classify 2+2 and 3+1 quadruples based on their stability (or long-term boundedness). The training data sets for the classification, comprised of $5\times10^5$ quadruples each, were integrated using the highly accurate direct $N$-body code MSTAR. We also carried out a limited parameter space study of zero-inclination systems to directly compare quadruples to triples. We found that both our quadruple MLP models perform better than a 'nested' triple MLP approach, which is especially significant for 3+1 quadruples. The classification accuracies for the 2+2 MLP and 3+1 MLP models are 94% and 93% respectively, while the scores for the 'nested' triple approach are 88% and 66% respectively. This is a crucial implication for quadruple population synthesis studies. Our MLP models, which are very simple and almost instantaneous to implement, are available on GitHub, along with Python3 scripts to access them.
    Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v3 [cs.LG] UPDATED)
    Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.
    Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery. (arXiv:2210.10952v2 [stat.ME] UPDATED)
    The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization and null hypothesis statistical testing for item partitioning/selection, and finally, posthoc calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. Its item partitioning, based on exploratory factor analysis, is blind to the final nonlinear IRT model and is not performed in a manner consistent with goodness of fit to the final model. In this manuscript, we develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method uses sparsity-based shrinkage to obviate the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that item partitioning is consistent with the ultimate nonlinear factor model. We also analogize our multidimensional IRT model to probabilistic autoencoders, specifying an encoder function that amortizes the inference of ability parameters from item responses. The encoder function is equivalent to the "VBE" step in a stochastic variational Bayesian expectation maximization (VBEM) procedure that we use for approximate Bayesian inference on the entire model. We use the method on a sample of WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional posthoc method.
    Unsupervised Model Selection for Time-series Anomaly Detection. (arXiv:2210.01078v3 [cs.LG] UPDATED)
    Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper addresses that problem: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
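To make the idea of combining several surrogate rankings concrete, here is a minimal sketch using the Borda count. The Borda count is a simple, classical rank aggregation rule swapped in for illustration; it is not claimed to be the robust aggregation method the abstract above refers to. The detector names and the three example rankings are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine best-first rankings of the same candidates via Borda count.

    Each ranking awards n-1 points to its top candidate, n-2 to the next,
    and so on. Ties in total score are broken alphabetically.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] = scores.get(model, 0) + (n - 1 - position)
    return sorted(scores, key=lambda m: (-scores[m], m))

# One hypothetical ranking per surrogate metric class.
rankings = [
    ["lstm", "iforest", "knn"],  # ranked by prediction error
    ["iforest", "lstm", "knn"],  # ranked by model centrality
    ["lstm", "knn", "iforest"],  # ranked by performance on synthetic anomalies
]
best_first = borda_aggregate(rankings)  # ["lstm", "iforest", "knn"]
```

A plain Borda count treats every surrogate metric as equally trustworthy; a robust variant would need some way to discount metrics that rank poorly relative to the consensus.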
    Broken Neural Scaling Laws. (arXiv:2210.14891v5 [cs.LG] UPDATED)
    We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, training dataset size, or upstream performance varies) for various architectures and for each of various tasks within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision, language, audio, video, diffusion generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, arithmetic, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent). When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Lastly, we use this functional form to glean insights about the limit of the predictability of scaling behavior. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
    Tracking the industrial growth of modern China with high-resolution panchromatic imagery: A sequential convolutional approach. (arXiv:2301.09620v1 [cs.CV] CROSS LISTED)
    Because data on development in inaccessible regions is insufficient or difficult to obtain, remote sensing data is an important tool for interested stakeholders to collect information on economic growth. To date, no studies have utilized deep learning to estimate industrial growth at the level of individual sites. In this study, we harness high-resolution panchromatic imagery to estimate development over time at 419 industrial sites in the People's Republic of China using a multi-tier computer vision framework. We present two methods for approximating development: (1) structural area coverage estimated through a Mask R-CNN segmentation algorithm, and (2) imputing development directly with visible & infrared radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS). Labels generated from these methods are comparatively evaluated and tested. On a dataset of 2,078 50 cm resolution images spanning 19 years, the results indicate that two dimensions of industrial development can be estimated using high-resolution daytime imagery, including (a) the total square meters of industrial development (average error of 0.021 $\textrm{km}^2$), and (b) the radiance of lights (average error of 9.8 $\mathrm{\frac{nW}{cm^{2}sr}}$). Trend analysis of the techniques reveals that estimates from a Mask R-CNN-labeled CNN-LSTM track ground truth measurements most closely. The Mask R-CNN estimates positive growth at every site from the oldest image to the most recent, with an average change of 4,084 $\textrm{m}^2$.
    Self-Supervised Learning Through Efference Copies. (arXiv:2210.09224v2 [cs.LG] UPDATED)
    Self-supervised learning (SSL) methods aim to exploit the abundance of unlabelled data for machine learning (ML), however the underlying principles are often method-specific. An SSL framework derived from biological first principles of embodied learning could unify the various SSL methods, help elucidate learning in the brain, and possibly improve ML. SSL commonly transforms each training datapoint into a pair of views, uses the knowledge of this pairing as a positive (i.e. non-contrastive) self-supervisory sign, and potentially opposes it to unrelated, (i.e. contrastive) negative examples. Here, we show that this type of self-supervision is an incomplete implementation of a concept from neuroscience, the Efference Copy (EC). Specifically, the brain also transforms the environment through efference, i.e. motor commands, however it sends to itself an EC of the full commands, i.e. more than a mere SSL sign. In addition, its action representations are likely egocentric. From such a principled foundation we formally recover and extend SSL methods such as SimCLR, BYOL, and ReLIC under a common theoretical framework, i.e. Self-supervision Through Efference Copies (S-TEC). Empirically, S-TEC restructures meaningfully the within- and between-class representations. This manifests as improvement in recent strong SSL baselines in image classification, segmentation, object detection, and in audio. These results hypothesize a testable positive influence from the brain's motor outputs onto its sensory representations.
    Foresight -- Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs. (arXiv:2212.08072v2 [cs.CL] UPDATED)
    Background: Electronic Health Records hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Existing approaches focus mostly on structured data and a subset of single-domain outcomes. We explore how temporal modelling of patients from free text and structured data, using deep generative transformers, can be used to forecast a wide range of future disorders, substances, procedures or findings. Methods: We present Foresight, a novel transformer-based pipeline that uses named entity recognition and linking tools to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, substances, procedures and findings. We processed the entire free-text portion from three different hospital datasets totalling 811336 patients covering both physical and mental health. Findings: On tests in two UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset, precision@10 of 0.68, 0.76 and 0.88 was achieved for forecasting the next disorder in a patient timeline, while precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by five clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. As a generative model, it can forecast follow-on biomedical concepts for as many steps as required. Interpretation: Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk forecasting, virtual trials and clinical research to study the progression of disorders, to simulate interventions and counterfactuals, and for educational purposes.
    ESTAS: Effective and Stable Trojan Attacks in Self-supervised Encoders with One Target Unlabelled Sample. (arXiv:2211.10908v2 [cs.CV] UPDATED)
    Emerging self-supervised learning (SSL) has become a popular image representation encoding method to obviate the reliance on labeled data and learn rich representations from large-scale, ubiquitous unlabelled data. One can then train a downstream classifier on top of the pre-trained SSL image encoder with few or no labeled downstream data. Although extensive works show that SSL has achieved remarkable and competitive performance on different downstream tasks, its security concerns, e.g., Trojan attacks in SSL encoders, are still not well-studied. In this work, we present a novel Trojan attack method, denoted by ESTAS, that can enable an effective and stable attack in SSL encoders with only one target unlabeled sample. In particular, we propose consistent trigger poisoning and cascade optimization in ESTAS to improve attack efficacy and model accuracy, and eliminate the expensive target-class data sample extraction from large-scale disordered unlabelled data. Our substantial experiments on multiple datasets show that ESTAS stably achieves > 99% attack success rate (ASR) with one target-class sample. Compared to prior works, ESTAS attains > 30% ASR increase and > 8.3% accuracy improvement on average.
    Visual Simulation Software Demonstration for Quantum Multi-Drone Reinforcement Learning. (arXiv:2211.15375v2 [quant-ph] UPDATED)
    Quantum computing (QC) has received a lot of attention owing to its small number of trainable parameters and the computational speed afforded by qubits. Moreover, various researchers have tried to enable quantum machine learning (QML) using QC, and there are also multifarious efforts to use QC to implement quantum multi-agent reinforcement learning (QMARL). Existing classical multi-agent reinforcement learning (MARL) using neural networks suffers from non-stationarity and uncertainty due to its large number of parameters. Therefore, this paper presents a visual simulation software framework for a novel QMARL algorithm to control autonomous multi-drone systems to take advantage of QC. Our proposed QMARL framework accomplishes reasonable reward convergence and service quality performance with fewer trainable parameters than the classical MARL. Furthermore, QMARL shows more stable training results than existing MARL algorithms. Lastly, our proposed visual simulation software allows us to analyze the agents' training process and results.
    Green, Quantized Federated Learning over Wireless Networks: An Energy-Efficient Design. (arXiv:2207.09387v2 [cs.LG] UPDATED)
    In this paper, a green-quantized FL framework, which represents data with a finite precision level in both local training and uplink transmission, is proposed. Here, the finite precision level is captured through the use of quantized neural networks (QNNs) that quantize weights and activations in fixed-precision format. In the considered FL model, each device trains its QNN and transmits a quantized training result to the base station. Energy models for the local training and the transmission with quantization are rigorously derived. To minimize the energy consumption and the number of communication rounds simultaneously, a multi-objective optimization problem is formulated with respect to the number of local iterations, the number of selected devices, and the precision levels for both local training and transmission while ensuring convergence under a target accuracy constraint. To solve this problem, the convergence rate of the proposed FL system is analytically derived with respect to the system control variables. Then, the Pareto boundary of the problem is characterized to provide efficient solutions using the normal boundary inspection method. Design insights on balancing the tradeoff between the two objectives while achieving a target accuracy are drawn from using the Nash bargaining solution and analyzing the derived convergence rate. Simulation results show that the proposed FL framework can reduce energy consumption until convergence by up to 70% compared to a baseline FL algorithm that represents data with full precision without damaging the convergence rate.
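    To make "fixed-precision format" concrete, the sketch below rounds values to a signed fixed-point grid with a chosen number of bits. It is a generic deterministic quantizer written for illustration. The value range of [-1, 1), the rounding rule and the function name are assumptions, and the paper's own quantization scheme may differ.

```python
from typing import List, Sequence


def quantize_fixed_point(values: Sequence[float], n_bits: int) -> List[float]:
    """Round each value to a signed fixed-point grid covering [-1, 1).

    Assumed layout: 1 sign bit and (n_bits - 1) fractional bits, so the grid
    step is 2 ** -(n_bits - 1). Out-of-range values saturate at the grid ends.
    """
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2")
    scale = 2 ** (n_bits - 1)
    lowest, highest = -scale, scale - 1  # integer codes at the ends of the grid
    quantized = []
    for value in values:
        code = round(value * scale)             # nearest integer code
        code = min(max(code, lowest), highest)  # saturate instead of wrapping
        quantized.append(code / scale)
    return quantized


# With 3 bits the grid step is 0.25 and the largest representable value is 0.75.
example = quantize_fixed_point([0.3, -0.7, 2.0], n_bits=3)
```

    Fewer bits give a coarser grid, which is the precision level that the abstract trades off against energy use and convergence.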
    SIAN: Style-Guided Instance-Adaptive Normalization for Multi-Organ Histopathology Image Synthesis. (arXiv:2209.02412v2 [eess.IV] UPDATED)
    Existing deep neural networks for histopathology image synthesis cannot generate image styles that align with different organs, and cannot produce accurate boundaries of clustered nuclei. To address these issues, we propose a style-guided instance-adaptive normalization (SIAN) approach to synthesize realistic color distributions and textures for histopathology images from different organs. SIAN contains four phases: semantization, stylization, instantiation, and modulation. The first two phases synthesize image semantics and styles by using semantic maps and learned image style vectors. The instantiation module integrates geometrical and topological information and generates accurate nuclei boundaries. We validate the proposed approach on a multiple-organ dataset. Extensive experimental results demonstrate that the proposed method generates more realistic histopathology images than four state-of-the-art approaches for five organs. By incorporating synthetic images from the proposed approach into model training, an instance segmentation network can achieve state-of-the-art performance.
    RAIN: RegulArization on Input and Network for Black-Box Domain Adaptation. (arXiv:2208.10531v2 [cs.CV] UPDATED)
    Source-Free domain adaptation transfers the source-trained model to the target domain without exposing the source data, aiming to dispel concerns about data privacy and security. However, this paradigm is still at risk of data leakage due to adversarial attacks on the source model. Hence, the Black-Box setting only allows use of the outputs of the source model, but still suffers more severely from overfitting on the source domain due to the source model's unseen weights. In this paper, we propose a novel approach named RAIN (RegulArization on Input and Network) for Black-Box domain adaptation from both input-level and network-level regularization. For the input-level, we design a new data augmentation technique, Phase MixUp, which highlights task-relevant objects in the interpolations, thus enhancing input-level regularization and class consistency for target models. For network-level, we develop a Subnetwork Distillation mechanism to transfer knowledge from the target subnetwork to the full target network via knowledge distillation, which thus alleviates overfitting on the source domain by learning diverse target representations. Extensive experiments show that our method achieves state-of-the-art performance on several cross-domain benchmarks under both single- and multi-source black-box domain adaptation.
    A computational framework for physics-informed symbolic regression with straightforward integration of domain knowledge. (arXiv:2209.06257v3 [cs.LG] UPDATED)
    Discovering a meaningful symbolic expression that explains experimental data is a fundamental challenge in many scientific fields. We present a novel, open-source computational framework called Scientist-Machine Equation Detector (SciMED), which integrates scientific discipline wisdom in a scientist-in-the-loop approach with state-of-the-art symbolic regression (SR) methods. SciMED combines a wrapper selection method that is based on a genetic algorithm with automatic machine learning and two levels of SR methods. We test SciMED on five configurations of a settling sphere, with and without aerodynamic non-linear drag force, and with excessive noise in the measurements. We show that SciMED is sufficiently robust to discover the correct physically meaningful symbolic expressions from the data, and demonstrate how the integration of domain knowledge enhances its performance. Our results indicate better performance on these tasks than the state-of-the-art SR software packages, even in cases where no knowledge is integrated. Moreover, we demonstrate how SciMED can alert the user about possible missing features, unlike the majority of current SR systems.
    Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach. (arXiv:2109.08549v4 [cs.CY] UPDATED)
    Algorithms and models are increasingly deployed to inform decisions about people, inevitably affecting their lives. As a consequence, those in charge of developing these models must carefully evaluate their impact on different groups of people and favour group fairness, that is, ensure that groups determined by sensitive demographic attributes, such as race or sex, are not treated unjustly. To achieve this goal, the availability (awareness) of these demographic attributes to those evaluating the impact of these models is fundamental. Unfortunately, collecting and storing these attributes is often in conflict with industry practices and legislation on data minimisation and privacy. For this reason, it can be hard to measure the group fairness of trained models, even from within the companies developing them. In this work, we tackle the problem of measuring group fairness under unawareness of sensitive attributes, by using techniques from quantification, a supervised learning task concerned with directly providing group-level prevalence estimates (rather than individual-level class labels). We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem, as they are robust to inevitable distribution shifts while at the same time decoupling the (desirable) objective of measuring group fairness from the (undesirable) side effect of allowing the inference of sensitive attributes of individuals. More in detail, we show that fairness under unawareness can be cast as a quantification problem and solved with proven methods from the quantification literature. We show that these methods outperform previous approaches to measure demographic parity in five experimental protocols, corresponding to important challenges that complicate the estimation of classifier fairness under unawareness.
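    Quantification estimates how prevalent a class is in a group rather than labelling individuals. As a minimal sketch, the code below shows two textbook quantifiers, classify-and-count and its adjusted variant, which corrects the raw count using the classifier's true and false positive rates measured on held-out data. These are shown only as well-known examples from the quantification literature. The abstract does not say which methods the authors use, and the function names and toy numbers are illustrative.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Raw prevalence estimate: the fraction of items predicted positive (1)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    return sum(predictions) / len(predictions)


def adjusted_classify_and_count(
    predictions: Sequence[int], tpr: float, fpr: float
) -> float:
    """Correct the raw count for classifier error.

    Since E[raw] = p * tpr + (1 - p) * fpr, solving for the true prevalence p
    gives p = (raw - fpr) / (tpr - fpr). The result is clipped to [0, 1].
    """
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    raw = classify_and_count(predictions)
    return min(1.0, max(0.0, (raw - fpr) / (tpr - fpr)))


# Toy example: 40% predicted positive by a classifier with tpr=0.8 and fpr=0.2.
toy_predictions = [1] * 40 + [0] * 60
raw_estimate = classify_and_count(toy_predictions)
adjusted_estimate = adjusted_classify_and_count(toy_predictions, tpr=0.8, fpr=0.2)
```

    The estimate is a single group-level number, so no individual's sensitive attribute is output. This is the decoupling the abstract describes.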
    Learning to Counter: Stochastic Feature-based Learning for Diverse Counterfactual Explanations. (arXiv:2209.13446v2 [cs.AI] UPDATED)
    Interpretable machine learning seeks to understand the reasoning process of complex black-box systems that are long notorious for lack of explainability. One flourishing approach is through counterfactual explanations, which provide suggestions on what a user can do to alter an outcome. Not only must a counterfactual example counter the original prediction from the black-box classifier but it should also satisfy various constraints for practical applications. Diversity is one of the critical constraints that however remains less discussed. While diverse counterfactuals are ideal, it is computationally challenging to simultaneously address some other constraints. Furthermore, there is a growing privacy concern over the released counterfactual data. To this end, we propose a feature-based learning framework that effectively handles the counterfactual constraints and contributes itself to the limited pool of private explanation models. We demonstrate the flexibility and effectiveness of our method in generating diverse counterfactuals of actionability and plausibility. Our counterfactual engine is more efficient than counterparts of the same capacity while yielding the lowest re-identification risks.
    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights, which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding Summary Evidence Lower BOund. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.
    Can large language models reason about medical questions?. (arXiv:2207.08143v3 [cs.CL] UPDATED)
    Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether GPT-3.5 (Codex and InstructGPT) can be applied to answer and reason about difficult real-world-based questions. We utilize two multiple-choice medical exam question datasets (USMLE and MedMCQA) and a medical reading comprehension dataset (PubMedQA). We investigate multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), zero- and few-shot (prepending the question with question-answer exemplars) and retrieval augmentation (injecting Wikipedia passages into the prompt). For a subset of the USMLE questions, a medical expert reviewed and annotated the model's CoT. We found that InstructGPT can often read, reason and recall expert knowledge. Failures are primarily due to lack of knowledge and reasoning errors, and trivial guessing heuristics are observed, e.g. too often predicting labels A and D on USMLE. Sampling and combining many completions overcomes some of these limitations. Using 100 samples, Codex 5-shot CoT not only gives close to well-calibrated predictive probability but also achieves human-level performance on the three datasets: USMLE: 60.2%, MedMCQA: 62.7% and PubMedQA: 78.2%.
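    One simple way to combine many sampled completions is to take a majority vote over the final answer labels and use the vote shares as a rough predictive probability. The sketch below assumes that aggregation rule and assumes the answer label has already been extracted from each completion. The abstract does not spell out the exact procedure, so the function name and the rule itself are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def combine_completions(answers: Sequence[str]) -> Tuple[str, Dict[str, float]]:
    """Majority vote over answer labels extracted from sampled completions.

    Returns the winning label and the share of votes for every label, which
    can serve as a crude predictive probability. Ties are broken in favour of
    the label that was seen first.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(answers)
    total = len(answers)
    winner, _ = counts.most_common(1)[0]
    shares = {label: count / total for label, count in counts.items()}
    return winner, shares


# Toy example: five sampled completions for one multiple-choice question.
winner, shares = combine_completions(["A", "B", "A", "C", "A"])
```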
    To Trust or Not To Trust Prediction Scores for Membership Inference Attacks. (arXiv:2111.09076v3 [cs.LG] UPDATED)
    Membership inference attacks (MIAs) aim to determine whether a specific sample was used to train a predictive model. Knowing this may indeed lead to a privacy breach. Most MIAs, however, make use of the model's prediction scores - the probability of each output given some input - following the intuition that the trained model tends to behave differently on its training data. We argue that this is a fallacy for many modern deep network architectures. Consequently, MIAs will miserably fail since overconfidence leads to high false-positive rates not only on known domains but also on out-of-distribution data and implicitly acts as a defense against MIAs. Specifically, using generative adversarial networks, we are able to produce a potentially infinite number of samples falsely classified as part of the training data. In other words, the threat of MIAs is overestimated, and less information is leaked than previously assumed. Moreover, there is actually a trade-off between the overconfidence of models and their susceptibility to MIAs: the more classifiers know when they do not know, making low confidence predictions, the more they reveal the training data.
    Topogivity: A Machine-Learned Chemical Rule for Discovering Topological Materials. (arXiv:2202.05255v3 [cond-mat.mtrl-sci] UPDATED)
    Topological materials present unconventional electronic properties that make them attractive for both basic science and next-generation technological applications. The majority of currently known topological materials have been discovered using methods that involve symmetry-based analysis of the quantum wavefunction. Here we use machine learning to develop a simple-to-use heuristic chemical rule that diagnoses with a high accuracy whether a material is topological using only its chemical formula. This heuristic rule is based on a notion that we term topogivity, a machine-learned numerical value for each element that loosely captures its tendency to form topological materials. We next implement a high-throughput procedure for discovering topological materials based on the heuristic topogivity-rule prediction followed by ab initio validation. This way, we discover new topological materials that are not diagnosable using symmetry indicators, including several that may be promising for experimental observation.
    Gaze-based Object Detection in the Wild. (arXiv:2203.15651v2 [cs.RO] UPDATED)
    In human-robot collaboration, one challenging task is to teach a robot new yet unknown objects enabling it to interact with them. Thereby, gaze can contain valuable information. We investigate if it is possible to detect objects (object or no object) merely from gaze data and determine their bounding box parameters. For this purpose, we explore different sizes of temporal windows, which serve as a basis for the computation of heatmaps, i.e., the spatial distribution of the gaze data. Additionally, we analyze different grid sizes of these heatmaps, and demonstrate the functionality in a proof of concept using different machine learning techniques. Our method is characterized by its speed and resource efficiency compared to conventional object detectors. In order to generate the required data, we conducted a study with five subjects who could move freely and thus, turn towards arbitrary objects. This way, we chose a scenario for our data collection that is as realistic as possible. Since the subjects move while facing objects, the heatmaps also contain gaze data trajectories, complicating the detection and parameter regression. We make our data set publicly available to the research community for download.
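    The heatmap computation described above can be sketched as binning the gaze points that fall inside a temporal window onto a fixed grid. The sketch assumes gaze samples arrive as (timestamp, x, y) tuples with x and y normalised to [0, 1]. The data layout, function name and parameters are illustrative assumptions and do not come from the paper.

```python
from typing import Iterable, List, Tuple

GazeSample = Tuple[float, float, float]  # (timestamp, x, y), x and y in [0, 1]


def gaze_heatmap(
    samples: Iterable[GazeSample],
    t_start: float,
    t_end: float,
    grid: Tuple[int, int] = (4, 4),
) -> List[List[int]]:
    """Count gaze samples per grid cell within the window [t_start, t_end)."""
    rows, cols = grid
    heat = [[0] * cols for _ in range(rows)]
    for t, x, y in samples:
        if not (t_start <= t < t_end):
            continue  # outside the temporal window
        if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
            continue  # gaze fell outside the normalised frame
        col = min(int(x * cols), cols - 1)  # x == 1.0 goes into the last column
        row = min(int(y * rows), rows - 1)  # y == 1.0 goes into the last row
        heat[row][col] += 1
    return heat


# Toy example: four samples, one of which lies outside the one-second window.
toy_samples = [(0.0, 0.1, 0.1), (0.5, 0.9, 0.9), (1.5, 0.1, 0.1), (0.7, 1.0, 1.0)]
toy_heat = gaze_heatmap(toy_samples, t_start=0.0, t_end=1.0, grid=(2, 2))
```

    A longer window and a finer grid both change what the downstream detector sees. These are the two settings the abstract says were varied.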
    Mixed Effects Random Forests for Personalised Predictions of Clinical Depression Severity. (arXiv:2301.09815v1 [cs.LG])
    This work demonstrates how mixed effects random forests enable accurate predictions of depression severity using multimodal physiological and digital activity data collected from an 8-week study involving 31 patients with major depressive disorder. We show that mixed effects random forests outperform standard random forests and personal average baselines when predicting clinical Hamilton Depression Rating Scale scores (HDRS_17). Compared to the latter baseline, accuracy is significantly improved for each patient by an average of 0.199-0.276 in terms of mean absolute error (p<0.05). This is noteworthy as these simple baselines frequently outperform machine learning methods in mental health prediction tasks. We suggest that this improved performance results from the ability of the mixed effects random forest to personalise model parameters to individuals in the dataset. However, we find that these improvements pertain exclusively to scenarios where labelled patient data are available to the model at training time. Investigating methods that improve accuracy when generalising to new patients is left as important future work.
    Integrating Reward Maximization and Population Estimation: Sequential Decision-Making for Internal Revenue Service Audit Selection. (arXiv:2204.11910v3 [cs.LG] UPDATED)
    We introduce a new setting, optimize-and-estimate structured bandits. Here, a policy must select a batch of arms, each characterized by its own context, that would allow it to both maximize reward and maintain an accurate (ideally unbiased) population estimate of the reward. This setting is inherent to many public and private sector applications and often requires handling delayed feedback, small data, and distribution shifts. We demonstrate its importance on real data from the United States Internal Revenue Service (IRS). The IRS performs yearly audits of the tax base. Two of its most important objectives are to identify suspected misreporting and to estimate the "tax gap" -- the global difference between the amount paid and true amount owed. Based on a unique collaboration with the IRS, we cast these two processes as a unified optimize-and-estimate structured bandit. We analyze optimize-and-estimate approaches to the IRS problem and propose a novel mechanism for unbiased population estimation that achieves rewards comparable to baseline approaches. This approach has the potential to improve audit efficacy, while maintaining policy-relevant estimates of the tax gap. This has important social consequences given that the current tax gap is estimated at nearly half a trillion dollars. We suggest that this problem setting is fertile ground for further research and we highlight its interesting challenges. The results of this and related research are currently being incorporated into the continual improvement of the IRS audit selection methods.
    Efficient Planning in a Compact Latent Action Space. (arXiv:2208.10291v3 [cs.LG] UPDATED)
    Planning-based reinforcement learning has shown strong performance in tasks in discrete and low-dimensional continuous action spaces. However, planning usually brings significant computational overhead for decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and current state as input and reconstructs long-horizon trajectories. During inference time, given a starting state, TAP searches over discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency which is indifferent to the growing raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action space, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or data that are not linear in the mean, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have a lot of unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation are also capable of modeling non-Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.
    A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range. (arXiv:2203.13273v5 [cs.LG] UPDATED)
    We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100 while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at this URL
    Planckian Jitter: countering the color-crippling effects of color jitter on self-supervised training. (arXiv:2202.07993v2 [cs.CV] UPDATED)
    Several recent works on self-supervised learning are trained by mapping different augmentations of the same image to the same feature representation. The data augmentations used are of crucial importance to the quality of learned feature representations. In this paper, we analyze how the color jitter traditionally used in data augmentation negatively impacts the quality of the color features in learned feature representations. To address this problem, we propose a more realistic, physics-based color data augmentation - which we call Planckian Jitter - that creates realistic variations in chromaticity and produces a model robust to illumination changes that can be commonly observed in real life, while maintaining the ability to discriminate image content based on color information. Experiments confirm that such a representation is complementary to the representations learned with the currently-used color jitter augmentation and that a simple concatenation leads to significant performance gains on a wide range of downstream datasets. In addition, we present a color sensitivity analysis that documents the impact of different training methods on model neurons and shows that the performance of the learned features is robust with respect to illuminant variations.
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.
    MTTN: Multi-Pair Text to Text Narratives for Prompt Generation. (arXiv:2301.10172v1 [cs.CL])
    The explosive popularity of diffusion models [1][2][3] has provided a huge stage for further development in generative-text modelling. Prompt-based models are very nuanced: a carefully generated prompt can produce truly breathtaking images, yet producing a powerful or even meaningful prompt is hit or miss. To improve on this, we have introduced a large-scale derived and synthesized dataset built on real prompts and indexed with popular image-text datasets like MS-COCO [4], Flickr [5], etc. We have also introduced staging for these sentences that sequentially reduces the context and increases the complexity, which will further strengthen the output because of the complex annotations that are being created. MTTN consists of over 2.4M sentences that are divided over 5 stages, creating a combination amounting to over 12M pairs, along with a vocabulary of more than 300 thousand unique words that creates an abundance of variations. The original 2.4M pairs are broken down in such a manner that they produce a true scenario of internet lingo that is used globally, thereby heightening the robustness of the dataset and of any model trained on it.
    RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving. (arXiv:2301.10222v1 [cs.CV])
    Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
    A Watermark for Large Language Models. (arXiv:2301.10226v1 [cs.LG])
    Potential harms of large language models can be mitigated by watermarking model output, i.e., embedding signals into generated text that are invisible to humans but algorithmically detectable from a short span of tokens. We propose a watermarking framework for proprietary language models. The watermark can be embedded with negligible impact on text quality, and can be detected using an efficient open-source algorithm without access to the language model API or parameters. The watermark works by selecting a randomized set of whitelist tokens before a word is generated, and then softly promoting use of whitelist tokens during sampling. We propose a statistical test for detecting the watermark with interpretable p-values, and derive an information-theoretic framework for analyzing the sensitivity of the watermark. We test the watermark using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and discuss robustness and security.
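    The mechanism described above can be sketched in three steps: pick a pseudo-random whitelist of tokens, add a bias to whitelisted tokens' scores before sampling, and later test whether a text contains more whitelisted tokens than chance would give. The abstract leaves the details open, so the sketch makes several assumptions: the whitelist is seeded by hashing the previous token, a fixed fraction `gamma` of the vocabulary is whitelisted, promotion adds a constant `delta` to the logits, and detection uses a one-proportion z-score. All function and parameter names here are hypothetical.

```python
import hashlib
import math
import random
from typing import List, Sequence, Set


def whitelist(prev_token: int, vocab_size: int, gamma: float = 0.5) -> Set[int]:
    """Pseudo-random whitelist of token ids, seeded by the previous token (assumed)."""
    digest = hashlib.sha256(str(prev_token).encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    token_ids = list(range(vocab_size))
    rng.shuffle(token_ids)
    return set(token_ids[: int(gamma * vocab_size)])


def promote(
    logits: Sequence[float], prev_token: int, gamma: float = 0.5, delta: float = 2.0
) -> List[float]:
    """Softly favour whitelisted tokens by adding delta to their logits."""
    allowed = whitelist(prev_token, len(logits), gamma)
    return [score + delta if i in allowed else score for i, score in enumerate(logits)]


def detection_z_score(tokens: Sequence[int], vocab_size: int, gamma: float = 0.5) -> float:
    """z-score of the whitelist hit count against the no-watermark expectation.

    Without a watermark each token lands in its whitelist with probability
    gamma, so hits over n scored tokens have mean gamma * n and variance
    n * gamma * (1 - gamma).
    """
    n = len(tokens) - 1  # the first token has no predecessor to seed a whitelist
    if n < 1:
        raise ValueError("need at least two tokens to score")
    hits = sum(
        tokens[i] in whitelist(tokens[i - 1], vocab_size, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


# Toy example: a 'generator' that always emits a whitelisted token.
VOCAB_SIZE = 100
marked = [0]
for _ in range(50):
    marked.append(min(whitelist(marked[-1], VOCAB_SIZE)))
z_marked = detection_z_score(marked, VOCAB_SIZE)
```

    Detection needs only the token ids and the seeding rule, not the model, which matches the abstract's point that the detector runs without access to the language model API or parameters.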
    HTMOT: Hierarchical Topic Modelling Over Time. (arXiv:2112.03104v2 [cs.IR] UPDATED)
    Over the years, topic models have provided an efficient way of extracting insights from text. However, while many models have been proposed, none are able to model topic temporality and hierarchy jointly. Modelling time provides more precise topics by separating lexically close but temporally distinct topics, while modelling hierarchy provides a more detailed view of the content of a document corpus. In this study, we therefore propose a novel method, HTMOT, to perform Hierarchical Topic Modelling Over Time. We train HTMOT using a new implementation of Gibbs sampling, which is more efficient. Specifically, we show that only applying time modelling to deep sub-topics provides a way to extract specific stories or events while high-level topics extract larger themes in the corpus. Our results show that our training procedure is fast and can extract accurate high-level topics and temporally precise sub-topics. We measured our model's performance using the Word Intrusion task and outlined some limitations of this evaluation method, especially for hierarchical models. As a case study, we focused on the various developments in the space industry in 2020.
    Leveraging Vision-Language Models for Granular Market Change Prediction. (arXiv:2301.10166v1 [q-fin.ST])
    Predicting the future direction of stock markets from historical data has been a fundamental component of financial forecasting. This historical data contains the information of a stock in each specific time span, such as the opening, closing, lowest, and highest price. Leveraging this data, the future direction of the market is commonly predicted using various time-series models such as Long Short-Term Memory networks. This work proposes modeling and predicting market movements with a fundamentally new approach, namely by utilizing image and byte-based number representation of the stock data processed with the recently introduced Vision-Language models. We conduct a large set of experiments on the hourly stock data of the German share index and evaluate various architectures on stock price prediction using historical stock data. We conduct a comprehensive evaluation of the results with various metrics to accurately depict the actual performance of various approaches. Our evaluation results show that our novel approach based on representation of stock data as text (bytes) and image significantly outperforms strong deep learning-based baselines.
    WEASEL 2.0 -- A Random Dilated Dictionary Transform for Fast, Accurate and Memory Constrained Time Series Classification. (arXiv:2301.10194v1 [cs.LG])
    A time series is a sequence of sequentially ordered real values in time. Time series classification (TSC) is the task of assigning a time series to one of a set of predefined classes, usually based on a model learned from examples. Dictionary-based methods for TSC rely on counting the frequency of certain patterns in time series and are important components of the currently most accurate TSC ensembles. One of the early dictionary-based methods was WEASEL, which at its time achieved SotA results while also being very fast. However, it is outperformed both in terms of speed and accuracy by other methods. Furthermore, its design leads to an unpredictably large memory footprint, making it inapplicable for many applications. In this paper, we present WEASEL 2.0, a complete overhaul of WEASEL based on two recent advancements in TSC: Dilation and ensembling of randomized hyper-parameter settings. These two techniques allow WEASEL 2.0 to work with a fixed-size memory footprint while at the same time improving accuracy. Compared to 15 other SotA methods on the UCR benchmark set, WEASEL 2.0 is significantly more accurate than other dictionary methods and not significantly worse than the currently best methods. Actually, it achieves the highest median accuracy over all data sets, and it performs best in 5 out of 12 problem classes. We thus believe that WEASEL 2.0 is a viable alternative for current TSC and also a potentially interesting input for future ensembles.
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.
    Fine-grained Early Frequency Attention for Deep Speaker Representation Learning. (arXiv:2009.01822v2 [eess.AS] UPDATED)
    Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are being used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, despite the fact that important speaker-related information can be embedded in individual frequency-bins of the input spectral representations, current attention models are unable to attend to fine-grained information items in spectral representations. In this paper we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. Our model is a simple and lightweight model that can be integrated into various CNN pipelines and is capable of focusing on information items as small as frequency-bins. We evaluate the proposed model on three tasks: speaker recognition, speech emotion recognition, and spoken digit recognition. We use three widely used public datasets, namely VoxCeleb, IEMOCAP, and Free Spoken Digit Dataset, for our experiments. We attach FEFA to several prominent deep learning models and evaluate its impact on the final performance. We also compare our work with other related works in the area. Our experiments show that by adding FEFA to different CNN architectures, performance is consistently improved by substantial margins, and the models equipped with FEFA outperform all the other attentive models. We also test our model against different levels of added noise, showing improvements in robustness and less sensitivity compared to the backbone networks.
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Topological Understanding of Neural Networks, a survey. (arXiv:2301.09742v1 [cs.LG])
    We look at the internal structure of neural networks which is usually treated as a black box. The easiest and the most comprehensible thing to do is to look at a binary classification and try to understand the approach a neural network takes. We review the significance of different activation functions, types of network architectures associated to them, and some empirical data. We find some interesting observations and a possibility to build upon the ideas to verify the process for real datasets. We suggest some possible experiments to look forward to in three different directions.
    A predictive physics-aware hybrid reduced order model for reacting flows. (arXiv:2301.09860v1 [cs.LG])
    In this work, a new hybrid predictive Reduced Order Model (ROM) is proposed to solve reacting flow problems. This algorithm is based on a dimensionality reduction using Proper Orthogonal Decomposition (POD) combined with deep learning architectures. The number of degrees of freedom is reduced from thousands of temporal points to a few POD modes with their corresponding temporal coefficients. Two different deep learning architectures have been tested to predict the temporal coefficients, based on recurrent (RNN) and convolutional (CNN) neural networks. From each architecture, different models have been created to understand the behavior of each parameter of the neural network. Results show that these architectures are able to predict the temporal coefficients of the POD modes, as well as the whole snapshots. The RNN shows lower prediction error for all the variables analyzed. The model was also found capable of predicting more complex simulations, showing transfer learning capabilities.
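    The POD reduction step mentioned above can be sketched with a truncated singular value decomposition. This is a generic illustration, not the authors' pipeline: the function name `pod`, the snapshot-matrix layout (one column per time instant), and the choice of `r` retained modes are assumptions. The temporal coefficients returned here are the quantities a downstream network would be trained to predict.

```python
import numpy as np


def pod(snapshots: np.ndarray, r: int):
    """Truncated POD of a snapshot matrix of shape (n_dof, n_times).

    Returns (modes, coeffs) with modes of shape (n_dof, r) and
    coeffs of shape (r, n_times), so that modes @ coeffs approximates snapshots.
    """
    u, s, vt = np.linalg.svd(snapshots, full_matrices=False)
    modes = u[:, :r]                  # spatial POD modes (orthonormal columns)
    coeffs = s[:r, None] * vt[:r, :]  # temporal coefficients of each mode
    return modes, coeffs


if __name__ == "__main__":
    # Synthetic rank-2 field: two spatial patterns with different time histories.
    x = np.linspace(0.0, np.pi, 200)
    t = np.linspace(0.0, 1.0, 50)
    field = np.outer(np.sin(x), np.cos(t)) + 0.5 * np.outer(np.cos(2 * x), np.sin(3 * t))
    modes, coeffs = pod(field, r=2)
    print(np.linalg.norm(field - modes @ coeffs))  # reconstruction error
```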
    Towards Modular Machine Learning Solution Development: Benefits and Trade-offs. (arXiv:2301.09753v1 [cs.LG])
    Machine learning technologies have demonstrated immense capabilities in various domains. They play a key role in the success of modern businesses. However, adoption of machine learning technologies has a lot of untouched potential. The cost of developing custom machine learning solutions that solve unique business problems is a major inhibitor to far-reaching adoption of machine learning technologies. We recognize that the monolithic nature prevalent in today's machine learning applications stands in the way of efficient and cost-effective customized machine learning solution development. In this work we explore the benefits of modular machine learning solutions and discuss how modular machine learning solutions can overcome some of the major solution engineering limitations of monolithic machine learning solutions. We analyze the trade-offs between modular and monolithic machine learning solutions through three deep learning problems: one text-based and two image-based. Our experimental results show that modular machine learning solutions have a promising potential to reap the solution engineering advantages of modularity while gaining performance and data advantages in a way the monolithic machine learning solutions do not permit.
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. Among the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out-of-sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Interpretable Tsetlin Machine-based Premature Ventricular Contraction Identification. (arXiv:2301.10181v1 [eess.SP])
    Neural network-based models have found wide use in automatic long-term electrocardiogram (ECG) analysis. However, such black box models are inadequate for analysing physiological signals where credibility and interpretability are crucial. Indeed, how to make ECG analysis transparent is still an open problem. In this study, we develop a Tsetlin machine (TM) based architecture for premature ventricular contraction (PVC) identification by analysing long-term ECG signals. The architecture is transparent by describing patterns directly with logical AND rules. To validate the accuracy of our approach, we compare the TM performance with that of convolutional neural networks (CNNs). Our numerical results demonstrate that TM provides performance comparable to CNNs on the MIT-BIH database. To validate interpretability, we provide explanatory diagrams that show how TM makes the PVC identification from confirming and invalidating patterns. We argue that these are compatible with medical knowledge so that they can be readily understood and verified by a medical doctor. Accordingly, we believe this study paves the way for machine learning (ML) for ECG analysis in clinical practice.
    Inference of Continuous Linear Systems from Data with Guaranteed Stability. (arXiv:2301.10060v1 [cs.LG])
    Machine-learning technologies for learning dynamical systems from data play an important role in engineering design. This research focuses on learning continuous linear models from data. Stability, a key feature of dynamic systems, is especially important in design tasks such as prediction and control. Thus, there is a need to develop methodologies that provide stability guarantees. To that end, we leverage the parameterization of stable matrices proposed in [Gillis/Sharma, Automatica, 2017] to realize the desired models. Furthermore, to avoid the estimation of derivative information to learn continuous systems, we formulate the inference problem in an integral form. We also discuss a few extensions, including those related to control systems. Numerical experiments show that the combination of a stable matrix parameterization and an integral form of differential equations allows us to learn stable systems without requiring derivative information, which can be challenging to obtain in situations with noisy or limited data.
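    One way to see why a matrix parameterization can guarantee stability is the following sketch. It assumes the cited parameterization takes the form A = (J - R) Q with J skew-symmetric, R symmetric positive semidefinite, and Q symmetric positive definite; the construction from unconstrained matrices `s`, `l`, `m`, the function name `stable_matrix`, and the small shift `eps` are choices made for this illustration only. Because any choice of the unconstrained matrices yields a stable A, an optimizer can search over them freely.

```python
import numpy as np


def stable_matrix(s: np.ndarray, l: np.ndarray, m: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Build A = (J - R) @ Q from three unconstrained square matrices.

    J = s - s.T           is skew-symmetric,
    R = l @ l.T + eps*I   is symmetric positive definite,
    Q = m @ m.T + eps*I   is symmetric positive definite,
    so every eigenvalue of A has negative real part.
    """
    n = s.shape[0]
    j = s - s.T
    r = l @ l.T + eps * np.eye(n)
    q = m @ m.T + eps * np.eye(n)
    return (j - r) @ q


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = stable_matrix(*(rng.standard_normal((4, 4)) for _ in range(3)))
    print(np.linalg.eigvals(a).real)  # all entries are negative
```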
    Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning. (arXiv:2301.10119v1 [cs.LG])
    Learning models of the environment from pure interaction is often considered an essential component of building lifelong reinforcement learning agents. However, the common practice in model-based reinforcement learning is to learn models that model every aspect of the agent's environment, regardless of whether they are important in coming up with optimal decisions or not. In this paper, we argue that such models are not particularly well-suited for performing scalable and robust planning in lifelong reinforcement learning scenarios and we propose new kinds of models that only model the relevant aspects of the environment, which we call "minimal value-equivalent partial models". After providing a formal definition for these models, we provide theoretical results demonstrating the scalability advantages of performing planning with such models and then perform experiments to empirically illustrate our theoretical results. Then, we provide some useful heuristics on how to learn these kinds of models with deep learning architectures and empirically demonstrate that models learned in such a way can allow for performing planning that is robust to distribution shifts and compounding model errors. Overall, both our theoretical and empirical results suggest that minimal value-equivalent partial models can provide significant benefits to performing scalable and robust planning in lifelong reinforcement learning scenarios.  ( 2 min )
    Autonomous particles. (arXiv:2301.10077v1 [cs.LG])
    Consider a reinforcement learning problem where an agent has access to a very large amount of information about the environment, but it can only take very few actions to accomplish its task and to maximize its reward. Evidently, the main problem for the agent is to learn a map from a very high-dimensional space (which represents its environment) to a very low-dimensional space (which represents its actions). The high-to-low dimensional map implies that most of the information about the environment is irrelevant for the actions to be taken, and only a small fraction of information is relevant. In this paper we argue that the relevant information need not be learned by brute force (which is the standard approach), but can be identified from the intrinsic symmetries of the system. We analyze in detail a reinforcement learning problem of autonomous driving, where the corresponding symmetry is the Galilean symmetry, and argue that the learning task can be accomplished with very few relevant parameters, or, more precisely, invariants. For a numerical demonstration, we show that the autonomous vehicles (which we call autonomous particles since they describe very primitive vehicles) need only four relevant invariants to learn how to drive very well without colliding with other particles. The simple model can be easily generalized to include different types of particles (e.g. for cars, for pedestrians, for buildings, for road signs, etc.) with different types of relevant invariants describing interactions between them. We also argue that there must exist a field theory description of the learning system where autonomous particles would be described by fermionic degrees of freedom and interactions mediated by the relevant invariants would be described by bosonic degrees of freedom.  ( 2 min )
    Differentiable bit-rate estimation for neural-based video codec enhancement. (arXiv:2301.09776v1 [eess.IV])
    Neural networks (NN) can improve standard video compression by pre- and post-processing the encoded video. For optimal NN training, the standard codec needs to be replaced with a codec proxy that can provide derivatives of estimated bit-rate and distortion, which are used for gradient back-propagation. Since entropy coding of standard codecs is designed to take into account non-linear dependencies between transform coefficients, bit-rates cannot be well approximated with simple per-coefficient estimators. This paper presents a new approach for bit-rate estimation that is similar to the type employed in training end-to-end neural codecs, and able to efficiently take into account those statistical dependencies. It is defined from a mathematical model that provides closed-form formulas for the estimates and their gradients, reducing the computational complexity. Experimental results demonstrate the method's accuracy in estimating HEVC/H.265 codec bit-rates.  ( 2 min )
    Explainable Data-Driven Optimization: From Context to Decision and Back Again. (arXiv:2301.10074v1 [cs.LG])
    Data-driven optimization uses contextual information and machine learning algorithms to find solutions to decision problems with uncertain parameters. While a vast body of work is dedicated to interpreting machine learning models in the classification setting, explaining decision pipelines involving learning algorithms remains unaddressed. This lack of interpretability can block the adoption of data-driven solutions as practitioners may not understand or trust the recommended decisions. We bridge this gap by introducing a counterfactual explanation methodology tailored to explain solutions to data-driven problems. We introduce two classes of explanations and develop methods to find nearest explanations of random forest and nearest-neighbor predictors. We demonstrate our approach by explaining key problems in operations management such as inventory management and routing.  ( 2 min )
    PolarAir: A Compressed Sensing Scheme for Over-the-Air Federated Learning. (arXiv:2301.10110v1 [cs.IT])
    We explore a scheme that enables the training of a deep neural network in a Federated Learning configuration over an additive white Gaussian noise channel. The goal is to create a low complexity, linear compression strategy, called PolarAir, that reduces the size of the gradient at the user side to lower the number of channel uses needed to transmit it. The suggested approach belongs to the family of compressed sensing techniques, yet it constructs the sensing matrix and the recovery procedure using multiple access techniques. Simulations show that it can reduce the number of channel uses by ~30% when compared to conveying the gradient without compression. The main advantage of the proposed scheme over other schemes in the literature is its low time complexity. We also investigate the behavior of gradient updates and the performance of PolarAir throughout the training process to obtain insight on how best to construct this compression scheme based on compressed sensing.  ( 2 min )
    Intrinsic Motivation in Model-based Reinforcement Learning: A Brief Review. (arXiv:2301.10067v1 [cs.LG])
    The reinforcement learning research area contains a wide range of methods for solving the problems of intelligent agent control. Despite the progress that has been made, the task of creating a highly autonomous agent is still a significant challenge. One potential solution to this problem is intrinsic motivation, a concept derived from developmental psychology. This review considers the existing methods for determining intrinsic motivation based on the world model obtained by the agent. We propose a systematic approach to current research in this field, which consists of three categories of methods, distinguished by the way they utilize a world model in the agent's components: complementary intrinsic reward, exploration policy, and intrinsically motivated goals. The proposed unified framework describes the architecture of agents using a world model and intrinsic motivation to improve learning. The potential for developing new techniques in this area of research is also examined.  ( 2 min )
    A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data. (arXiv:2301.10053v1 [cs.LG])
    Personal data collected at scale from surveys or digital devices offers important insights for statistical analysis and scientific research. Safely sharing such data while protecting privacy is, however, challenging. Anonymization allows data to be shared while minimizing privacy risks, but traditional anonymization techniques have been repeatedly shown to provide limited protection against re-identification attacks in practice. Among modern anonymization techniques, synthetic data generation (SDG) has emerged as a potential solution to find a good tradeoff between privacy and statistical utility. Synthetic data is typically generated using algorithms that learn the statistical distribution of the original records, to then generate "artificial" records that are structurally and statistically similar to the original ones. Yet, the fact that synthetic records are "artificial" does not, per se, guarantee that privacy is protected. In this work, we systematically evaluate the tradeoffs between protecting privacy and preserving statistical utility for a wide range of synthetic data generation algorithms. Modeling privacy as protection against attribute inference attacks (AIAs), we extend and adapt linear reconstruction attacks, which have not been previously studied in the context of synthetic data. While prior work suggests that AIAs may be effective only on a few outlier records, we show they can be very effective even on randomly selected records. We evaluate attacks on synthetic datasets ranging from 10^3 to 10^6 records, showing that even for the same generative model, the attack effectiveness can drastically increase when a larger number of synthetic records is generated. Overall, our findings prove that synthetic data is subject to privacy-utility tradeoffs just like other anonymization techniques: when good utility is preserved, attribute inference can be a risk for many data subjects.  ( 2 min )
    SMART: Self-supervised Multi-task pretrAining with contRol Transformers. (arXiv:2301.09816v1 [cs.LG])
    Self-supervised pretraining has been extensively studied in language and vision domains, where a unified model can be easily adapted to various downstream tasks by pretraining representations without explicit labels. When it comes to sequential decision-making tasks, however, it is difficult to properly design such a pretraining approach that can cope with both high-dimensional perceptual information and the complexity of sequential control over long interaction horizons. The challenge becomes combinatorially more complex if we want to pretrain representations amenable to a large variety of tasks. To tackle this problem, in this work, we formulate a general pretraining-finetuning pipeline for sequential decision making, under which we propose a generic pretraining framework, Self-supervised Multi-task pretrAining with contRol Transformer (SMART). By systematically investigating pretraining regimes, we carefully design a Control Transformer (CT) coupled with a novel control-centric pretraining objective in a self-supervised manner. SMART encourages the representation to capture the common essential information relevant to short-term control and long-term control, which is transferable across tasks. We show by extensive experiments in DeepMind Control Suite that SMART significantly improves the learning efficiency among seen and unseen downstream tasks and domains under different learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality pretraining datasets that are randomly collected.  ( 2 min )
    Koopman neural operator as a mesh-free solver of non-linear partial differential equations. (arXiv:2301.10022v1 [cs.LG])
    The lack of analytic solutions for diverse partial differential equations (PDEs) has given rise to a series of computational techniques for numerical solutions. In machine learning, many recent advances in solver design have come from developing neural operators, a kind of mesh-free approximator of the infinite-dimensional operators that map between different parameterization spaces of equation solutions. Although neural operators exhibit generalization capacities for learning an entire PDE family simultaneously, they become less accurate and explainable while learning long-term behaviours of non-linear PDE families. In this paper, we propose the Koopman neural operator (KNO), a new neural operator, to overcome these challenges. With the same objective of learning an infinite-dimensional mapping between Banach spaces that serves as the solution operator of the target PDE family, our approach differs from existing models by formulating a non-linear dynamic system of equation solution. By approximating the Koopman operator, an infinite-dimensional linear operator governing all possible observations of the dynamic system, to act on the flow mapping of the dynamic system, we can equivalently learn the solution of an entire non-linear PDE family by solving simple linear prediction problems. In zero-shot prediction and long-term prediction experiments on representative PDEs (e.g., the Navier-Stokes equation), KNO exhibits notable advantages in breaking the tradeoff between accuracy and efficiency (e.g., model size) while previous state-of-the-art models are limited. These results suggest that more efficient PDE solvers can be developed by the joint efforts from physics and machine learning.  ( 2 min )
    Robust Fair Clustering: A Novel Fairness Attack and Defense Framework. (arXiv:2210.01953v2 [cs.LG] UPDATED)
    Clustering algorithms are widely used in many societal resource allocation applications, such as loan approvals and candidate recruitment, among others, and hence, biased or unfair model outputs can adversely impact individuals that rely on these applications. To this end, many fair clustering approaches have been recently proposed to counteract this issue. Due to the potential for significant harm, it is essential to ensure that fair clustering algorithms provide consistently fair outputs even under adversarial influence. However, fair clustering algorithms have not been studied from an adversarial attack perspective. In contrast to previous research, we seek to bridge this gap and conduct a robustness analysis against fair clustering by proposing a novel black-box fairness attack. Through comprehensive experiments, we find that state-of-the-art models are highly susceptible to our attack as it can reduce their fairness performance significantly. Finally, we propose Consensus Fair Clustering (CFC), the first robust fair clustering approach that transforms consensus clustering into a fair graph partitioning problem, and iteratively learns to generate fair cluster outputs. Experimentally, we observe that CFC is highly robust to the proposed attack and is thus a truly robust fair clustering alternative.  ( 2 min )
    3D UX-Net: A Large Kernel Volumetric ConvNet Modernizing Hierarchical Transformer for Medical Image Segmentation. (arXiv:2209.15076v3 [cs.CV] UPDATED)
    Recent 3D medical ViTs (e.g., SwinUNETR) achieve state-of-the-art performance on several 3D volumetric data benchmarks, including 3D medical image segmentation. Hierarchical transformers (e.g., Swin Transformers) reintroduced several ConvNet priors and further enhanced the practical viability of adapting volumetric segmentation in 3D medical datasets. The effectiveness of hybrid approaches is largely credited to the large receptive field for non-local self-attention and the large number of model parameters. In this work, we propose a lightweight volumetric ConvNet, termed 3D UX-Net, which adapts the hierarchical transformer using ConvNet modules for robust volumetric segmentation. Specifically, we revisit volumetric depth-wise convolutions with large kernel size (e.g. starting from $7\times7\times7$) to enable the larger global receptive fields, inspired by Swin Transformer. We further substitute the multi-layer perceptron (MLP) in Swin Transformer blocks with pointwise depth convolutions and enhance model performance with fewer normalization and activation layers, thus reducing the number of model parameters. 3D UX-Net competes favorably with current SOTA transformers (e.g. SwinUNETR) using three challenging public datasets on volumetric brain and abdominal imaging: 1) MICCAI Challenge 2021 FLARE, 2) MICCAI Challenge 2021 FeTA, and 3) MICCAI Challenge 2022 AMOS. 3D UX-Net consistently outperforms SwinUNETR with improvement from 0.929 to 0.938 Dice (FLARE2021) and 0.867 to 0.874 Dice (FeTA2021). We further evaluate the transfer learning capability of 3D UX-Net with AMOS2022 and demonstrate another improvement of $2.27\%$ Dice (from 0.880 to 0.900). The source code for our proposed model is available at https://github.com/MASILab/3DUX-Net.  ( 2 min )
    Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression. (arXiv:2301.09830v1 [cs.LG])
    In training modern large natural language processing (NLP) models, it has become common practice to split models across multiple GPUs using 3D parallelism. This technique, however, suffers from a high overhead of inter-node communication. Compressing the communication is one way to mitigate the overhead by reducing the inter-node traffic volume; however, existing compression techniques have critical limitations when applied to NLP models with 3D parallelism, in that 1) only the data-parallel traffic is targeted, and 2) the existing compression schemes already harm the model quality too much. In this paper, we present Optimus-CC, a fast and scalable distributed training framework for large NLP models with aggressive communication compression. Optimus-CC differs from existing communication compression frameworks in the following ways: First, we compress pipeline-parallel (inter-stage) traffic. Specifically, we compress the inter-stage backpropagation and the embedding synchronization in addition to applying the existing data-parallel traffic compression methods. Second, we propose techniques to avoid the model quality drop that comes from the compression. We further provide mathematical and empirical analyses to show that our techniques can successfully suppress the compression error. Lastly, we analyze the pipeline and opt to selectively compress the traffic lying on the critical path. This further helps reduce the compression error. We demonstrate our solution on a GPU cluster and achieve superior speedup over the baseline state-of-the-art solutions for distributed training without sacrificing the model quality.  ( 2 min )
    When does the student surpass the teacher? Federated Semi-supervised Learning with Teacher-Student EMA. (arXiv:2301.10114v1 [cs.LG])
    Semi-Supervised Learning (SSL) has received extensive attention in the domain of computer vision, leading to development of promising approaches such as FixMatch. In scenarios where training data is decentralized and resides on client devices, SSL must be integrated with privacy-aware training techniques such as Federated Learning. We consider the problem of federated image classification and study the performance and privacy challenges with existing federated SSL (FSSL) approaches. Firstly, we note that even state-of-the-art FSSL algorithms can trivially compromise client privacy and other real-world constraints such as client statelessness and communication cost. Secondly, we observe that it is challenging to integrate EMA (Exponential Moving Average) updates into the federated setting, which comes at a trade-off between performance and communication cost. We propose a novel approach FedSwitch, that improves privacy as well as generalization performance through Exponential Moving Average (EMA) updates. FedSwitch utilizes a federated semi-supervised teacher-student EMA framework with two features - local teacher adaptation and adaptive switching between teacher and student for pseudo-label generation. Our proposed approach outperforms the state-of-the-art on federated image classification, can be adapted to real-world constraints, and achieves good generalization performance with minimal communication cost overhead.  ( 2 min )
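    The EMA teacher update mentioned in the abstract above can be illustrated with a minimal sketch. This is not the FedSwitch implementation: the function name `ema_update`, the flat list-of-floats weight representation, and the `decay` value are assumptions chosen only for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Blend student weights into the teacher with an exponential moving average.

    Hypothetical helper: weights are plain lists of floats, and the new teacher
    is decay * teacher + (1 - decay) * student, element by element.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights: the teacher moves a small step towards the student.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)
```

    A decay close to 1 makes the teacher change slowly, which is the usual reason a teacher is used to produce pseudo-labels in teacher-student schemes.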
    Proceedings of the 1st International Workshop on Reading Music Systems. (arXiv:2301.10062v1 [cs.CV])
    The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 1st International Workshop on Reading Music Systems, held in Paris on the 20th of September 2018.  ( 2 min )
    Quantum Heavy-tailed Bandits. (arXiv:2301.09680v1 [cs.LG])
    In this paper, we study multi-armed bandits (MAB) and stochastic linear bandits (SLB) with heavy-tailed rewards and quantum reward oracle. Unlike the previous work on quantum bandits that assumes bounded/sub-Gaussian distributions for rewards, here we investigate the quantum bandits problem under a weaker assumption that the distributions of rewards only have bounded $(1+v)$-th moment for some $v\in (0,1]$. In order to achieve regret improvements for heavy-tailed bandits, we first propose a new quantum mean estimator for heavy-tailed distributions, which is based on the Quantum Monte Carlo Mean Estimator and achieves a quadratic improvement of estimation error compared to the classical one. Based on our quantum mean estimator, we focus on quantum heavy-tailed MAB and SLB and propose quantum algorithms based on the Upper Confidence Bound (UCB) framework for both problems with $\Tilde{O}(T^{\frac{1-v}{1+v}})$ regrets, polynomially improving the dependence in terms of $T$ as compared to classical (near) optimal regrets of $\Tilde{O}(T^{\frac{1}{1+v}})$, where $T$ is the number of rounds. Finally, experiments also support our theoretical results and show the effectiveness of our proposed methods.  ( 2 min )
    From Robots to Books: An Introduction to Smart Applications of AI in Education (AIEd). (arXiv:2301.10026v1 [cs.CY])
    The world around us has undergone a radical transformation due to rapid technological advancement in recent decades. Industry is evolving, and artificial intelligence is the next change in the making, popularly known as Industry 4.0. Indeed, experts predict that artificial intelligence (AI) will be the main force behind the next significant virtual shift in the way we stay, converse, study, live, communicate and conduct business. All facets of our social connection are being transformed by this growing technology. One of the newest areas of educational technology is Artificial Intelligence in Education (AIEd). This study emphasizes the different applications of artificial intelligence in education from both an industrial and academic standpoint. It highlights the most recent contextualized learning, novel transformative evaluations, and advancements in sophisticated tutoring systems. It analyses AIEd's ethical component and the influence of the transition on people, particularly students and instructors. Finally, this article touches on AIEd's potential future research and practices. The goal of this study is to introduce the present-day applications to its intended audience.  ( 2 min )
    FedPrompt: Communication-Efficient and Privacy Preserving Prompt Tuning in Federated Learning. (arXiv:2208.12268v3 [cs.LG] UPDATED)
    Federated learning (FL) has enabled global model training on decentralized data in a privacy-preserving way by aggregating model updates. However, for many natural language processing (NLP) tasks that utilize pre-trained language models (PLMs) with large numbers of parameters, there are considerable communication costs associated with FL. Recently, prompt tuning, which tunes some soft prompts without modifying PLMs, has achieved excellent performance as a new learning paradigm. Therefore we want to combine the two methods and explore the effect of prompt tuning under FL. In this paper, we propose "FedPrompt" to study prompt tuning in a model split aggregation way using FL, and show that split aggregation greatly reduces the communication cost, to only 0.01% of the PLMs' parameters, with little decrease in accuracy on both IID and Non-IID data distributions. This improves the efficiency of the FL method while also protecting data privacy in prompt tuning. In addition, like PLMs, prompts are uploaded and downloaded between public platforms and personal users, so we try to figure out whether there is still a backdoor threat using only soft prompts in FL scenarios. We further conduct backdoor attacks by data poisoning on FedPrompt. Our experiments show that a normal backdoor attack cannot achieve a high attack success rate, demonstrating the robustness of FedPrompt. We hope this work can promote the application of prompts in FL and raise awareness of the possible security threats.  ( 2 min )
    Efficient learning of large sets of locally optimal classification rules. (arXiv:2301.09936v1 [cs.LG])
    Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example in a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered for a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover this example. In our experiments on small to very large datasets, the approach's average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be processed in parallel without affecting the learned rule set and so the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.  ( 2 min )
    Model Agnostic Sample Reweighting for Out-of-Distribution Learning. (arXiv:2301.09819v1 [cs.LG])
    Distributionally robust optimization (DRO) and invariant risk minimization (IRM) are two popular methods proposed to improve out-of-distribution (OOD) generalization performance of machine learning models. While effective for small models, it has been observed that these methods can be vulnerable to overfitting with large overparameterized models. This work proposes a principled method, Model Agnostic samPLe rEweighting (MAPLE), to effectively address the OOD problem, especially in overparameterized scenarios. Our key idea is to find an effective reweighting of the training samples so that the standard empirical risk minimization training of a large model on the weighted training data leads to superior OOD generalization performance. The overfitting issue is addressed by considering a bilevel formulation to search for the sample reweighting, in which the generalization complexity depends on the search space of sample weights instead of the model size. We present theoretical analysis in the linear case to prove the insensitivity of MAPLE to model size, and empirically verify its superiority in surpassing state-of-the-art methods by a large margin. Code is available at https://github.com/x-zho14/MAPLE.  ( 2 min )
    Spectral Cross-Domain Neural Network with Soft-adaptive Threshold Spectral Enhancement. (arXiv:2301.10171v1 [cs.LG])
    Electrocardiography (ECG) signals can be considered as multi-variable time series. The state-of-the-art ECG data classification approaches, based on either feature engineering or deep learning techniques, treat the spectral and time domains separately in machine learning systems. No spectral-time domain communication mechanism inside the classifier model can be found in current approaches, leading to difficulties in identifying complex ECG forms. In this paper, we propose a novel deep learning model named Spectral Cross-domain neural network (SCDNN) with a new block called Soft-adaptive threshold spectral enhancement (SATSE), to simultaneously reveal the key information embedded in spectral and time domains inside the neural network. More precisely, the cross-domain information is captured by a general Convolutional neural network (CNN) backbone, and different information sources are merged by a self-adaptive mechanism to mine the connection between time and spectral domains. In SATSE, the knowledge from time and spectral domains is extracted via the Fast Fourier Transformation (FFT) with soft trainable thresholds in modified Sigmoid functions. The proposed SCDNN is tested with several classification tasks implemented on the public ECG databases PTB-XL and MIT-BIH. SCDNN outperforms the state-of-the-art approaches with a low computational cost regarding a variety of metrics in all classification tasks on both databases, by finding appropriate domains from the infinite spectral mapping. The convergence of the trainable thresholds in the spectral domain is also numerically investigated in this paper. The robust performance of SCDNN provides a new perspective to exploit knowledge across deep learning models from time and spectral domains. The repository can be found at https://github.com/DL-WG/SCDNN-TS  ( 2 min )
    Multi-view Kernel PCA for Time series Forecasting. (arXiv:2301.09811v1 [cs.LG])
    In this paper, we propose a kernel principal component analysis model for multi-variate time series forecasting, where the training and prediction schemes are derived from the multi-view formulation of Restricted Kernel Machines. The training problem is simply an eigenvalue decomposition of the summation of two kernel matrices corresponding to the views of the input and output data. When a linear kernel is used for the output view, it is shown that the forecasting equation takes the form of kernel ridge regression. When that kernel is non-linear, a pre-image problem has to be solved to forecast a point in the input space. We evaluate the model on several standard time series datasets, perform ablation studies, benchmark with closely related models and discuss its results.  ( 2 min )
    Quantification of Damage Using Indirect Structural Health Monitoring. (arXiv:2301.09791v1 [cs.LG])
    Structural health monitoring is important to make sure bridges do not fail. Since direct monitoring can be complicated and expensive, indirect methods have been a focus of research. Indirect monitoring can be much cheaper and easier to conduct; however, there are challenges with getting accurate results. This work focuses on damage quantification by using accelerometers. Tests were conducted on a model bridge and car with four accelerometers attached to the vehicle. Different weights were placed on the bridge to simulate different levels of damage, and 31 tests were run for 20 different damage levels. The acceleration data collected was normalized and a Fast Fourier Transform (FFT) was performed on that data. Both the normalized acceleration data and the normalized FFT data were input into a Non-Linear Principal Component Analysis (separately) and three principal components were extracted for each data set. Support Vector Regression (SVR) and Gaussian Process Regression (GPR) were used as the supervised machine learning methods to develop models. Multiple models were created so that the best one could be selected, and the models were compared by looking at their Mean Squared Errors (MSE). This methodology should be applied in the field to measure how effective it can be in real-world applications.  ( 2 min )
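    The normalise-then-transform preprocessing step described in the abstract above can be sketched as follows. The abstract does not say which normalisation was used, so the z-score here is an assumption, as are the helper names `zscore` and `dft_magnitudes` and the synthetic signal; a naive discrete Fourier transform stands in for an FFT library so the sketch stays dependency-free.

```python
import cmath
import math


def zscore(signal):
    """Assumed normalisation: subtract the mean, divide by the standard deviation."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((s - mean) ** 2 for s in signal) / len(signal))
    return [(s - mean) / std for s in signal]


def dft_magnitudes(signal):
    """Magnitudes of the non-negative frequency bins of a naive O(n^2) DFT."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


# Synthetic stand-in for one accelerometer channel: a sinusoid completing
# two cycles over 16 samples, plus a constant offset.
accel = [math.sin(2 * math.pi * 2 * i / 16) + 0.5 for i in range(16)]
spectrum = dft_magnitudes(zscore(accel))
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```

    Because the z-score removes the constant offset, the zero-frequency bin is close to zero and the largest magnitude sits at the bin matching the sinusoid's frequency.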
    Feature-based Image Matching for Identifying Individual Kākā. (arXiv:2301.06678v2 [cs.CV] UPDATED)
    This report investigates an unsupervised, feature-based image matching pipeline for the novel application of identifying individual kākā. Applied with a similarity network for clustering, this addresses a weakness of current supervised approaches to identifying individual birds, which struggle to handle the introduction of new individuals to the population. Our approach uses object localisation to locate kākā within images and then extracts local features that are invariant to rotation and scale. These features are matched between images with nearest neighbour matching techniques and mismatch removal to produce a similarity score for image match comparison. The results show that matches obtained via the image matching pipeline achieve high accuracy of true matches. We conclude that feature-based image matching could be used with a similarity network to provide a viable alternative to existing supervised approaches.  ( 2 min )
    Accurate Detection of Paroxysmal Atrial Fibrillation with Certified-GAN and Neural Architecture Search. (arXiv:2301.10173v1 [cs.LG])
    This paper presents a novel machine learning framework for detecting Paroxysmal Atrial Fibrillation (PxAF), a pathological characteristic of Electrocardiogram (ECG) that can lead to fatal conditions such as heart attack. To enhance the learning process, the framework involves a Generative Adversarial Network (GAN) along with a Neural Architecture Search (NAS) in the data preparation and classifier optimization phases. The GAN is innovatively invoked to overcome the class imbalance of the training data by producing the synthetic ECG for PxAF class in a certified manner. The effect of the certified GAN is statistically validated. Instead of using a general-purpose classifier, the NAS automatically designs a highly accurate convolutional neural network architecture customized for the PxAF classification task. Experimental results show that the accuracy of the proposed framework exhibits a high value of 99% which not only enhances state-of-the-art by up to 5.1%, but also improves the classification performance of the two widely-accepted baseline methods, ResNet-18, and Auto-Sklearn, by 2.2% and 6.1%.  ( 2 min )
    Dataset Bias in Human Activity Recognition. (arXiv:2301.10161v1 [eess.SP])
    When creating multi-channel time-series datasets for Human Activity Recognition (HAR), researchers are faced with the issue of subject selection criteria. It is unknown what physical characteristics and/or soft-biometrics, such as age, height, and weight, need to be taken into account to train a classifier to achieve robustness towards heterogeneous populations in the training and testing data. This contribution statistically curates the training data to assess to what degree the physical characteristics of humans influence HAR performance. We evaluate the performance of a state-of-the-art convolutional neural network on two HAR datasets that vary in the sensors, activities, and recording for time-series HAR. The training data is intentionally biased with respect to human characteristics to determine the features that impact motion behaviour. The evaluations reveal the impact of the subjects' characteristics on HAR, thus providing insights regarding the robustness of the classifier with respect to heterogeneous populations. The study is a step forward in the direction of fair and trustworthy artificial intelligence by attempting to quantify representation bias in multi-channel time series HAR data.  ( 2 min )
    Domain generalization in deep learning-based mass detection in mammography: A large-scale multi-center study. (arXiv:2201.11620v2 [eess.IV] CROSS LISTED)
    Computer-aided detection systems based on deep learning have shown great potential in breast cancer detection. However, the lack of domain generalization of artificial neural networks is an important obstacle to their deployment in changing clinical environments. In this work, we explore the domain generalization of deep learning methods for mass detection in digital mammography and analyze in-depth the sources of domain shift in a large-scale multi-center setting. To this end, we compare the performance of eight state-of-the-art detection methods, including Transformer-based models, trained in a single domain and tested in five unseen domains. Moreover, a single-source mass detection training pipeline is designed to improve the domain generalization without requiring images from the new domain. The results show that our workflow generalizes better than state-of-the-art transfer learning-based approaches in four out of five domains while reducing the domain shift caused by the different acquisition protocols and scanner manufacturers. Subsequently, an extensive analysis is performed to identify the covariate shifts with bigger effects on the detection performance, such as due to differences in patient age, breast density, mass size, and mass malignancy. Ultimately, this comprehensive study provides key insights and best practices for future research on domain generalization in deep learning-based breast cancer detection.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently the loss function and the constraint violation penalty evaluated at the chosen action point are revealed. For each slot, both the loss function and the function defining the constraint set are assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation are order-wise bounded by the {\it path-length}, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.
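    The path-length defined in the abstract above (the sum of the distances between consecutive optimal actions) can be computed with a short sketch. The function name `path_length`, the use of Euclidean distance, and the toy sequence are assumptions made for illustration; the paper may measure distance differently.

```python
import math


def path_length(optimal_actions):
    """Sum of Euclidean distances between consecutive optimal actions.

    `optimal_actions` is a hypothetical sequence of per-slot optimal points,
    each given as a tuple of coordinates.
    """
    return sum(
        math.dist(a, b) for a, b in zip(optimal_actions, optimal_actions[1:])
    )


# Toy sequence: the optimum moves once by a distance of 5, then stays put.
optima = [(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)]
total = path_length(optima)
```

    A sequence of optima that never moves has path-length zero, which is the regime in which a bound of this order is tightest.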
    LDMIC: Learning-based Distributed Multi-view Image Coding. (arXiv:2301.09799v1 [eess.IV])
    Multi-view image compression plays a critical role in 3D-related applications. Existing methods adopt a predictive coding architecture, which requires joint encoding to compress the corresponding disparity as well as residual information. This demands collaboration among cameras and enforces the epipolar geometric constraint between different views, which makes it challenging to deploy these methods in distributed camera systems with randomly overlapping fields of view. Meanwhile, distributed source coding theory indicates that efficient data compression of correlated sources can be achieved by independent encoding and joint decoding, which motivates us to design a learning-based distributed multi-view image coding (LDMIC) framework. With independent encoders, LDMIC introduces a simple yet effective joint context transfer module based on the cross-attention mechanism at the decoder to effectively capture the global inter-view correlations, which is insensitive to the geometric relationships between images. Experimental results show that LDMIC significantly outperforms both traditional and learning-based MIC methods while enjoying fast encoding speed. Code will be released at https://github.com/Xinjie-Q/LDMIC.  ( 2 min )
    Explainable Deep Reinforcement Learning: State of the Art and Challenges. (arXiv:2301.09937v1 [cs.LG])
    Interpretability, explainability and transparency are key issues for introducing Artificial Intelligence methods in many critical domains. This is important due to ethical concerns and trust issues strongly connected to reliability, robustness, auditability and fairness, and has important consequences for keeping the human in the loop at high levels of automation, especially in critical cases for decision making, where both (human and the machine) play important roles. While the research community has given much attention to explainability of closed (or black) prediction boxes, there are tremendous needs for explainability of closed-box methods that support agents to act autonomously in the real world. Reinforcement learning methods, and especially their deep versions, are such closed-box methods. In this article we aim to provide a review of state of the art methods for explainable deep reinforcement learning, taking also into account the needs of human operators - i.e., of those that take the actual and critical decisions in solving real-world problems. We provide a formal specification of the deep reinforcement learning explainability problems, and we identify the necessary components of a general explainable reinforcement learning framework. Based on these, we provide a comprehensive review of state of the art methods, categorizing them in classes according to the paradigm they follow, the interpretable models they use, and the surface representation of explanations provided. The article concludes by identifying open questions and important challenges.  ( 2 min )
    Predicting Socio-Economic Well-being Using Mobile Apps Data: A Case Study of France. (arXiv:2301.09986v1 [cs.CY])
    Socio-economic indicators provide context for assessing a country's overall condition. These indicators contain information about education, gender, poverty, employment, and other factors. Therefore, reliable and accurate information is critical for social research and government policing. Most data sources available today, such as censuses, have sparse population coverage or are updated infrequently. Nonetheless, alternative data sources, such as call data records (CDR) and mobile app usage, can serve as cost-effective and up-to-date sources for identifying socio-economic indicators. This work investigates mobile app data to predict socio-economic features. We present a large-scale study using data that captures the traffic of thousands of mobile applications by approximately 30 million users distributed over 550,000 square kilometres and served by over 25,000 base stations. The dataset covers the whole of France and spans more than 2.5 months, from 16th March 2019 to 6th June 2019. Using the app usage patterns, our best model can estimate socio-economic indicators (attaining an R-squared score of up to 0.66). Furthermore, using the models' explainability, we discover that mobile app usage patterns have the potential to reveal socio-economic disparities in IRIS. Insights from this study provide several avenues for future interventions, including users' temporal network analysis and exploration of alternative data sources.  ( 2 min )
    Investigating Labeler Bias in Face Annotation for Machine Learning. (arXiv:2301.09902v1 [cs.LG])
    In a world increasingly reliant on artificial intelligence, it is more important than ever to consider the ethical implications of artificial intelligence on humanity. One key under-explored challenge is labeler bias, which can create inherently biased datasets for training and subsequently lead to inaccurate or unfair decisions in healthcare, employment, education, and law enforcement. Hence, we conducted a study to investigate and measure the existence of labeler bias using images of people from different ethnicities and sexes in a labeling task. Our results show that participants possess stereotypes that influence their decision-making process and that labeler demographics impact assigned labels. We also discuss how labeler bias influences datasets and, subsequently, the models trained on them. Overall, a high degree of transparency must be maintained throughout the entire artificial intelligence training process to identify and correct biases in the data as early as possible.  ( 2 min )
    Membership Inference of Diffusion Models. (arXiv:2301.09956v1 [cs.CR])
    Recent years have witnessed the tremendous success of diffusion models in data synthesis. However, when diffusion models are applied to sensitive data, they also give rise to severe privacy concerns. In this paper, we systematically present the first study about membership inference attacks against diffusion models, which aims to infer whether a sample was used to train the model. Two attack methods are proposed, namely loss-based and likelihood-based attacks. Our attack methods are evaluated on several state-of-the-art diffusion models, over different datasets in relation to privacy-sensitive data. Extensive experimental evaluations show that our attacks can achieve remarkable performance. Furthermore, we exhaustively investigate various factors which can affect attack performance. Finally, we also evaluate the performance of our attack methods on diffusion models trained with differential privacy.  ( 2 min )
    Solving the Discretised Neutron Diffusion Equations using Neural Networks: Applications in neutron transport. (arXiv:2301.09991v1 [cs.CE])
    In this paper we solve the Boltzmann transport equation using AI libraries. This is attractive because it lets one use the highly optimised software within AI libraries, run on different computer architectures, and tap into the vast quantity of community-based software that has been developed for AI and ML applications, e.g. mixed arithmetic precision or model parallelism. Here we take the first steps towards developing this approach for the Boltzmann transport equation and develop the methods necessary to do that effectively. This includes: 1) A space-angle multigrid solution method that can extract the level of parallelism necessary to run efficiently on GPUs or new AI computers. 2) A new Convolutional Finite Element Method (ConvFEM) that greatly simplifies the implementation of high order finite elements (quadratic to quintic, say). 3) A new non-linear Petrov-Galerkin method that introduces dissipation anisotropically.  ( 2 min )
    Fair and skill-diverse student group formation via constrained k-way graph partitioning. (arXiv:2301.09984v1 [cs.LG])
    Forming the right combination of students in a group promises to enable a powerful and effective environment for learning and collaboration. However, defining a group of students is a complex task which has to satisfy multiple constraints. This work introduces an unsupervised algorithm for fair and skill-diverse student group formation. This is achieved by taking account of student course marks and sensitive attributes provided by the education office. The skill sets of students are determined using unsupervised dimensionality reduction of course mark data via the Laplacian eigenmap. The problem is formulated as a constrained graph partitioning problem, whereby the diversity of skill sets in each group is maximised, group sizes are upper and lower bounded according to available resources, and `balance' of a sensitive attribute is lower bounded to enforce fairness in group formation. This optimisation problem is solved using integer programming and its effectiveness is demonstrated on a dataset of student course marks from Imperial College London.  ( 2 min )
    The Backpropagation algorithm for a math student. (arXiv:2301.09977v1 [cs.LG])
    A Deep Neural Network (DNN) is a composite function of vector-valued functions, and in order to train a DNN, it is necessary to calculate the gradient of the loss function with respect to all parameters. This calculation can be a non-trivial task because the loss function of a DNN is a composition of several nonlinear functions, each with numerous parameters. The Backpropagation (BP) algorithm leverages the composite structure of the DNN to efficiently compute the gradient. As a result, the number of layers in the network does not significantly impact the complexity of the calculation. The objective of this paper is to express the gradient of the loss function in terms of a matrix multiplication using the Jacobian operator. This can be achieved by considering the total derivative of each layer with respect to its parameters and expressing it as a Jacobian matrix. The gradient can then be represented as the matrix product of these Jacobian matrices. This approach is valid because the chain rule can be applied to a composition of vector-valued functions, and the use of Jacobian matrices allows for the incorporation of multiple inputs and outputs. By providing concise mathematical justifications, the results can be made understandable and useful to a broad audience from various disciplines.  ( 2 min )
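    The chain-rule view in the abstract above can be made concrete with a small sketch that computes a gradient as a product of layer Jacobians. Everything here is an illustrative assumption rather than the paper's notation: a two-layer network y = W2 tanh(W1 x) with squared-error loss, the helper names, and the toy numbers.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def forward(W1, W2, x):
    """Hidden activations h = tanh(W1 x) and output y = W2 h."""
    h = [math.tanh(z) for z in matvec(W1, x)]
    return h, matvec(W2, h)


def loss(W1, W2, x, t):
    """Squared-error loss 0.5 * ||y - t||^2."""
    _, y = forward(W1, W2, x)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))


def grad_W1(W1, W2, x, t):
    """Gradient of the loss w.r.t. W1, built as a chain of Jacobian products."""
    h, y = forward(W1, W2, x)
    dL_dy = [yi - ti for yi, ti in zip(y, t)]  # Jacobian of the loss w.r.t. y
    # Jacobian of y w.r.t. h is W2, so dL/dh = dL/dy * W2.
    dL_dh = [sum(dL_dy[k] * W2[k][i] for k in range(len(W2))) for i in range(len(h))]
    # Jacobian of h w.r.t. the pre-activation z is diag(1 - h^2).
    dL_dz = [g * (1.0 - hi * hi) for g, hi in zip(dL_dh, h)]
    # z_i = sum_j W1[i][j] * x_j, so dz_i/dW1[i][j] = x_j.
    return [[dz * xj for xj in x] for dz in dL_dz]


# Toy parameters and data (arbitrary values).
W1 = [[0.1, -0.2], [0.3, 0.4]]
W2 = [[0.5, -0.6]]
x = [1.0, 2.0]
t = [0.25]
analytic = grad_W1(W1, W2, x, t)
```

    Comparing `analytic` against central finite differences of `loss` is a simple way to check that the Jacobian chain has been assembled in the right order.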
    Probabilistic Bilevel Coreset Selection. (arXiv:2301.09880v1 [cs.LG])
    The goal of coreset selection in supervised learning is to produce a weighted subset of data, so that training only on the subset achieves similar performance as training on the entire dataset. Existing methods achieved promising results in resource-constrained scenarios such as continual learning and streaming. However, most of the existing algorithms are limited to traditional machine learning models. A few algorithms that can handle large models adopt greedy search approaches due to the difficulty in solving the discrete subset selection problem, which is computationally costly when the coreset becomes larger and often produces suboptimal results. In this work, for the first time we propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample. The overall objective is posed as a bilevel optimization problem, where 1) the inner loop samples coresets and trains the model to convergence and 2) the outer loop updates the sample probability progressively according to the model's performance. Importantly, we develop an efficient solver to the bilevel optimization problem via unbiased policy gradient without the trouble of implicit differentiation. We provide the convergence property of our training procedure and demonstrate the superiority of our algorithm against various coreset selection methods in various tasks, especially in more challenging label-noise and class-imbalance scenarios.  ( 2 min )
    A two stages Deep Learning Architecture for Model Reduction of Parametric Time-Dependent Problems. (arXiv:2301.09926v1 [math.NA])
    Parametric time-dependent systems are of crucial importance in modeling real phenomena, which are often characterized by non-linear behaviors too. Their solutions are typically difficult to generalize over a sufficiently wide parameter space when only limited computational resources are available. As such, we present a general two-stage deep learning framework able to perform that generalization with low computational effort in time. It consists of the separate training of two pipelined predictive models. At first, a certain number of independent neural networks are trained with datasets taken from different subsets of the parameter space. Subsequently, a second predictive model is specialized to properly combine the first-stage guesses and compute the right predictions. Promising results are obtained by applying the framework to the incompressible Navier-Stokes equations in a cavity (Rayleigh-Bénard cavity), obtaining a 97% reduction in the computational time compared with its numerical resolution for a new value of the Grashof number.  ( 2 min )
    Neighborhood Homophily-Guided Graph Convolutional Network. (arXiv:2301.09851v1 [cs.LG])
    Graph neural networks (GNNs) have achieved remarkable advances in graph-oriented tasks. However, many real-world graphs contain heterophily or low homophily, challenging the homophily assumption of classical GNNs and resulting in low performance. Although many studies have emerged to improve the universality of GNNs, they rarely consider the label reuse and the correlation of their proposed metrics and models. In this paper, we first design a new metric, named Neighborhood Homophily (NH), to measure the label complexity or purity in the neighborhood of nodes. Furthermore, we incorporate this metric into the classical graph convolutional network (GCN) architecture and propose Neighborhood Homophily-Guided Graph Convolutional Network (NHGCN). In this framework, nodes are grouped by estimated NH values to achieve intra-group weight sharing during message propagation and aggregation. Then the generated node predictions are used to estimate and update new NH values. The two processes of metric estimation and model inference are alternately optimized to achieve better node classification. Extensive experiments on both homophilous and heterophilous benchmarks demonstrate that NHGCN achieves state-of-the-art overall performance on semi-supervised node classification for the universality problem.  ( 2 min )
    Learning To Dive In Branch And Bound. (arXiv:2301.09943v1 [cs.LG])
    Primal heuristics are important for solving mixed integer linear programs, because they find feasible solutions that facilitate branch and bound search. A prominent group of primal heuristics are diving heuristics. They iteratively modify and resolve linear programs to conduct a depth-first search from any node in the search tree. Existing divers rely on generic decision rules that fail to exploit structural commonality between similar problem instances that often arise in practice. Therefore, we propose L2Dive to learn specific diving heuristics with graph neural networks: We train generative models to predict variable assignments and leverage the duality of linear programs to make diving decisions based on the model's predictions. L2Dive is fully integrated into the open-source solver SCIP. We find that L2Dive outperforms standard divers to find better feasible solutions on a range of combinatorial optimization problems. For real-world applications from server load balancing and neural network verification, L2Dive improves the primal-dual integral by up to 7% (35%) on average over a tuned (default) solver baseline and reduces average solving time by 20% (29%).  ( 2 min )
    Same or Different? Diff-Vectors for Authorship Analysis. (arXiv:2301.09862v1 [cs.LG])
    We investigate the effects on authorship identification tasks of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In "classic" authorship analysis a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are from the same author or not. This latter (learner-independent) type of representation has been occasionally used before, but has never been studied systematically. We argue that it is advantageous, and that in some cases (e.g., authorship verification) it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (among which one that we here make available for the first time) show that feature vectors representing pairs of documents (which we here call Diff-Vectors, or DVs) bring about systematic improvements in the effectiveness of authorship identification tasks, and especially so when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while DVs are naturally geared for solving the first, we also provide two novel methods for solving the second and third that use a solver for the first as a building block.  ( 2 min )
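    The pair representation described in this abstract can be sketched in a few lines. The whitespace tokenisation, the fixed vocabulary, and the function names below are assumptions made for illustration; the abstract does not specify which features the authors use.

```python
from collections import Counter

def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary item in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

def diff_vector(doc_a, doc_b, vocabulary):
    """Feature vector for an unordered pair of documents: the absolute
    difference of the per-feature relative frequencies."""
    freq_a = relative_frequencies(doc_a.lower().split(), vocabulary)
    freq_b = relative_frequencies(doc_b.lower().split(), vocabulary)
    return [abs(a - b) for a, b in zip(freq_a, freq_b)]

# Hypothetical function-word vocabulary and toy documents.
vocabulary = ["the", "of"]
pair_features = diff_vector("the cat of the house", "the dog", vocabulary)
```

    Because the absolute difference is symmetric, the vector does not depend on the order of the two documents, which matches the "unordered pair" wording. A binary same-author label would be attached to each such vector before training.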
    Solving the Discretised Neutron Diffusion Equations using Neural Networks. (arXiv:2301.09939v1 [cs.CE])
    This paper presents a new approach which uses the tools within Artificial Intelligence (AI) software libraries as an alternative way of solving partial differential equations (PDEs) that have been discretised using standard numerical methods. In particular, we describe how to represent numerical discretisations arising from the finite volume and finite element methods by pre-determining the weights of convolutional layers within a neural network. As the weights are defined by the discretisation scheme, no training of the network is required and the solutions obtained are identical (accounting for solver tolerances) to those obtained with standard codes often written in Fortran or C++. We also explain how to implement the Jacobi method and a multigrid solver using the functions available in AI libraries. For the latter, we use a U-Net architecture which is able to represent a sawtooth multigrid method. A benefit of using AI libraries in this way is that one can exploit their power and their built-in technologies. For example, their executions are already optimised for different computer architectures, whether it be CPUs, GPUs or new-generation AI processors. In this article, we apply the proposed approach to eigenvalue problems in reactor physics where neutron transport is described by diffusion theory. For a fuel assembly benchmark, we demonstrate that the solution obtained from our new approach is the same (accounting for solver tolerances) as that obtained from the same discretisation coded in a standard way using Fortran. We then proceed to solve a reactor core benchmark using the new approach.  ( 2 min )
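    The idea of fixing convolution weights from a discretisation, rather than training them, can be illustrated on a much simpler problem than the neutron diffusion eigenvalue problem treated in the paper. The sketch below is an assumption-laden stand-in: it applies Jacobi iteration to the 2D Laplace equation with Dirichlet boundary values, using plain Python loops in place of an AI library's convolution layer. The 3x3 kernel comes directly from the five-point stencil.

```python
# Five-point-stencil Jacobi update written as a 3x3 kernel with fixed weights.
# The kernel is symmetric, so convolution and cross-correlation coincide.
KERNEL = [[0.0, 0.25, 0.0],
          [0.25, 0.0, 0.25],
          [0.0, 0.25, 0.0]]

def conv3x3_interior(u, kernel):
    """Apply the kernel to interior points; boundary values are kept fixed."""
    rows, cols = len(u), len(u[0])
    out = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = sum(kernel[a][b] * u[i + a - 1][j + b - 1]
                            for a in range(3) for b in range(3))
    return out

def jacobi_solve(u, iterations):
    """Repeated application of the fixed-weight 'layer' is Jacobi iteration."""
    for _ in range(iterations):
        u = conv3x3_interior(u, KERNEL)
    return u

def make_grid(n):
    """Hypothetical test problem: boundary values u = i + j, interior zero."""
    return [[float(i + j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
             for j in range(n)] for i in range(n)]

solution = jacobi_solve(make_grid(6), 300)
```

    Since u = i + j is harmonic and reproduced exactly by the five-point stencil, the interior of `solution` should approach i + j. In an AI library the same kernel would be loaded as the non-trainable weights of a convolution layer.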
    A Stability Analysis of Fine-Tuning a Pre-Trained Model. (arXiv:2301.09820v1 [cs.LG])
    Fine-tuning a pre-trained model (such as BERT, ALBERT, RoBERTa, T5, GPT, etc.) has proven to be one of the most promising paradigms in recent NLP research. However, numerous recent works indicate that fine-tuning suffers from the instability problem, i.e., tuning the same model under the same setting results in significantly different performance. Many recent works have proposed different methods to solve this problem, but there is no theoretical understanding of why and how these methods work. In this paper, we propose a novel theoretical stability analysis of fine-tuning that focuses on two commonly used settings, namely, full fine-tuning and head tuning. We define the stability under each setting and prove the corresponding stability bounds. The theoretical bounds explain why and how several existing methods can stabilize the fine-tuning procedure. In addition to being able to explain most of the observed empirical discoveries, our proposed theoretical analysis framework can also help in the design of effective and provable methods. Based on our theory, we propose three novel strategies to stabilize the fine-tuning procedure, namely, Maximal Margin Regularizer (MMR), Multi-Head Loss (MHLoss), and Self Unsupervised Re-Training (SURT). We extensively evaluate our proposed approaches on 11 widely used real-world benchmark datasets, as well as hundreds of synthetic classification datasets. The experiment results show that our proposed methods significantly stabilize the fine-tuning procedure and also corroborate our theoretical analysis.  ( 2 min )
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where only the normalization constant is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Noisy Parallel Data Alignment. (arXiv:2301.09685v1 [cs.CL])
    An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable for processing endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model by up to 59.6%.  ( 2 min )
    Data Augmentation Alone Can Improve Adversarial Training. (arXiv:2301.09879v1 [cs.CV])
    Adversarial training suffers from the issue of robust overfitting, which seriously impairs its generalization performance. Data augmentation, which is effective at preventing overfitting in standard training, has been observed by many previous works to be ineffective in mitigating overfitting in adversarial training. This work proves that, contrary to previous findings, data augmentation alone can significantly boost accuracy and robustness in adversarial training. We find that the hardness and the diversity of data augmentation are important factors in combating robust overfitting. In general, diversity can improve both accuracy and robustness, while hardness can boost robustness at the cost of accuracy within a certain limit and degrade them both over that limit. To mitigate robust overfitting, we first propose a new crop transformation, Cropshift, which has improved diversity compared to the conventional one (Padcrop). We then propose a new data augmentation scheme, based on Cropshift, with much improved diversity and well-balanced hardness. Empirically, our augmentation method achieves the state-of-the-art accuracy and robustness for data augmentations in adversarial training. Furthermore, when combined with weight averaging it matches, or even exceeds, the performance of the best contemporary regularization methods for alleviating robust overfitting. Code is available at: https://github.com/TreeLLi/DA-Alone-Improves-AT.  ( 2 min )
    Topological Structure is Predictive of Deep Neural Network Success in Learning. (arXiv:2301.09734v1 [cs.LG])
    Machine learning has become a fundamental tool in modern science, yet its limitations are still not fully understood. Using a simple children's game, we show that the topological structure of the underlying training data can have a dramatic effect on the ability of a deep neural network (DNN) classifier to learn to classify data. We then take insights obtained from this toy model and apply them to two physical data sets (one from particle physics and one from acoustics), which are known to be amenable to classification by DNNs. We show that the simplicity in their topological structure explains the majority of the DNN's ability to operate on these data sets by showing that fully interpretable topological classifiers are able to perform nearly as well as their DNN counterparts.  ( 2 min )
    Slice-and-Forge: Making Better Use of Caches for Graph Convolutional Network Accelerators. (arXiv:2301.09813v1 [cs.LG])
    Graph convolutional networks (GCNs) are becoming increasingly popular as they can process a wide variety of data formats that prior deep neural networks cannot easily support. One key challenge in designing hardware accelerators for GCNs is the vast size and randomness in their data access patterns which greatly reduces the effectiveness of the limited on-chip cache. Aimed at improving the effectiveness of the cache by mitigating the irregular data accesses, prior studies often employ the vertex tiling techniques used in traditional graph processing applications. While being effective at enhancing the cache efficiency, those approaches are often sensitive to the tiling configurations where the optimal setting heavily depends on target input datasets. Furthermore, the existing solutions require manual tuning through trial-and-error or rely on sub-optimal analytical models. In this paper, we propose Slice-and-Forge (SnF), an efficient hardware accelerator for GCNs which greatly improves the effectiveness of the limited on-chip cache. SnF chooses a tiling strategy named feature slicing that splits the features into vertical slices and processes them in the outermost loop of the execution. This particular choice results in a repetition of the identical computational patterns over irregular graph data over multiple rounds. Taking advantage of such repetitions, SnF dynamically tunes its tile size. Our experimental results reveal that SnF can achieve 1.73x higher performance in geomean compared to prior work on multi-engine settings, and 1.46x higher performance in geomean on small scale settings, without the need for off-line analyses.  ( 2 min )
    Gossiped and Quantized Online Multi-Kernel Learning. (arXiv:2301.09848v1 [cs.LG])
    In instances of online kernel learning where little prior information is available and centralized learning is infeasible, past research has shown that distributed and online multi-kernel learning provides sub-linear regret as long as every pair of nodes in the network can communicate (i.e., the communications network is a complete graph). In addition, to manage the communication load, which is often a performance bottleneck, communications between nodes can be quantized. This letter extends these results to non-fully connected graphs, which are often encountered in wireless sensor networks. To address this challenge, we propose a gossip algorithm and provide a proof that it achieves sub-linear regret. Experiments with real datasets confirm our findings.  ( 2 min )
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data-rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable of identifying adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism are used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA constituting components.  ( 2 min )
    Backdoor Attacks in Peer-to-Peer Federated Learning. (arXiv:2301.09732v1 [cs.LG])
    We study backdoor attacks in peer-to-peer federated learning systems on different graph topologies and datasets. We show that only 5% attacker nodes are sufficient to perform a backdoor attack with 42% attack success without decreasing the accuracy on clean data by more than 2%. We also demonstrate that the attack can be amplified by the attacker crashing a small number of nodes. We evaluate defenses proposed in the context of centralized federated learning and show they are ineffective in peer-to-peer settings. Finally, we propose a defense that mitigates the attacks by applying different clipping norms to the model updates received from peers and local model trained by a node.  ( 2 min )
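    The abstract states only that the proposed defense applies different clipping norms to peer updates and to the locally trained model. The sketch below shows one way such a rule could look; the L2 clipping, the plain averaging step, the threshold values, and the function names are all assumptions for illustration and should not be read as the authors' method.

```python
import math

def clip_by_norm(vector, max_norm):
    """Scale a vector down so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0 or norm <= max_norm:
        return list(vector)
    scale = max_norm / norm
    return [v * scale for v in vector]

def aggregate(local_update, peer_updates, local_clip, peer_clip):
    """Clip the local update and the peer updates with separate thresholds,
    then average them (the averaging step is an assumed aggregation rule)."""
    clipped = [clip_by_norm(local_update, local_clip)]
    clipped += [clip_by_norm(update, peer_clip) for update in peer_updates]
    dim = len(local_update)
    return [sum(update[i] for update in clipped) / len(clipped)
            for i in range(dim)]

# Hypothetical usage: a tighter bound on peers limits the influence of a
# single malicious neighbour that sends an oversized update.
new_update = aggregate([3.0, 4.0], [[30.0, 40.0]], local_clip=10.0, peer_clip=5.0)
```

    With a stricter threshold for peers than for the local model, an oversized update from one neighbour is shrunk before it can dominate the average.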
    Truveta Mapper: A Zero-shot Ontology Alignment Framework. (arXiv:2301.09767v1 [cs.LG])
    In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM) or Ontology Alignment (OA) by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationship between different ontologies via transfer-learning without requiring any explicit cross-ontology manually labeled data. This also enables the formulated framework to outperform existing solutions for both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpus and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches, Edit-Similarity, LogMap, AML, BERTMap, and the recently presented new OM frameworks in Ontology Alignment Evaluation Initiative (OAEI22), offers log-linear complexity in contrast to quadratic in the existing end-to-end methods, and overall makes the OM task efficient and more straightforward without much post-processing involving mapping extension or mapping repair.  ( 2 min )
    Constrained Reinforcement Learning for Dexterous Manipulation. (arXiv:2301.09766v1 [cs.RO])
    Existing learning approaches to dexterous manipulation use demonstrations or interactions with the environment to train black-box neural networks that provide little control over how the robot learns the skills or how it would perform post training. These approaches pose significant challenges when implemented on physical platforms given that, during initial stages of training, the robot's behavior could be erratic and potentially harmful to its own hardware, the environment, or any humans in the vicinity. A potential way to address these limitations is to add constraints during learning that restrict and guide the robot's behavior during training as well as roll outs. Inspired by the success of constrained approaches in other domains, we investigate the effects of adding position-based constraints to a 24-DOF robot hand learning to perform object relocation using Constrained Policy Optimization. We find that a simple geometric constraint can ensure the robot learns to move towards the object sooner than without constraints. Further, training with this constraint requires a similar number of samples as its unconstrained counterpart to master the skill. These findings shed light on how simple constraints can help robots achieve sensible and safe behavior quickly and ease concerns surrounding hardware deployment. We also investigate the effects of the strictness of these constraints and report findings that provide insights into how different degrees of strictness affect learning outcomes. Our code is available at https://github.com/GT-STAR-Lab/constrained-rl-dexterous-manipulation.  ( 2 min )
    DODEM: DOuble DEfense Mechanism Against Adversarial Attacks Towards Secure Industrial Internet of Things Analytics. (arXiv:2301.09740v1 [cs.CR])
    Industrial Internet of Things (I-IoT) is a collaboration of devices, sensors, and networking equipment to monitor and collect data from industrial operations. Machine learning (ML) methods use this data to make high-level decisions with minimal human intervention. Data-driven predictive maintenance (PDM) is a crucial ML-based I-IoT application to find an optimal maintenance schedule for industrial assets. The performance of these ML methods can seriously be threatened by adversarial attacks where an adversary crafts perturbed data and sends it to the ML model to deteriorate its prediction performance. The models should be able to stay robust against these attacks where robustness is measured by how much perturbation in input data affects model performance. Hence, there is a need for effective defense mechanisms that can protect these models against adversarial attacks. In this work, we propose a double defense mechanism to detect and mitigate adversarial attacks in I-IoT environments. We first detect if there is an adversarial attack on a given sample using novelty detection algorithms. Then, based on the outcome of our algorithm, marking an instance as attack or normal, we select adversarial retraining or standard training to provide a secondary defense layer. If there is an attack, adversarial retraining provides a more robust model, while we apply standard training for regular samples. Since we may not know if an attack will take place, our adaptive mechanism allows us to consider irregular changes in data. The results show that our double defense strategy is highly efficient where we can improve model robustness by up to 64.6% and 52% compared to standard and adversarial retraining, respectively.  ( 2 min )
    Long-term stable Electromyography classification using Canonical Correlation Analysis. (arXiv:2301.09729v1 [cs.LG])
    Discrimination of hand gestures based on the decoding of surface electromyography (sEMG) signals is a well-established approach for controlling prosthetic devices and for Human-Machine Interfaces (HMI). However, despite the promising results achieved by this approach in well-controlled experimental conditions, its deployment in long-term real-world application scenarios is still hindered by several challenges. One of the most critical challenges is maintaining high EMG data classification performance across multiple days without retraining the decoding system. The drop in performance is mostly due to the high EMG variability caused by electrode shift, muscle artifacts, fatigue, user adaptation, or skin-electrode interfacing issues. Here we propose a novel statistical method based on canonical correlation analysis (CCA) that stabilizes EMG classification performance across multiple days for long-term control of prosthetic devices. We show how CCA can dramatically decrease the performance drop of standard classifiers observed across days, by maximizing the correlation among multiple-day acquisition data sets. Our results show how the performance of a classifier trained on EMG data acquired only on the first day of the experiment maintains 90% relative accuracy across multiple days, compensating for the EMG data variability that occurs over long-term periods, using the CCA transformation on data obtained from a small number of gestures. This approach eliminates the need for large data sets and multiple or periodic training sessions, which currently hamper the usability of conventional pattern recognition based approaches.  ( 2 min )
    Earthquake Magnitude and b value prediction model using Extreme Learning Machine. (arXiv:2301.09756v1 [physics.geo-ph])
    Earthquake prediction has been a challenging research area for many decades, where the future occurrence of this highly uncertain calamity is predicted. In this paper, several parametric and non-parametric features were calculated, where the non-parametric features were calculated using the parametric features. Eight seismic features were calculated using the Gutenberg-Richter law, the total recurrence, and the seismic energy release. Additionally, criteria such as Maximum Relevance and Maximum Redundancy were applied to choose the pertinent features. These features along with others were used as input for an Extreme Learning Machine (ELM) Regression Model. Magnitude and time data of five decades from the Assam-Guwahati region were used to create this model for magnitude prediction. The Testing Accuracy and Testing Speed were computed taking the Root Mean Squared Error (RMSE) as the parameter for evaluating the model. As confirmed by the results, ELM shows better scalability with much faster training and testing speed (up to a thousand times faster) than traditional Support Vector Machines. The testing RMSE came out to be around 0.097. To further test the model's robustness, magnitude-time data from California were used to calculate the seismic indicators, which were then fed into an ELM and tested on the Assam-Guwahati region. The model proves to be robust and can be implemented in early warning systems as it continues to be a major part of Disaster Response and management.  ( 2 min )
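    For readers unfamiliar with the b value, the Gutenberg-Richter law relates the number N of events at or above magnitude M by log10 N = a - b M. The abstract does not say how the authors estimate b; the sketch below uses Aki's maximum-likelihood estimator, a common choice, purely as an assumed illustration. The function name and the completeness-magnitude parameter are hypothetical, and the magnitude-binning correction is omitted.

```python
import math

def b_value_mle(magnitudes, completeness_magnitude):
    """Aki maximum-likelihood estimate of the Gutenberg-Richter b value:
    b = log10(e) / (mean(M) - Mc), using only events with M >= Mc."""
    usable = [m for m in magnitudes if m >= completeness_magnitude]
    if not usable:
        raise ValueError("no events at or above the completeness magnitude")
    mean_excess = sum(usable) / len(usable) - completeness_magnitude
    if mean_excess <= 0.0:
        raise ValueError("mean magnitude must exceed the completeness magnitude")
    return math.log10(math.e) / mean_excess

# Hypothetical catalogue: events below Mc = 4.0 are ignored by the estimator.
b = b_value_mle([4.1, 4.3, 3.2, 5.0, 4.6, 4.2], completeness_magnitude=4.0)
```

    A larger b means small events dominate the catalogue more strongly; in a feature pipeline like the one described, this estimate would be recomputed over sliding windows of events.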
    Two-Stage Learning For the Flexible Job Shop Scheduling Problem. (arXiv:2301.09703v1 [cs.AI])
    The Flexible Job-shop Scheduling Problem (FJSP) is an important combinatorial optimization problem that arises in manufacturing and service settings. FJSP is composed of two subproblems, an assignment problem that assigns tasks to machines, and a scheduling problem that determines the starting times of tasks on their chosen machines. Solving FJSP instances of realistic size and composition is an ongoing challenge even under simplified, deterministic assumptions. Motivated by the inevitable randomness and uncertainties in supply chains, manufacturing, and service operations, this paper investigates the potential of using a deep learning framework to generate fast and accurate approximations for FJSP. In particular, this paper proposes a two-stage learning framework, 2SL-FJSP, that explicitly models the hierarchical nature of FJSP decisions, uses a confidence-aware branching scheme to generate appropriate instances for the scheduling stage from the assignment predictions, and leverages a novel symmetry-breaking formulation to improve learnability. 2SL-FJSP is evaluated on instances from the FJSP benchmark library. Results show that 2SL-FJSP can generate high-quality solutions in milliseconds, outperforming a state-of-the-art reinforcement learning approach recently proposed in the literature, and other heuristics commonly used in practice.  ( 2 min )
    Implementation of the Critical Wave Groups Method with Computational Fluid Dynamics and Neural Networks. (arXiv:2301.09834v1 [physics.flu-dyn])
    Accurate and efficient prediction of extreme ship responses continues to be a challenging problem in ship hydrodynamics. Probabilistic frameworks in conjunction with computationally efficient numerical hydrodynamic tools have been developed that allow researchers and designers to better understand extremes. However, the ability of these hydrodynamic tools to represent the physics quantitatively during extreme events is limited. Previous research successfully implemented the critical wave groups (CWG) probabilistic method with computational fluid dynamics (CFD). Although the CWG method allows for less simulation time than a Monte Carlo approach, the large quantity of simulations required is cost prohibitive. The objective of the present paper is to reduce the computational cost of implementing CWG with CFD, through the construction of long short-term memory (LSTM) neural networks. After training the models with a limited quantity of simulations, the models can provide a larger quantity of predictions to calculate the probability. The new framework is demonstrated with a 2-D midship section of the Office of Naval Research Tumblehome (ONRT) hull in Sea State 7 and beam seas at zero speed. The new framework is able to produce predictions that are representative of a purely CFD-driven CWG framework, with two orders of magnitude of computational cost savings.  ( 2 min )
    Long-tail Detection with Effective Class-Margins. (arXiv:2301.09724v1 [cs.CV])
    Large-scale object detection and instance segmentation face a severe data imbalance. The finer-grained object classes become, the less frequently they appear in our datasets. However, at test-time, we expect a detector that performs well for all classes and not just the most frequent ones. In this paper, we provide a theoretical understanding of the long-tail detection problem. We show how the commonly used mean average precision evaluation metric on an unknown test set is bounded by a margin-based binary classification error on a long-tailed object detection training set. We optimize margin-based binary classification error with a novel surrogate objective called Effective Class-Margin Loss (ECM). The ECM loss is simple, theoretically well-motivated, and outperforms other heuristic counterparts on the LVIS v1 benchmark over a wide range of architectures and detectors. Code is available at https://github.com/janghyuncho/ECM-Loss.  ( 2 min )
    Graph Neural Networks for Decentralized Multi-Agent Perimeter Defense. (arXiv:2301.09689v1 [cs.MA])
    In this work, we study the problem of decentralized multi-agent perimeter defense that asks for computing actions for defenders with local perceptions and communications to maximize the capture of intruders. One major challenge for practical implementations is to make perimeter defense strategies scalable for large-scale problem instances. To this end, we leverage graph neural networks (GNNs) to develop an imitation learning framework that learns a mapping from defenders' local perceptions and their communication graph to their actions. The proposed GNN-based learning network is trained by imitating a centralized expert algorithm such that the learned actions are close to those generated by the expert algorithm. We demonstrate that our proposed network performs closer to the expert algorithm and is superior to other baseline algorithms by capturing more intruders. Our GNN-based network is trained at a small scale and can be generalized to large-scale cases. We run perimeter defense games in scenarios with different team sizes and configurations to demonstrate the performance of the learned network.  ( 2 min )
    Illumination Variation Correction Using Image Synthesis For Unsupervised Domain Adaptive Person Re-Identification. (arXiv:2301.09702v1 [eess.IV])
    Unsupervised domain adaptive (UDA) person re-identification (re-ID) aims to learn identity information from labeled images in source domains and apply it to unlabeled images in a target domain. One major issue with many unsupervised re-identification methods is that they do not perform well relative to large domain variations such as illumination, viewpoint, and occlusions. In this paper, we propose a Synthesis Model Bank (SMB) to deal with illumination variation in unsupervised person re-ID. The proposed SMB consists of several convolutional neural networks (CNN) for feature extraction and Mahalanobis matrices for distance metrics. They are trained using synthetic data with different illumination conditions such that their synergistic effect makes the SMB robust against illumination variation. To better quantify the illumination intensity and improve the quality of synthetic images, we introduce a new 3D virtual-human dataset for GAN-based image synthesis. From our experiments, the proposed SMB outperforms other synthesis methods on several re-ID benchmarks.  ( 2 min )
    PRIMEQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development. (arXiv:2301.09715v1 [cs.CL])
    The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PRIMEQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PRIMEQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PRIMEQA is available at: https://github.com/primeqa.  ( 2 min )
    On The Convergence Of Policy Iteration-Based Reinforcement Learning With Monte Carlo Policy Evaluation. (arXiv:2301.09709v1 [cs.LG])
    A common technique in reinforcement learning is to evaluate the value function from Monte Carlo simulations of a given policy, and use the estimated value function to obtain a new policy which is greedy with respect to the estimated value function. A well-known longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected from a single sample path obtained from implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to the open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results both for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy within a function approximation error.  ( 2 min )
    Weakly-Supervised Questions for Zero-Shot Relation Extraction. (arXiv:2301.09640v1 [cs.CL])
    Zero-Shot Relation Extraction (ZRE) is the task of Relation Extraction where the training and test sets have no shared relation types. This very challenging domain is a good test of a model's ability to generalize. Previous approaches to ZRE reframed relation extraction as Question Answering (QA), allowing for the use of pre-trained QA models. However, this method required manually creating gold question templates for each new relation. Here, we do away with these gold templates and instead learn a model that can generate questions for unseen relations. Our technique can successfully translate relation descriptions into relevant questions, which are then leveraged to generate the correct tail entity. On tail entity extraction, we outperform the previous state-of-the-art by more than 16 F1 points without using gold question templates. On the RE-QA dataset where no previous baseline for relation extraction exists, our proposed algorithm comes within 0.7 F1 points of a system that uses gold question templates. Our model also outperforms the state-of-the-art ZRE baselines on the FewRel and WikiZSL datasets, showing that QA models no longer need template questions to match the performance of models specifically tailored to the ZRE task. Our implementation is available at https://github.com/fyshelab/QA-ZRE.  ( 2 min )
    Selective Explanations: Leveraging Human Input to Align Explainable AI. (arXiv:2301.09656v1 [cs.AI])
    While a vast collection of explainable AI (XAI) algorithms have been developed in recent years, they are often criticized for significant gaps with how humans produce and consume explanations. As a result, current XAI techniques are often found to be hard to use and lack effectiveness. In this work, we attempt to close these gaps by making AI explanations selective -- a fundamental property of human explanations -- by selectively presenting a subset from a large set of model reasons based on what aligns with the recipient's preferences. We propose a general framework for generating selective explanations by leveraging human input on a small sample. This framework opens up a rich design space that accounts for different selectivity goals, types of input, and more. As a showcase, we use a decision-support task to explore selective explanations based on what the decision-maker would consider relevant to the decision task. We conducted two experimental studies to examine three out of a broader possible set of paradigms based on our proposed framework: in Study 1, we ask the participants to provide their own input to generate selective explanations, with either open-ended or critique-based input. In Study 2, we show participants selective explanations based on input from a panel of similar users (annotators). Our experiments demonstrate the promise of selective explanations in reducing over-reliance on AI and improving decision outcomes and subjective perceptions of the AI, but also paint a nuanced picture that attributes some of these positive effects to the opportunity to provide one's own input to augment AI explanations. Overall, our work proposes a novel XAI framework inspired by human communication behaviors and demonstrates its potential to encourage future work to better align AI explanations with human production and consumption of explanations.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints. (arXiv:2301.09642v1 [q-bio.QM])
    Have you ever been troubled by the complexity and computational cost of SE(3) protein structure modeling and been amazed by the simplicity and power of language modeling? Recent work has shown promise in simplifying protein structures as sequences of protein angles; therefore, language models could be used for unconstrained protein backbone generation. Unfortunately, such simplification is unsuitable for the constrained protein inpainting problem, where the model needs to recover masked structures conditioned on unmasked ones, as it dramatically increases the computing cost of geometric constraints. To overcome this dilemma, we suggest inserting a hidden atomic direction space (ADS) on top of the language model through a Seq2Direct encoder ($\text{Enc}_{s2d}$), which converts invariant backbone angles into equivalent direction vectors while preserving the simplicity. Geometric constraints could be efficiently imposed on the newly introduced direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical guarantees is also introduced to develop an SDS ($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the denoising neural network during the conditional diffusion process, resulting in a constrained generative model, DiffSDS. Extensive experiments show that the plug-and-play ADS could transform the language model into a strong structural model without loss of simplicity. More importantly, the proposed DiffSDS outperforms previous strong baselines by a large margin on the task of protein inpainting.  ( 2 min )
    Incorporating functional summary information in Bayesian neural networks using a Dirichlet process likelihood approach. (arXiv:2207.01234v2 [cs.LG] UPDATED)
    Bayesian neural networks (BNNs) can account for both aleatoric and epistemic uncertainty. However, in BNNs the priors are often specified over the weights, which rarely reflects true prior knowledge in large and complex neural network architectures. We present a simple approach to incorporate prior knowledge in BNNs based on external summary information about the predicted classification probabilities for a given dataset. The available summary information is incorporated as augmented data and modeled with a Dirichlet process, and we derive the corresponding Summary Evidence Lower BOund. The approach is founded on Bayesian principles, and all hyperparameters have a proper probabilistic interpretation. We show how the method can inform the model about task difficulty and class imbalance. Extensive experiments show that, with negligible computational overhead, our method parallels and in many cases outperforms popular alternatives in accuracy, uncertainty calibration, and robustness against corruptions with both balanced and imbalanced data.  ( 2 min )
    Multiway Spherical Clustering via Degree-Corrected Tensor Block Models. (arXiv:2201.07401v2 [math.ST] UPDATED)
    We consider the problem of multiway clustering in the presence of unknown degree heterogeneity. Such data problems arise commonly in applications such as recommendation systems, neuroimaging, community detection, and hypergraph partitions in social networks. The allowance of degree heterogeneity provides great flexibility in clustering models, but the extra complexity poses significant challenges in both statistics and computation. Here, we develop a degree-corrected tensor block model with estimation accuracy guarantees. We present the phase transition of clustering performance based on the notion of angle separability, and we characterize three signal-to-noise regimes corresponding to different statistical-computational behaviors. In particular, we demonstrate that an intrinsic statistical-to-computational gap emerges only for tensors of order three or greater. Further, we develop an efficient polynomial-time algorithm that provably achieves exact clustering under mild signal conditions. The efficacy of our procedure is demonstrated through two data applications, one on the human brain connectome project and another on the Peru Legislation network dataset.  ( 2 min )
    Proportional Fairness in Federated Learning. (arXiv:2202.01666v3 [cs.LG] UPDATED)
    With the increasingly broad deployment of federated learning (FL) systems in the real world, it is critical but challenging to ensure fairness in FL, i.e., reasonably satisfactory performance for each of the numerous diverse clients. In this work, we introduce and study a new fairness notion in FL, called proportional fairness (PF), which is based on the relative change of each client's performance. From its connection with bargaining games, we propose PropFair, a novel and easy-to-implement algorithm for finding proportionally fair solutions in FL, and study its convergence properties. Through extensive experiments on vision and language datasets, we demonstrate that PropFair can approximately find PF solutions, and it achieves a good balance between the average performance of all clients and that of the worst 10% of clients.  ( 2 min )
    Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles. (arXiv:2206.02088v2 [stat.ML] UPDATED)
    To promote new scientific discoveries from complex data sets, feature importance inference has been a long-standing statistical problem. Instead of testing for parameters that are only interpretable for specific models, there has been increasing interest in model-agnostic methods, often in the form of feature occlusion or leave-one-covariate-out (LOCO) inference. Existing approaches often make distributional assumptions, which can be difficult to verify in practice, or require model refitting and data splitting, which are computationally intensive and lead to losses in power. In this work, we develop a novel, mostly model-agnostic and distribution-free inference framework for feature importance that is computationally efficient and statistically powerful. Our approach is fast as we avoid model refitting by leveraging a form of random observation and feature subsampling called minipatch ensembles; this approach also improves statistical power by avoiding data splitting. Our framework can be applied on tabular data and with any machine learning algorithm, together with minipatch ensembles, for regression and classification tasks. Despite the dependencies induced by using minipatch ensembles, we show that our approach provides asymptotic coverage for the feature importance score of any model under mild assumptions. Finally, our same procedure can also be leveraged to provide valid confidence intervals for predictions, hence providing fast, simultaneous quantification of the uncertainty of both predictions and feature importance. We validate our intervals on a series of synthetic and real data examples, including non-linear settings, showing that our approach detects the correct important features and exhibits many computational and statistical advantages over existing methods.  ( 2 min )
    A Wholistic View of Continual Learning with Deep Neural Networks: Forgotten Lessons and the Bridge to Active and Open World Learning. (arXiv:2009.01797v3 [cs.LG] UPDATED)
    Current deep learning methods are regarded as favorable if they empirically perform well on dedicated test sets. This mentality is seamlessly reflected in the resurfacing area of continual learning, where consecutively arriving data is investigated. The core challenge is framed as protecting previously acquired representations from being catastrophically forgotten. However, comparison of individual methods is nevertheless performed in isolation from the real world by monitoring accumulated benchmark test set performance. The closed world assumption remains predominant, i.e. models are evaluated on data that is guaranteed to originate from the same distribution as used for training. This poses a massive challenge as neural networks are well known to provide overconfident false predictions on unknown and corrupted instances. In this work we critically survey the literature and argue that notable lessons from open set recognition, identifying unknown examples outside of the observed set, and the adjacent field of active learning, querying data to maximize the expected performance gain, are frequently overlooked in the deep learning era. Hence, we propose a consolidated view to bridge continual learning, active learning and open set recognition in deep neural networks. Finally, the established synergies are supported empirically, showing joint improvement in alleviating catastrophic forgetting, querying data, selecting task orders, while exhibiting robust open world application.  ( 2 min )
    Neyman-Pearson Multi-class Classification via Cost-sensitive Learning. (arXiv:2111.04597v2 [stat.ML] UPDATED)
    Most existing classification methods aim to minimize the overall misclassification error rate. However, in applications, different types of errors can have different consequences. Two popular paradigms have been developed to account for this asymmetry issue: the Neyman-Pearson (NP) paradigm and the cost-sensitive (CS) paradigm. Compared to the CS paradigm, the NP paradigm does not require a specification of costs. Most previous works on the NP paradigm focused on the binary case. In this work, we study the multi-class NP problem by connecting it to the CS problem and propose two algorithms. We extend the NP oracle inequalities and consistency from the binary case to the multi-class case, showing that our two algorithms enjoy these properties under certain conditions. The simulation and real data studies demonstrate the effectiveness of our algorithms. To our knowledge, this is the first work to solve the multi-class NP problem via cost-sensitive learning techniques with theoretical guarantees. The proposed algorithms are implemented in the R package npcs on CRAN.  ( 2 min )
    Concentration Inequalities for Two-Sample Rank Processes with Application to Bipartite Ranking. (arXiv:2104.02943v3 [math.ST] UPDATED)
    The ROC curve is the gold standard for measuring the performance of a test/scoring statistic regarding its capacity to discriminate between two statistical populations in a wide variety of applications, ranging from anomaly detection in signal processing to information retrieval, through medical diagnosis. Most practical performance measures used in scoring/ranking applications such as the AUC, the local AUC, the p-norm push, the DCG and others, can be viewed as summaries of the ROC curve. In this paper, the fact that most of these empirical criteria can be expressed as two-sample linear rank statistics is highlighted and concentration inequalities for collections of such random variables, referred to as two-sample rank processes here, are proved, when indexed by VC classes of scoring functions. Based on these nonasymptotic bounds, the generalization capacity of empirical maximizers of a wide class of ranking performance criteria is next investigated from a theoretical perspective. It is also supported by empirical evidence through convincing numerical experiments.  ( 2 min )
    Upper and Lower Bounds on the Performance of Kernel PCA. (arXiv:2012.10369v2 [cs.LG] UPDATED)
    Principal Component Analysis (PCA) is a popular method for dimension reduction and has attracted an unfailing interest for decades. More recently, kernel PCA (KPCA) has emerged as an extension of PCA but, despite its use in practice, a sound theoretical understanding of KPCA is missing. We contribute several lower and upper bounds on the efficiency of KPCA, involving the empirical eigenvalues of the kernel Gram matrix and new quantities involving a notion of variance. These bounds show how much information is captured by KPCA on average and contribute a better theoretical understanding of its efficiency. We demonstrate that fast convergence rates are achievable for a widely used class of kernels and we highlight the importance of some desirable properties of datasets to ensure KPCA efficiency.  ( 2 min )
    Federated Learning Meets Multi-objective Optimization. (arXiv:2006.11489v2 [cs.LG] UPDATED)
    Federated learning has emerged as a promising, massively distributed way to train a joint deep model over a large number of edge devices while keeping private user data strictly on device. In this work, motivated by ensuring fairness among users and robustness against malicious adversaries, we formulate federated learning as multi-objective optimization and propose a new algorithm FedMGDA+ that is guaranteed to converge to Pareto stationary solutions. FedMGDA+ is simple to implement, has fewer hyperparameters to tune, and refrains from sacrificing the performance of any participating user. We establish the convergence properties of FedMGDA+ and point out its connections to existing approaches. Extensive experiments on a variety of datasets confirm that FedMGDA+ compares favorably against the state of the art.  ( 2 min )
    Improving Open-Set Semi-Supervised Learning with Self-Supervision. (arXiv:2301.10127v1 [cs.LG])
    Open-set semi-supervised learning (OSSL) is a realistic setting of semi-supervised learning where the unlabeled training set contains classes that are not present in the labeled set. Many existing OSSL methods assume that these out-of-distribution data are harmful and put effort into excluding data from unknown classes from the training objective. In contrast, we propose an OSSL framework that facilitates learning from all unlabeled data through self-supervision. Additionally, we utilize an energy-based score to accurately recognize data belonging to the known classes, making our method well-suited for handling uncurated data in deployment. We show through extensive experimental evaluations on several datasets that our method shows overall unmatched robustness and performance in terms of closed-set accuracy and open-set recognition compared with state-of-the-art for OSSL. Our code will be released upon publication.  ( 2 min )
    Double Matching Under Complementary Preferences. (arXiv:2301.10230v1 [stat.ML])
    In this paper, we propose a new algorithm for addressing the problem of matching markets with complementary preferences, where agents' preferences are unknown a priori and must be learned from data. The presence of complementary preferences can lead to instability in the matching process, making this problem challenging to solve. To overcome this challenge, we formulate the problem as a bandit learning framework and propose the Multi-agent Multi-type Thompson Sampling (MMTS) algorithm. The algorithm combines the strengths of Thompson Sampling for exploration with a double matching technique to achieve a stable matching outcome. Our theoretical analysis demonstrates the effectiveness of MMTS as it is able to achieve stability at every matching step, satisfies the incentive-compatibility property, and has a sublinear Bayesian regret over time. Our approach provides a useful method for addressing complementary preferences in real-world scenarios.  ( 2 min )
    How Jellyfish Characterise Alternating Group Equivariant Neural Networks. (arXiv:2301.10152v1 [cs.LG])
    We provide a full characterisation of all of the possible alternating group ($A_n$) equivariant neural networks whose layers are some tensor power of $\mathbb{R}^{n}$. In particular, we find a basis of matrices for the learnable, linear, $A_n$-equivariant layer functions between such tensor power spaces in the standard basis of $\mathbb{R}^{n}$. We also describe how our approach generalises to the construction of neural networks that are equivariant to local symmetries.  ( 2 min )
    A Robust Hypothesis Test for Tree Ensemble Pruning. (arXiv:2301.10115v1 [cs.LG])
    Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit to any tabular dataset in a scalable and computationally efficient way. Among the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but are lacking in robust theoretical justifications. In this paper we develop and present a novel theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this method instead of the common penalty terms leads to a significant reduction in out-of-sample loss. Additionally, this method provides a theoretically well-justified stopping condition for the tree growing algorithm. We also present several innovative extensions to the method, opening the door for a wide variety of novel tree pruning algorithms.  ( 2 min )
    Inducing Point Allocation for Sparse Gaussian Processes in High-Throughput Bayesian Optimisation. (arXiv:2301.10123v1 [cs.LG])
    Sparse Gaussian Processes are a key component of high-throughput Bayesian Optimisation (BO) loops; however, we show that existing methods for allocating their inducing points severely hamper optimisation performance. By exploiting the quality-diversity decomposition of Determinantal Point Processes, we propose the first inducing point allocation strategy designed specifically for use in BO. Unlike existing methods which seek only to reduce global uncertainty in the objective function, our approach provides the local high-fidelity modelling of promising regions required for precise optimisation. More generally, we demonstrate that our proposed framework provides a flexible way to allocate modelling capacity in sparse models and so is suitable for a broad range of downstream sequential decision making tasks.  ( 2 min )
    Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process. (arXiv:2301.09948v1 [physics.geo-ph])
    Point processes have been dominant in modeling the evolution of seismicity for decades, with the Epidemic Type Aftershock Sequence (ETAS) model being most popular. Recent advances in machine learning have constructed highly flexible point process models using neural networks to improve upon existing parametric models. We investigate whether these flexible point process models can be applied to short-term seismicity forecasting by extending an existing temporal neural model to the magnitude domain and we show how this model can forecast earthquakes above a target magnitude threshold. We first demonstrate that the neural model can fit synthetic ETAS data while requiring less computational time, because it is not dependent on the full history of the sequence. By artificially emulating short-term aftershock incompleteness in the synthetic dataset, we find that the neural model outperforms ETAS. Using a new enhanced catalog from the 2016-2017 Central Apennines earthquake sequence, we investigate the predictive skill of ETAS and the neural model with respect to the lowest input magnitude. Constructing multiple forecasting experiments using the Visso, Norcia and Campotosto earthquakes to partition training and testing data, we target M3+ events. We find both models perform similarly at previously explored thresholds (e.g., above M3), but lowering the threshold to M1.2 reduces the performance of ETAS but not that of the neural model. We argue that some of these gains are due to the neural model's ability to handle incomplete data. The robustness to missing data and speed to train the neural model present it as an encouraging competitor in earthquake forecasting.  ( 2 min )
    Adaptive Probabilistic Forecasting of Electricity (Net-)Load. (arXiv:2301.10090v1 [stat.AP])
    We focus on electricity load forecasting under three important specificities. First, our setting is adaptive; we use models taking into account the most recent observations available, yielding a forecasting strategy able to automatically respond to regime changes. Second, we consider probabilistic rather than point forecasting; indeed, uncertainty quantification is required to operate electricity systems efficiently and reliably. Third, we consider both conventional load (consumption only) and net-load (consumption less embedded generation). Our methodology relies on the Kalman filter, previously used successfully for adaptive point load forecasting. The probabilistic forecasts are obtained by quantile regressions on the residuals of the point forecasting model. We achieve adaptive quantile regressions using online gradient descent; we avoid choosing the gradient step size by considering multiple learning rates and aggregation of experts. We apply the method to two data sets: the regional net-load in Great Britain and the demand of seven large cities in the United States. Adaptive procedures improve forecast performance substantially in both use cases and for both point and probabilistic forecasting.  ( 2 min )
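    The quantile-regression step described in this abstract can be illustrated with a minimal sketch of online gradient descent on the pinball loss. Everything here is an assumption for illustration, not the authors' implementation: the names `pinball_loss` and `online_quantile` are hypothetical, the residual stream is synthetic Gaussian noise rather than Kalman-filter residuals, only a constant (intercept-only) quantile is tracked, and a single fixed step size is used instead of the multiple learning rates and expert aggregation the abstract mentions.

```python
import random


def pinball_loss(y, q_hat, tau):
    """Pinball (quantile) loss of the forecast q_hat for target y at level tau."""
    diff = y - q_hat
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def online_quantile(residuals, tau, step=0.01, q0=0.0):
    """Track the tau-quantile of a residual stream by online gradient descent.

    Assumption: one fixed step size; the abstract instead aggregates
    several learning rates, which is not reproduced here.
    """
    q = q0
    forecasts = []
    for r in residuals:
        forecasts.append(q)  # forecast is issued before r is observed
        # Subgradient of the pinball loss with respect to q.
        grad = -tau if r > q else (1.0 - tau)
        q -= step * grad
    return q, forecasts


# Synthetic stand-in for point-forecast residuals (hypothetical data).
rng = random.Random(0)
stream = [rng.gauss(0.0, 1.0) for _ in range(20000)]
q90, forecasts = online_quantile(stream, tau=0.9)
```

    Because the update only needs the latest residual, the estimate keeps adapting if the residual distribution drifts, which is the property the adaptive setting relies on.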
    Heterogeneous Domain Adaptation for IoT Intrusion Detection: A Geometric Graph Alignment Approach. (arXiv:2301.09801v1 [cs.CR])
    Data scarcity hinders the usability of data-dependent algorithms when tackling IoT intrusion detection (IID). To address this, we utilise the data-rich network intrusion detection (NID) domain to facilitate more accurate intrusion detection for IID domains. In this paper, a Geometric Graph Alignment (GGA) approach is leveraged to mask the geometric heterogeneities between domains for better intrusion knowledge transfer. Specifically, each intrusion domain is formulated as a graph where vertices and edges represent intrusion categories and category-wise interrelationships, respectively. The overall shape is preserved via a confused discriminator incapable of identifying adjacency matrices between different intrusion domain graphs. A rotation avoidance mechanism and a centre point matching mechanism are used to avoid graph misalignment due to rotation and symmetry, respectively. Besides, category-wise semantic knowledge is transferred to act as vertex-level alignment. To exploit the target data, a pseudo-label election mechanism that jointly considers network prediction, geometric property and neighbourhood information is used to produce fine-grained pseudo-label assignment. Upon aligning the intrusion graphs geometrically from different granularities, the transferred intrusion knowledge can boost IID performance. Comprehensive experiments on several intrusion datasets demonstrate state-of-the-art performance of the GGA approach and validate the usefulness of GGA's constituent components.  ( 2 min )
    Context-specific kernel-based hidden Markov model for time series analysis. (arXiv:2301.09870v1 [stat.ML])
    Traditional hidden Markov models have been a useful tool to understand and model stochastic dynamic linear data; in the case of non-Gaussian data or data that are not linear in the mean, models such as mixture of Gaussian hidden Markov models suffer from the computation of precision matrices and have many unnecessary parameters. As a consequence, such models often perform better when it is assumed that all variables are independent, a hypothesis that may be unrealistic. Hidden Markov models based on kernel density estimation are also capable of modeling non-Gaussian data, but they assume independence between variables. In this article, we introduce a new hidden Markov model based on kernel density estimation, which is capable of introducing kernel dependencies using context-specific Bayesian networks. The proposed model is described, together with a learning algorithm based on the expectation-maximization algorithm. Additionally, the model is compared with related HMMs using synthetic and real data. From the results, the benefits in likelihood and classification accuracy from the proposed model are quantified and analyzed.  ( 2 min )
    On Dynamic Regret and Constraint Violations in Constrained Online Convex Optimization. (arXiv:2301.09808v1 [cs.LG])
    A constrained version of the online convex optimization (OCO) problem is considered. With slotted time, for each slot, first an action is chosen. Subsequently, the loss function and the constraint violation penalty evaluated at the chosen action point are revealed. For each slot, both the loss function and the function defining the constraint set are assumed to be smooth and strongly convex. In addition, once an action is chosen, local information about a feasible set within a small neighborhood of the current action is also revealed. An algorithm is allowed to compute at most one gradient at its point of choice given the described feedback to choose the next action. The goal of an algorithm is to simultaneously minimize the dynamic regret (loss incurred compared to the oracle's loss) and the constraint violation penalty (penalty accrued compared to the oracle's penalty). We propose an algorithm that follows projected gradient descent over a suitably chosen set around the current action. We show that both the dynamic regret and the constraint violation are order-wise bounded by the path-length, the sum of the distances between the consecutive optimal actions. Moreover, we show that the derived bounds are the best possible.  ( 2 min )
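To make the quantities in this abstract concrete, here is a toy sketch of projected online gradient descent with one gradient per slot, together with the dynamic regret and the path-length of the per-slot optima. It is not the paper's algorithm: the constraint set is a fixed interval rather than a time-varying set, the losses are the scalar quadratics f_t(x) = (x - c_t)^2, and all names and parameter values are illustrative assumptions.

```python
def project(x, low, high):
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, x))


def run_projected_ogd(targets, low, high, step=0.25, x0=0.0):
    """One projected gradient step per slot on f_t(x) = (x - c_t)^2.

    Returns the dynamic regret against the per-slot constrained minimizers
    and the path-length of those minimizers.
    """
    x = x0
    regret = 0.0
    path_length = 0.0
    prev_opt = None
    for c in targets:
        opt = project(c, low, high)          # the oracle's action for this slot
        regret += (x - c) ** 2 - (opt - c) ** 2
        if prev_opt is not None:
            path_length += abs(opt - prev_opt)
        prev_opt = opt
        grad = 2.0 * (x - c)                 # single gradient, revealed after acting
        x = project(x - step * grad, low, high)
    return regret, path_length


# Static optimum: the path-length is zero and the regret stays bounded.
static_regret, static_path = run_projected_ogd([1.0] * 50, low=-2.0, high=2.0)

# Drifting optimum: the oracle moves from 0 to the boundary at 2 and stays there.
drift_targets = [0.05 * t for t in range(100)]
drift_regret, drift_path = run_projected_ogd(drift_targets, low=-2.0, high=2.0)
```

Comparing the two runs shows the qualitative behaviour the bound describes: the regret grows with how far the optimum travels, not with the number of slots alone.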
    Optimizing the Noise in Self-Supervised Learning: from Importance Sampling to Noise-Contrastive Estimation. (arXiv:2301.09696v1 [stat.ML])
    Self-supervised learning is an increasingly popular approach to unsupervised learning, achieving state-of-the-art results. A prevalent approach consists in contrasting data points and noise points within a classification task: this requires a good noise distribution which is notoriously hard to specify. While a comprehensive theory is missing, it is widely assumed that the optimal noise distribution should in practice be made equal to the data distribution, as in Generative Adversarial Networks (GANs). We here empirically and theoretically challenge this assumption. We turn to Noise-Contrastive Estimation (NCE) which grounds this self-supervised task as an estimation problem of an energy-based model of the data. This ties the optimality of the noise distribution to the sample efficiency of the estimator, which is rigorously defined as its asymptotic variance, or mean-squared error. In the special case where the normalization constant only is unknown, we show that NCE recovers a family of Importance Sampling estimators for which the optimal noise is indeed equal to the data distribution. However, in the general case where the energy is also unknown, we prove that the optimal noise density is the data density multiplied by a correction term based on the Fisher score. In particular, the optimal noise distribution is different from the data distribution, and is even from a different family. Nevertheless, we soberly conclude that the optimal noise may be hard to sample from, and the gain in efficiency can be modest compared to choosing the noise distribution equal to the data's.  ( 2 min )
    Flexible conditional density estimation for time series. (arXiv:2301.09671v1 [stat.ME])
    This paper introduces FlexCodeTS, a new conditional density estimator for time series. FlexCodeTS is a flexible nonparametric conditional density estimator, which can be based on an arbitrary regression method. It is shown that FlexCodeTS inherits the rate of convergence of the chosen regression method. Hence, FlexCodeTS can adapt its convergence by employing the regression method that best fits the structure of data. From an empirical perspective, FlexCodeTS is compared to NNKCDE and GARCH in both simulated and real data. FlexCodeTS is shown to generally obtain the best performance among the selected methods according to either the CDE loss or the pinball loss.  ( 2 min )
    Improved Rate of First Order Algorithms for Entropic Optimal Transport. (arXiv:2301.09675v1 [math.OC])
    This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport. The resulting rate for approximating the optimal transport (OT) has been improved from $\widetilde{{O}}({n^{2.5}}/{\epsilon})$ to $\widetilde{{O}}({n^2}/{\epsilon})$, where $n$ is the problem size and $\epsilon$ is the accuracy level. In particular, we propose an accelerated primal-dual stochastic mirror descent algorithm with variance reduction. Such special design helps us improve the rate compared to other accelerated primal-dual algorithms. We further propose a batch version of our stochastic algorithm, which improves the computational performance through parallel computing. To compare, we prove that the computational complexity of the Stochastic Sinkhorn algorithm is $\widetilde{{O}}({n^2}/{\epsilon^2})$, which is slower than our accelerated primal-dual stochastic mirror algorithm. Experiments are done using synthetic and real data, and the results match our theoretical rates. Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{{O}}({n^2}/{\epsilon})$ for solving OT.  ( 2 min )
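For readers unfamiliar with the problem, the classical (deterministic) Sinkhorn iteration for entropy-regularized optimal transport can be sketched in a few lines. This is the textbook baseline, not the accelerated primal-dual stochastic mirror descent method the abstract proposes, and the tiny two-point problem, the value of `eps`, and the iteration count are assumptions chosen for illustration.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, n_iters=500):
    """Classical Sinkhorn iterations for entropy-regularized optimal transport.

    a, b  : source and target marginals (lists of positive numbers summing to 1)
    cost  : cost[i][j] of moving mass from source i to target j
    eps   : entropic regularization strength
    Returns the transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows, then columns, so the plan's marginals approach a and b.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

Each iteration touches every entry of the n-by-n kernel, which is where the n^2 factor in the complexity bounds comes from. Very small `eps` values make the kernel entries underflow, so practical implementations usually work in the log domain.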

  • Open

    Just found a new Chrome extension called IntelliMail that uses AI to write emails. It's super easy to use and can be used to land internships, jobs and up your email game.
    submitted by /u/bobsandalex [link] [comments]  ( 40 min )
    I created an 'AI Tools' series on YouTube and I'd love your feedback! Today is About Jasper AI
    I would love to know your opinion about what I should improve in my videos, and of course, if you don't know about Jasper AI, give it a look; it's a great AI tool for content creation https://youtu.be/x_6rzsBVABg submitted by /u/sigmabruuh [link] [comments]  ( 6 min )
    OpenAI's breakthrough
    https://twitter.com/make_mhe/status/1618255363580755968 submitted by /u/bradasm [link] [comments]  ( 40 min )
    I asked an AI image generator to show me a "typical Discord user"
    submitted by /u/MSAPW [link] [comments]  ( 40 min )
    AI Dream 150 - SUPERNOVA IMMINENT Part1 TEASER - AI Video vqgan clip
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    I am recommending you something
    It's an amazing app called Simplywrite. It gives you 20 credits for free. It lets you generate articles and outlines. Totally a free tool to use when you're bored. LINK TO APP: https://simplywrite.com submitted by /u/benignkirby [link] [comments]  ( 40 min )
    Anonymize tabular data to meet GDPR privacy requirements
    submitted by /u/Repeat-or [link] [comments]  ( 40 min )
    I made a list of tools powered by AI
    submitted by /u/Alen0tv [link] [comments]  ( 40 min )
    How to automatically generate test cases in NLP?
    submitted by /u/Nazma2015 [link] [comments]  ( 40 min )
    Best TTS for cheesy movie trailer guy voice?
    Looking for a TTS that can generate that really deep movie trailer male voice. Any ideas? submitted by /u/infinitycurvature [link] [comments]  ( 40 min )
    Best video style transfer program as of now?
    Looking for something where you can capture a face speaking and replace it with a completely different character (including neck, face, shoulders). Looking to replace a human with an animated character. What options are there as of now? I know ebsynth, but it's not too good. There is vtuber software, but the results look too "anime". Looking for something that would give a very "photorealistic" result, think Avatar quality output or a Pixar movie. What is the best that's out there right now? submitted by /u/UpperStruggle2421 [link] [comments]  ( 41 min )
    Beware Loab, the digital cryptid lurking in AI's forgotten space
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    Will coders and writers just be doing QA of AI output?
    So 5 years ago, the main career and hobby advice I had was "look to see if programming is a thing for you, it's at least a fun hobby and maybe you can find a niche as one of the last of the custom craftspeople" But now... where do you go, in a society where wealth tends to get sucked upward, to find a good living? Do you learn to roll with being a bot manager? (Assuming the AIs don't just manage themselves?) Do you look for trades of manipulating physical stuff that haven't yet lent themselves to cybernetics? submitted by /u/kirk_is_ [link] [comments]  ( 44 min )
    merge photo with ai
    Hi everyone, I have two different photos (not images). How can I merge them using AI? submitted by /u/Aigerim_D40 [link] [comments]  ( 40 min )
    What is your preferred Image Generation API / App?
    It is really difficult to benchmark Text-to-Image AIs, since it depends on so many aspects: speed, styles, precision of the prompt, interface, fine-tuning, etc. So I think the best approach is to see which are the most preferred by the people who use Stable Diffusion API. Do not hesitate to explain your choice in comments, and also mention APIs that are not in the Poll, I am limited to 6 options... I know that I did not include Midjourney, Artbreeder, Stable Diffusion, NightCafe, Craiyon, Starry AI and many others, but I am interested only in those which provide an API. PS: this isn't promotional at all, I am not working for any of those companies. View Poll submitted by /u/JerLam2762 [link] [comments]  ( 42 min )
    I asked an AI image generator to show me a "typical Reddit user"
    submitted by /u/MrsChenHW [link] [comments]  ( 41 min )
    Yann LeCun, Meta’s Chief AI Scientist, Has Some Harsh Criticism Of ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    A New Free AI Model InstructPix2Pix To Transform Images By Plain English Instructions
    submitted by /u/CeFurkan [link] [comments]  ( 40 min )
    Fearing ChatGPT, Google enlists founders Brin and Page in AI fight
    submitted by /u/SAT0725 [link] [comments]  ( 40 min )
    Being really humorous under the pressure of billions of prompt requests
    submitted by /u/Imagine-your-success [link] [comments]  ( 41 min )
    Humanity's Quest to Decode Animal Languages Through AI
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    The Connection Between Science Fiction and Artificial Intelligence: A Survey Study
    Hello everyone, I am a student in AP Research. For my project, I am conducting a survey to analyze the connection between science fiction and technology (specifically Artificial Intelligence). This survey (linked) asks a few questions about your knowledge of Sci-fi, Artificial Intelligence, and the connection between the two. It should not take more than 10 minutes of your time. If you are interested, the link to the form is below: https://docs.google.com/forms/d/e/1FAIpQLScY_VaNI-CEtTiJiLHgYCCguEZ7m9DUdQoxvFTjXFFLOGu2KA/viewform If you have additional questions, my email is in the linked google form. Thank you for your participation, it is deeply appreciated! submitted by /u/rsantos05 [link] [comments]  ( 41 min )
  • Open

    [D] Alphatensor benchmark code in Colab
    Hello everybody, I was wondering if anybody has tried to run the main factorisation code https://github.com/deepmind/alphatensor/blob/main/benchmarking/factorizations.py from AlphaTensor on Google Colab, with Colab's GPUs (Tesla T4). I know that the Tesla T4 is not the same as the V100 used in DeepMind's paper; however, I can see that the tensor formulation for the matrix multiplication is highly inefficient compared to standard JAX matrix multiplication. Any suggestions as to where I am going wrong? submitted by /u/IndependentIce4553 [link] [comments]  ( 42 min )
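As background for the question, a "tensor formulation" of matrix multiplication is a low-rank factorization of the multiplication tensor into three factor matrices. The sketch below uses Strassen's classical rank-7 factorization for the 2x2 case as a stand-in; it is not one of the factorizations shipped in the AlphaTensor repository, and the names `U`, `V`, `W` and the flattened layout are choices made here for illustration.

```python
# Rank-7 factorization of 2x2 matrix multiplication (Strassen's algorithm).
# Row r of U and V defines the r-th product; W recombines the products into C.
# Matrices are flattened row-major: [x11, x12, x21, x22].
U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
W = [[1, 0, 0, 1, -1, 0, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0, 0, 0], [1, -1, 1, 0, 0, 1, 0]]


def dot(coeffs, values):
    """Linear combination of values with the given integer coefficients."""
    return sum(c * x for c, x in zip(coeffs, values))


def factorized_matmul_2x2(a, b):
    """Multiply two 2x2 matrices with 7 scalar products instead of 8."""
    a_flat = [a[0][0], a[0][1], a[1][0], a[1][1]]
    b_flat = [b[0][0], b[0][1], b[1][0], b[1][1]]
    products = [dot(U[r], a_flat) * dot(V[r], b_flat) for r in range(7)]
    c_flat = [dot(W[k], products) for k in range(4)]
    return [c_flat[0:2], c_flat[2:4]]


def naive_matmul_2x2(a, b):
    """Reference implementation using the usual 8 scalar products."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
fast = factorized_matmul_2x2(a, b)
slow = naive_matmul_2x2(a, b)
```

The factorized form trades one multiplication for many extra additions, so it can only pay off when the entries are themselves large matrix blocks and the additions are cheap by comparison. On small inputs, or when the extra additions are not fused efficiently, the overhead may well dominate, which could be one reason a benchmark looks slower than a plain matrix multiplication.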
    [N] Upcoming talk: "Open Problems in Deep Neural Networks: An Optimal Control Perspective"
    Open Problems in Deep Neural Networks: An Optimal Control Perspective Feb 13, 6:30 ET About the talk: Backpropagation is a widely used algorithm for training neural networks. Its key step, Stochastic Gradient Descent (SGD), has become one of the bedrocks of deep learning. Despite wide adoption, mathematically rigorous study of SGD's convergence for deep neural networks is still ongoing. Join us as graduate student Amoolya Tirumalai discusses an approach to the convergence problem inspired by optimal control theory. Following the Pontryagin maximum principle, an alternative forward-backward iterative system will be described. Toy examples will be shown, and problems in robustness and security will be discussed. About the speaker: Amoolya Tirumalai is a 4th year PhD Candidate in Electrical Engineering at the University of Maryland, College Park. His interests are (robust) optimal control, partial differential equations, differential games, mean-field games, safety-critical control, and (robust) machine learning. His thesis titled 'Multi-agent inference, decision-making and control: models, structure and performance evaluation' will be defended in August 2023. Mr. Tirumalai was conferred a BS in Biomedical Engineering from the Georgia Institute of Technology in 2018. submitted by /u/what_comes_next [link] [comments]  ( 43 min )
    [D] Publication Resume
    If we submit a publication to ICML and it is under anonymous review, can I list the title and authors on my resume which will be on my personal webpage? submitted by /u/BigDreamx [link] [comments]  ( 43 min )
    [R] Blogpost on comparing Chatbots like ChatGPT, LaMDA, Sparrow, BlenderBot 3, and Claude
    https://huggingface.co/blog/dialog-agents breaks down the techniques behind ChatGPT -- instruction fine-tuning, supervised fine-tuning, chain-of-thought, red teaming, and more. https://preview.redd.it/fv16fsemd9ea1.png?width=889&format=png&auto=webp&s=a8f24de27c40a946fec64eaa674f81ddef0d0cc3 submitted by /u/emailnazneen [link] [comments]  ( 42 min )
    [D] Efficient retrieval of research information for graduate research
    I have a lot of notes about research papers in a particular directory, and the number of files has grown beyond what I can remember off the top of my head. It will keep growing, and I have begun to wonder about the most efficient way to retrieve the information. I could use ripgrep and regular expressions to find the notes efficiently, but I imagine that if the database is very large and I don't use the correct regular expression, I might not retrieve the correct files. I was impressed by how ChatGPT presents info from the internet and speeds up finding information even when I do not know the correct keywords. I figured an NLP model trained primarily on my database would be an easier task, and I was wondering if someone has already created something like this as open source, or how they would go about it? submitted by /u/waterstrider123 [link] [comments]  ( 46 min )
    [D] Accurate data or more data?
    If you are building a model and had the choice, would you prefer more accurate (~99%) but less data or a lot more data but less accurate (~90%)? submitted by /u/NoSympathy9787 [link] [comments]  ( 43 min )
    [R] Best service for scientific paper correction
    Hello, Anyone ever used a paper revision service and can recommend one ? I’m publishing my first paper next month and I want to have feedback from an expert on this domain. Thanks ! submitted by /u/Meddhouib10 [link] [comments]  ( 43 min )
    [D] Self-Supervised Contrastive Approaches that don’t use large batch size.
    This thread is dedicated to exploring the various techniques used in self-supervised contrastive learning that utilize standard batch sizes. I am seeking information on the current methods in this field, specifically those that do not rely on large batch sizes. I am familiar with the SimSiam paper published by META research, which utilizes 256 batch size for 8-GPUs. However, for individuals with limited resources such as myself, access to a large number of GPUs may not be feasible. As a result, I am interested in learning about other methods that can be used with smaller batch sizes and a single GPU, such as those that would be suitable for training on 1024x1024 input images. Additionally, I am curious about any more efficient architectures that have been developed in this field. This includes, but is not limited to, techniques used in natural language processing that may have applications in other areas of artificial intelligence. ***posted the same question in PyTorch forums, reposting here for wider reach. submitted by /u/shingekichan1996 [link] [comments]  ( 46 min )
    [R] Tsetlin Machine in Medical Research - Striking Differences Between Tsetlin Machine Interpretability and Deep Learning Attention
    ​ Tsetlin machine interpretability vs deep learning attention. Researchers at West China Hospital, Sichuan University, NORCE, and UiA have developed a Tsetlin machine-based architecture for premature ventricular contraction identification by analyzing long-term ECG signals. The experiments show that the Tsetlin machine is capable of producing human-interpretable rules, consistent with the clinical standard and medical knowledge. Simultaneously, the accuracy was comparable with deep CNN-based models. Paper: https://arxiv.org/abs/2301.10181 submitted by /u/olegranmo [link] [comments]  ( 43 min )
    [R] INSTRUCTOR One Embedder, Any Task: Instruction-Finetuned Text Embeddings Paper Explanation and Colab Demo
    In this video I explain INSTRUCTOR, an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. INSTRUCTOR achieves SOTA on 70 diverse embedding tasks! I also show a Google Colab demo of INSTRUCTOR https://youtu.be/vg38cq3KJ6M submitted by /u/Sea-Photo5230 [link] [comments]  ( 42 min )
    [R] Easiest way to train RNNs in MATLAB or Julia?
    I work as a researcher and am kind of new to neural networks. I have an RNN (1e4 x 1e4 network) that I would like to train in either MATLAB or Julia. One option I considered is writing my own code for Hessian-free optimization, but the implementation details are really, really hard to figure out. I am aware there is a Theano or TF implementation of HFO, but I am primarily interested in having the code in MATLAB/Julia. Also, are there better/alternative techniques than Hessian-free optimization for training RNNs? submitted by /u/NadaBrothers [link] [comments]  ( 44 min )
    Can an AI model licensed under the BigScience RAIL License v1.0 such as BLOOM be used in a program that is useful for any domain? [D]
    Example: the AI model BLOOM is licensed under the BigScience RAIL License v1.0. The BigScience RAIL License v1.0 forbids some types of usage: You agree not to use the Model or Derivatives of the Model: [...] To provide medical advice and medical results interpretation; To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use). Am I allowed to use BLOOM in a program that is useful for any domain (e.g., a program to summarize or paraphrase some text, or perform question-answer on a text, or generate questions and their answers based on the text)? Since people could use the program for any domain, they could technically, for example, use the program to summarize a medical report or generate questions and their answers based on some asylum process to distribute to potential applicants. submitted by /u/Franck_Dernoncourt [link] [comments]  ( 44 min )
  • Open

    DM Control Suite vs. Original Environments
    I’m testing out DM control suite as I’d ideally like to do some stuff with the MuJoCo environments, e.g. Hopper. However, it seems as though they’ve changed Hopper from the OpenAI version? For instance, the action space is now 4-dimensional, and the bigger concern for me is that the reward seems to be specified differently. Per the gym documentation, the reward was healthy_reward + forward_reward - ctrl_cost, but when I’ve just started using the control suite version all rewards seem to be 0. The documentation for the control suite is quite poor; it says that for the hop task the agent is rewarded for torso height and forward velocity. It also doesn’t explain what the action dimensions correspond to (including the new dimension), so I can’t even manually test it! submitted by /u/DefinitelyNot4Burner [link] [comments]  ( 41 min )
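For reference, the Gym-style reward quoted in the post, healthy_reward + forward_reward - ctrl_cost, can be written out as a small function. This is a sketch of the formula only, not code from either library: the function name, the treatment of `is_healthy`, and the default weights are assumptions that should be checked against the Gym source for the environment version in use.

```python
def gym_hopper_reward(x_velocity, action, is_healthy=True,
                      forward_reward_weight=1.0,   # assumed default
                      ctrl_cost_weight=1e-3,       # assumed default
                      healthy_reward=1.0):         # assumed default
    """Gym-style Hopper reward: healthy_reward + forward_reward - ctrl_cost."""
    forward_reward = forward_reward_weight * x_velocity
    ctrl_cost = ctrl_cost_weight * sum(a * a for a in action)
    healthy_bonus = healthy_reward if is_healthy else 0.0
    return healthy_bonus + forward_reward - ctrl_cost


# A 3-dimensional action, as in the Gym version of Hopper.
example_reward = gym_hopper_reward(x_velocity=2.0, action=[1.0, 0.0, -1.0])
```

A reward of this additive form is non-zero on almost every step because of the healthy bonus, so an environment that returns exactly 0 at the start is probably using a differently shaped reward rather than a rescaled version of this one.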
    Does action masking reduce the ability of the agent to learn game rules?
    I recently experimented with training an sb3 PPO agent on a pretty complicated board game environment (just for fun). At first, I did regular PPO with an invalid action penalty, but it was making a lot of invalid moves and thus getting penalized and terminated early. It very slowly picked up on the signal and started to learn, but much too slowly to get any good results. After days of training, it could usually only play a handful of opening moves. On the other hand, I trained a Masked PPO in the same environment and it rapidly became quite good and was able to play relatively competitively after a few days of training. However, when I examined the outputs in an unmasked setting, it had little-to-no understanding of the game rules. It could still play OK but did not rank valid moves as the highest. This is a problem because I wanted to use it in a non-simulator setting without having to explicitly manually mask the moves by hand (or else convert a game state to a mask, both of which are tedious in my situation). Is this behavior expected? I have read some analyses that suggest that 1) MaskedPPO is much more sample efficient and should converge to a stronger agent MUCH faster, which makes sense, but also that 2) Even despite the invalid action masking, the agent should still learn game mechanics by proxy. If it's only being rewarded for making valid moves, it should learn to not make invalid moves implicitly since it never gets a reward signal for them (rather than being explicitly penalized). Thoughts? I only have a weak background in RL so apologies if this is naive. TLDR: Does action masking make the policy (or reward) network lazy? submitted by /u/TobusFire [link] [comments]  ( 44 min )
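The behaviour described in this post can be illustrated with a minimal masked softmax. It is a generic sketch, not the sb3-contrib MaskablePPO internals: the function names and the example logits are made up, and it assumes at least one action is valid.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_softmax(logits, valid_mask):
    """Softmax restricted to valid actions; invalid ones get probability 0.

    Assumes at least one entry of valid_mask is True.
    """
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    return softmax(masked)


logits = [2.0, 1.0, 0.5, 3.0]          # raw policy outputs for four moves
valid = [True, True, False, False]     # the environment forbids moves 2 and 3

raw_probs = softmax(logits)
masked_probs = masked_softmax(logits, valid)
raw_best = max(range(len(logits)), key=raw_probs.__getitem__)
masked_best = max(range(len(logits)), key=masked_probs.__getitem__)
```

Here the raw distribution prefers an invalid move while the masked one never selects it. Because a masked action has probability exactly zero, its logit receives no direct gradient from the policy loss, so nothing pushes the raw logits of invalid moves down during training. That would be consistent with the observation that the unmasked outputs do not rank valid moves highest, although shared network weights mean the effect is not guaranteed.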
    Any List of Videogames With Reinforcement Learning Agents Developed?
    Is there any list of videogames for which agents using reinforcement learning have been developed? Enquiring minds wanna know. submitted by /u/sanman [link] [comments]  ( 42 min )
    Learning to Exploit Elastic Actuators for Quadruped Locomotion
    submitted by /u/araffin2 [link] [comments]  ( 40 min )
    What is the limit on parallel environments?
    Is there some sort of hard or practical limit on the number of parallel environments that can be used? In RLlib, when I try to use more than 7 or 8 I get a scheduling error, yet I see people talking about 32 or 512 environments. What’s the limit? Is there some way for me to increase the number I can train on? For example, my GPU seems underutilised but my CPU is very stressed; can I increase GPU usage in RLlib? I have already set the number of GPUs to one. submitted by /u/centripetalstranger [link] [comments]  ( 42 min )
    Weird convergence of PPO reward when reducing number of envs
    Hi all, I am using Isaac Gym, which supports training with multiple parallel environments. However, the reward of the best environment differs hugely between training the agent with 512 environments (green) and with 32 environments (orange), see below. I understand that training should be slower when using fewer environments at the same time, but this difference tells me that I am missing something here... Does anyone have some hints? https://preview.redd.it/3rf0ax8653ea1.png?width=1589&format=png&auto=webp&s=b12defff668f186381b32c9f0385499b4413a3cd Below you can see the configs that I used for the PPO algorithm:
    config:
      name: ${resolve_default:CustomTask,${....experiment}}
      full_experiment_name: ${.name}
      env_name: rlgpu
      ppo: True
      mixed_precision: False
      normalize_input: True
      normalize_value: True
      value_bootstrap: True
      num_actors: ${....task.env.numEnvs}
      reward_shaper:
        scale_value: 1.0
      normalize_advantage: True
      gamma: 0.99
      tau: 0.95
      learning_rate: 5e-4
      lr_schedule: adaptive
      kl_threshold: 0.008
      score_to_win: 10000000
      max_epochs: ${resolve_default:5000,${....max_iterations}}
      save_best_after: 200
      save_frequency: 100
      print_stats: False
      use_action_masks: False
      grad_norm: 1.0
      entropy_coef: 0.0001
      truncate_grads: True
      e_clip: 0.2
      horizon_length: 32 # num_envs * horizon length % minibatch_size
      minibatch_size: 1024
      mini_epochs: 8
      critic_coef: 4
      clip_value: True
      seq_len: 4
      bounds_loss_coef: 0.0001
    submitted by /u/Fun-Moose-3841 [link] [comments]  ( 41 min )
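One thing worth checking in a setup like this is how the rollout batch relates to the minibatch size when the number of environments changes. The helper below is a hypothetical back-of-the-envelope calculation, not part of Isaac Gym or the PPO library used in the post; it assumes the usual PPO loop of one gradient step per minibatch per mini-epoch and uses the values from the posted config.

```python
def ppo_batch_stats(num_envs, horizon_length, minibatch_size, mini_epochs):
    """Rollout batch size and gradient steps per PPO epoch (assumed standard loop)."""
    batch_size = num_envs * horizon_length
    if batch_size % minibatch_size != 0:
        raise ValueError("num_envs * horizon_length must be divisible by minibatch_size")
    minibatches = batch_size // minibatch_size
    return {
        "batch_size": batch_size,
        "minibatches": minibatches,
        "gradient_steps": minibatches * mini_epochs,
    }


# Values taken from the posted config: horizon_length 32, minibatch_size 1024, mini_epochs 8.
stats_512 = ppo_batch_stats(num_envs=512, horizon_length=32, minibatch_size=1024, mini_epochs=8)
stats_32 = ppo_batch_stats(num_envs=32, horizon_length=32, minibatch_size=1024, mini_epochs=8)
```

Under these assumptions, 512 environments give 16 minibatches and 128 gradient steps per epoch, while 32 environments give a single minibatch and 8 gradient steps. Each epoch with 32 environments therefore sees 16 times less data and far fewer updates, so comparing the two runs per epoch rather than per environment step may exaggerate the gap. Whether this fully explains the difference is a guess.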
  • Open

    Ten Productivity Hacks using ChatGPT Generative AI Prompts
    Generative AI is suddenly everywhere. Because of this, the future of AI looks very bright indeed. There are many opportunities for generative AI to impact life and business in both positive and negative ways in the near future. Because the consequence of negative human impacts can easily far outweigh the benefits of positive human impacts, the… Read More »Ten Productivity Hacks using ChatGPT Generative AI Prompts The post Ten Productivity Hacks using ChatGPT Generative AI Prompts appeared first on Data Science Central.  ( 21 min )
    Innovation at the Convergence of Emerging Technologies: Business at the Edge
    In the context of digital transformation and innovation, there is no lack of “hot topics” to discuss. Emerging technologies are truly emerging everywhere. What is most exciting – and what demonstrates their greatest promise – is that these new technologies are converging to produce innovative new businesses, products, and services. Over the past decade, we… Read More »Innovation at the Convergence of Emerging Technologies: Business at the Edge The post Innovation at the Convergence of Emerging Technologies: Business at the Edge appeared first on Data Science Central.  ( 22 min )
    Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems)
    In a recent article on Autonomous Intelligent Systems (AIS) [1], Ajit Jaokar described various features and characteristics of such systems, including associated technologies and research areas, building blocks and core elements, critical factors for success, and cross-cutting enablers. He introduces AIS as an “emerging interdisciplinary field that deals with situations where humans interact with AI systems… Read More »Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) The post Five Principles of Safe Driving in AIS (Autonomous Intelligent Systems) appeared first on Data Science Central.  ( 23 min )
  • Open

    Learning with Queried Hints
    Posted by Sreenivas Gollapudi, Senior Staff Research Scientist, and Kostas Kollias, Staff Research Scientist, Google Research, Algorithms & Optimization Team In many computing applications the system needs to make decisions to serve requests that arrive in an online fashion. Consider, for instance, the example of a navigation app that responds to driver requests. In such settings there is inherent uncertainty about important aspects of the problem. For example, the preferences of the driver with respect to features of the route are often unknown and the delays of road segments can be uncertain. The field of online machine learning studies such settings and provides various techniques for decision-making problems under uncertainty. A navigation engine has to decide how to route thi…  ( 92 min )
  • Open

    Newbie in need – very bad NLP text generator needed
    Hello y'all, after spending an hour typing various combinations of "AI", "TEXT GENERATOR", "DATA FEEDING" and such, I come to you with a humble request: can someone recommend an AI text generator that needs to be fed actual, existing text(s) instead of being given a prompt? I need something that will create a text based on an essay I will upload, and I really don't need the result to be super great. I will accept and in fact warmly welcome any nonsensical output, as I need the AI to spew out trash, actually. I just need that trash to resemble things that actually exist. Grammatical errors are great, and idiotic sentences like "actually, when the orange fades into original edition of Alexander, then how do we expect the succession to leave Canada?" are what I need; the less the final product sounds like what a human could write, the better. I don't have the proper lingo to google what I need to find, so I am deeply grateful for any suggestions. submitted by /u/lindybopperette [link] [comments]  ( 42 min )
    Computer Vision Development
    Hey!! I'm new to this side of things. I studied psych research, always had an interest in data visualization and neuroscience but didn't realize I should be piecing the two interests together and I have been too intimidated to take on the task of learning computer science. But I can't help myself any longer! I'm so fascinated and think reddit could be a great place to learn and chat about concepts. Any who... YA! I've started watching https://www.youtube.com/watch?v=vT1JzLTH4G4&list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv&index=1 and already can't believe we haven't solved the process of vision. Have we? Can we? The meta is getting to me. submitted by /u/angelacarolei [link] [comments]  ( 41 min )
  • Open

    Research Focus: Week of January 23, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Revolutionizing Document AI with multimodal document foundation models   Organizations must digitize various documents, many with charts and images, to manage and streamline essential functions. Yet manually […] The post Research Focus: Week of January 23, 2023 appeared first on Microsoft Research.  ( 8 min )
    Biomedical Research Platform Terra Now Available on Microsoft Azure
    We stand at the threshold of a new era of precision medicine, where health and life sciences data hold the potential to dramatically propel and expand our understanding and treatment of human disease. One of the tools that we believe will help to enable precision medicine is Terra, the secure biomedical research platform co-developed by […] The post Biomedical Research Platform Terra Now Available on Microsoft Azure appeared first on Microsoft Research.  ( 9 min )
  • Open

    Build a loyalty points anomaly detector using Amazon Lookout for Metrics
    Today, gaining customer loyalty cannot be a one-off thing. A brand needs a focused and integrated plan to retain its best customers—put simply, it needs a customer loyalty program. Earn and burn programs are one of the main paradigms. A typical earn and burn program rewards customers after a certain number of visits or spend. […]  ( 7 min )
    Explain text classification model predictions using Amazon SageMaker Clarify
    Model explainability refers to the process of relating the prediction of a machine learning (ML) model to the input feature values of an instance in humanly understandable terms. This field is often referred to as explainable artificial intelligence (XAI). Amazon SageMaker Clarify is a feature of Amazon SageMaker that enables data scientists and ML engineers […]  ( 10 min )
    Upscale images with Stable Diffusion in Amazon SageMaker JumpStart
    In November 2022, we announced that AWS customers can generate images from text with Stable Diffusion models in Amazon SageMaker JumpStart. Today, we announce a new feature that lets you upscale images (resize images without losing quality) with Stable Diffusion models in JumpStart. An image that is low resolution, blurry, and pixelated can be converted […]  ( 10 min )
    Cohere brings language AI to Amazon SageMaker
    It’s an exciting day for the development community. Cohere’s state-of-the-art language AI is now available through Amazon SageMaker. This makes it easier for developers to deploy Cohere’s pre-trained generation language model to Amazon SageMaker, an end-to-end machine learning (ML) service. Developers, data scientists, and business analysts use Amazon SageMaker to build, train, and deploy ML models quickly and easily using its fully managed infrastructure, tools, and workflows.  ( 6 min )
  • Open

    Braced From Space: Startup Keeps Watchful Eye on Gas Pipeline Leaks Across the Globe
    As its name suggests, Orbital Sidekick is creating technology that acts as a buddy in outer space, keeping an eye on the globe using satellites to help keep it safe and sustainable. The San Francisco-based startup, a member of the NVIDIA Inception program, enables commercial and government users to optimize sustainable operations and security with […]  ( 6 min )
    NVIDIA CEO Ignites AI Conversation in Stockholm
    Jensen Huang headlines Stockholm AI confab, Berzelius supercomputer upgraded to 94 NVIDIA DGX A100 systems.  ( 6 min )
  • Open

    Number of bits in a particular integer
    When I think of bit twiddling, I think of C. So I was surprised to read Paul Khuong saying he thinks of Common Lisp (“CL”). As always when working with bits, I first doodled in SLIME/SBCL: CL’s bit manipulation functions are more expressive than C’s, and a REPL helps exploration. I would not have thought […] Number of bits in a particular integer first appeared on John D. Cook.  ( 5 min )
  • Open

    Deep Learning-Based Assessment of Cerebral Microbleeds in COVID-19. (arXiv:2301.09322v1 [eess.IV])
    Cerebral Microbleeds (CMBs), typically captured as hypointensities from susceptibility-weighted imaging (SWI), are particularly important for the study of dementia, cerebrovascular disease, and normal aging. Recent studies on COVID-19 have shown an increase in CMBs in coronavirus cases. Automatic detection of CMBs is challenging due to the small size and number of CMBs, which makes the classes highly imbalanced, the lack of publicly available annotated data, and the similarity to CMB mimics such as calcifications, iron deposits, and veins. Hence, the existing deep learning methods are mostly trained on very limited research data, fail to generalize to unseen data with high variability, and cannot be used in clinical setups. To this end, we propose an efficient 3D deep learning framework that is actively trained on multi-domain data. Two public datasets assigned for normal aging, stroke, and Alzheimer's disease analysis as well as an in-house dataset for COVID-19 assessment are used to train and evaluate the models. The obtained results show that the proposed method is robust to low-resolution images and achieves 78% recall and 80% precision on the entire test set with an average of 1.6 false positives per scan.  ( 2 min )
    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning. (arXiv:2301.09350v1 [cs.CL])
    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated on an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Parallel Approaches to Accelerate Bayesian Decision Trees. (arXiv:2301.09090v1 [stat.CO])
    Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Existing work on Bayesian decision trees uses MCMC. Unfortunately, this can be slow, especially when considering large volumes of data. It is hard to parallelise the accept-reject component of the MCMC. Nonetheless, we propose two methods for exploiting parallelism in the MCMC: in the first, we replace the MCMC with another numerical Bayesian approach, the Sequential Monte Carlo (SMC) sampler, which has the appealing property that it is an inherently parallel algorithm; in the second, we consider data partitioning. Both methods use multi-core processing with a High-Performance Computing (HPC) resource. We test the two methods in various study settings to determine which method is the most beneficial for each test case. Experiments show that data partitioning has limited utility in the settings we consider and that the use of the SMC sampler can improve run-time (compared to the sequential implementation) by up to a factor of 343.  ( 2 min )
    Learning Reservoir Dynamics with Temporal Self-Modulation. (arXiv:2301.09235v1 [cs.LG])
    Reservoir computing (RC) can efficiently process time-series data by transferring the input signal to randomly connected recurrent neural networks (RNNs), which are referred to as a reservoir. The high-dimensional representation of time-series data in the reservoir significantly simplifies subsequent learning tasks. Although this simple architecture allows fast learning and facile physical implementation, the learning performance is inferior to that of other state-of-the-art RNN models. In this paper, to improve the learning ability of RC, we propose self-modulated RC (SM-RC), which extends RC by adding a self-modulation mechanism. The self-modulation mechanism is realized with two gating variables: an input gate and a reservoir gate. The input gate modulates the input signal, and the reservoir gate modulates the dynamical properties of the reservoir. We demonstrated that SM-RC can perform attention tasks where input information is retained or discarded depending on the input signal. We also found that a chaotic state emerged as a result of learning in SM-RC. This indicates that self-modulation mechanisms provide RC with qualitatively different information-processing capabilities. Furthermore, SM-RC outperformed RC in NARMA and Lorentz model tasks. In particular, SM-RC achieved a higher prediction accuracy than RC with a reservoir 10 times larger in the Lorentz model tasks. Because the SM-RC architecture only requires two additional gates, it is physically implementable as RC, providing a new direction for realizing edge AI.  ( 2 min )
    Dataset Distillation: A Comprehensive Review. (arXiv:2301.07014v2 [cs.LG] UPDATED)
    Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive experiments and envision possible directions for future works.  ( 2 min )
    Talking About Large Language Models. (arXiv:2212.03551v3 [cs.CL] UPDATED)
    Thanks to rapid progress in artificial intelligence, we have entered an era when technology and philosophy intersect in interesting ways. Sitting squarely at the centre of this intersection are large language models (LLMs). The more adept LLMs become at mimicking human language, the more vulnerable we become to anthropomorphism, to seeing the systems in which they are embedded as more human-like than they really are. This trend is amplified by the natural tendency to use philosophically loaded terms, such as "knows", "believes", and "thinks", when describing these systems. To mitigate this trend, this paper advocates the practice of repeatedly stepping back to remind ourselves of how LLMs, and the systems of which they form a part, actually work. The hope is that increased scientific precision will encourage more philosophical nuance in the discourse around artificial intelligence, both within the field and in the public sphere.  ( 2 min )
    Generative Adversarial Networks to infer velocity components in rotating turbulent flows. (arXiv:2301.07541v1 [physics.flu-dyn] CROSS LISTED)
    Inference problems for two-dimensional snapshots of rotating turbulent flows are studied. We perform a systematic quantitative benchmark of point-wise and statistical reconstruction capabilities of the linear Extended Proper Orthogonal Decomposition (EPOD) method, a non-linear Convolutional Neural Network (CNN) and a Generative Adversarial Network (GAN). We attack the important task of inferring one velocity component out of the measurement of a second one, and two cases are studied: (I) both components lie in the plane orthogonal to the rotation axis and (II) one of the two is parallel to the rotation axis. We show that the EPOD method works well only for the former case where both components are strongly correlated, while CNN and GAN always outperform EPOD both concerning point-wise and statistical reconstructions. For case (II), when the input and output data are weakly correlated, all methods fail to reconstruct faithfully the point-wise information. In this case, only GAN is able to reconstruct the field in a statistical sense. The analysis is performed using both standard validation tools based on L2 spatial distance between the prediction and the ground truth and more sophisticated multi-scale analysis using wavelet decomposition. Statistical validation is based on standard Jensen-Shannon divergence between the probability density functions, spectral properties and multi-scale flatness.  ( 2 min )
    Reconstructing Rayleigh-Benard flows out of temperature-only measurements using Physics-Informed Neural Networks. (arXiv:2301.07769v1 [physics.flu-dyn] CROSS LISTED)
    We investigate the capabilities of Physics-Informed Neural Networks (PINNs) to reconstruct turbulent Rayleigh-Benard flows using only temperature information. We perform a quantitative analysis of the quality of the reconstructions at various amounts of low-pass-filtered information and turbulent intensities. We compare our results with those obtained via nudging, a classical equation-informed data assimilation technique. At low Rayleigh numbers, PINNs are able to reconstruct with high precision, comparable to the one achieved with nudging. At high Rayleigh numbers, PINNs outperform nudging and are able to achieve satisfactory reconstruction of the velocity fields only when data for temperature is provided with high spatial and temporal density. When data becomes sparse, the performance of PINNs worsens, not only in a point-to-point error sense but also, and contrary to nudging, in a statistical sense, as can be seen in the probability density functions and energy spectra.  ( 2 min )
    Clustering Categorical Data: Soft Rounding k-modes. (arXiv:2210.09640v2 [cs.LG] UPDATED)
    Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.  ( 2 min )
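    For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of the classical k-modes iteration (assign each point to the nearest mode under Hamming distance, then recompute each mode attribute by attribute as the most frequent category). It is not the SoftModes variant the paper proposes, and the function names, the tie-breaking rule, and the toy data in the usage note are illustrative assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(points, init_modes, max_iters=20):
    """Classical k-modes sketch (hypothetical helper, not the paper's SoftModes).

    points     -- list of equal-length tuples of categorical values
    init_modes -- list of k initial modes (same shape as the points)
    Returns (modes, assignment), where assignment[i] is the cluster of points[i].
    """
    modes = [tuple(m) for m in init_modes]
    assignment = []
    for _ in range(max_iters):
        # Assignment step: nearest mode by Hamming distance
        # (assumption: ties go to the lowest cluster index).
        assignment = [
            min(range(len(modes)), key=lambda j: hamming(p, modes[j]))
            for p in points
        ]
        # Update step: per-attribute most frequent category in each cluster.
        new_modes = []
        for j, old_mode in enumerate(modes):
            members = [p for p, a in zip(points, assignment) if a == j]
            if not members:
                new_modes.append(old_mode)  # keep the mode of an empty cluster
                continue
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_modes == modes:  # converged: modes no longer change
            break
        modes = new_modes
    return modes, assignment
```

    As a usage illustration, calling `k_modes` on six three-attribute records with the first and fourth records as initial modes separates the two obvious groups. The hard "most frequent category" update is the step that a soft rounding variant would replace.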
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.
    Learning-Based Data Storage [Vision] (Technical Report). (arXiv:2206.05778v3 [cs.DB] UPDATED)
    Deep neural networks (DNNs) and their variants have been extensively used for a wide spectrum of real applications such as image classification, face/speech recognition, fraud detection, and so on. In addition to many important machine learning tasks, as artificial networks emulating the way brain cells function, DNNs also show the capability of storing non-linear relationships between input and output data, which suggests the potential of storing data via DNNs. We envision a new paradigm of data storage, "DNN-as-a-Database", where data are encoded in well-trained machine learning models. Compared with conventional data storage that directly records data in raw formats, learning-based structures (e.g., DNN) can implicitly encode data pairs of inputs and outputs and compute/materialize actual output data of different resolutions only if input data are provided. This new paradigm can greatly enhance data security by allowing flexible data privacy settings on different levels, achieve low space consumption and fast computation with the acceleration of new hardware (e.g., Diffractive Neural Network and AI chips), and can be generalized to distributed DNN-based storage/computing. In this paper, we propose this novel concept of learning-based data storage, which utilizes a learning structure called learning-based memory unit (LMU), to store, organize, and retrieve data. As a case study, we use DNNs as the engine in the LMU, and study the data capacity and accuracy of the DNN-based data storage. Our preliminary experimental results show the feasibility of the learning-based data storage by achieving high (100%) accuracy of the DNN storage. We explore and design effective solutions to utilize the DNN-based data storage to manage and query relational tables. We discuss how to generalize our solutions to other data types (e.g., graphs) and environments such as distributed DNN storage/computing.
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs, thus yielding an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1) We provide a theoretical analysis where we show that our framework yields multiply robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2) We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3) We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.
    Computationally-efficient initialisation of GPs: The generalised variogram method. (arXiv:2210.05394v2 [cs.LG] UPDATED)
    We present a computationally-efficient strategy to find the hyperparameters of a Gaussian process (GP) avoiding the computation of the likelihood function. The found hyperparameters can then be used directly for regression or passed as initial conditions to maximum-likelihood (ML) training. Motivated by the fact that training a GP via ML is equivalent (on average) to minimising the KL-divergence between the true and learnt model, we set out to explore different metrics/divergences among GPs that are computationally inexpensive and provide estimates close to those of ML. In particular, we identify the GP hyperparameters by projecting the empirical covariance or (Fourier) power spectrum onto a parametric family, thus proposing and studying various measures of discrepancy operating on the temporal or frequency domains. Our contribution extends the Variogram method developed in the geostatistics literature and, accordingly, it is referred to as the Generalised Variogram method (GVM). In addition to the theoretical presentation of GVM, we provide experimental validation in terms of accuracy, consistency with ML and computational complexity for different kernels using synthetic and real-world data.
    FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction. (arXiv:2212.01548v2 [cs.LG] UPDATED)
    Most cross-device federated learning (FL) studies focus on the model-homogeneous setting where the global server model and local client models are identical. However, such a constraint not only excludes low-end clients who would otherwise make unique contributions to model training but also restrains clients from training large models due to on-device resource bottlenecks. In this work, we propose FedRolex, a partial training (PT)-based approach that enables model-heterogeneous FL and can train a global server model larger than the largest client model. At its core, FedRolex employs a rolling sub-model extraction scheme that allows different parts of the global server model to be evenly trained, which mitigates the client drift induced by the inconsistency between individual client models and server model architectures. We show that FedRolex outperforms state-of-the-art PT-based model-heterogeneous FL methods (e.g. Federated Dropout) and reduces the gap between model-heterogeneous and model-homogeneous FL, especially under the large-model large-dataset regime. In addition, we provide theoretical statistical analysis on its advantage over Federated Dropout and evaluate FedRolex on an emulated real-world device distribution to show that FedRolex can enhance the inclusiveness of FL and boost the performance of low-end devices that would otherwise not benefit from FL. Our code is available at: https://github.com/AIoT-MLSys-Lab/FedRolex
    Incorporating Task-specific Concept Knowledge into Script Learning. (arXiv:2209.00068v2 [cs.CL] UPDATED)
    In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including preferences and history. To address this problem, we propose a novel approach, which uses two techniques to improve performance: (1) concept prompting, and (2) script-oriented contrastive learning that addresses step repetition and hallucination problems. On our WikiHow-based dataset, we find that both methods improve performance. The dataset, repository, and models will be publicly available to facilitate further research on this new task.
    Concept-level Debugging of Part-Prototype Networks. (arXiv:2205.15769v2 [cs.LG] UPDATED)
    Part-prototype Networks (ProtoPNets) are concept-based classifiers designed to achieve the same performance as black-box models without compromising transparency. ProtoPNets compute predictions based on similarity to class-specific part-prototypes learned to recognize parts of training examples, making it easy to faithfully determine what examples are responsible for any target prediction and why. However, like other models, they are prone to picking up confounders and shortcuts from the data, thus suffering from compromised prediction accuracy and limited generalization. We propose ProtoPDebug, an effective concept-level debugger for ProtoPNets in which a human supervisor, guided by the model's explanations, supplies feedback in the form of what part-prototypes must be forgotten or kept, and the model is fine-tuned to align with this supervision. Our experimental evaluation shows that ProtoPDebug outperforms state-of-the-art debuggers for a fraction of the annotation cost. An online experiment with laypeople confirms the simplicity of the feedback requested from the users and the effectiveness of the collected feedback for learning confounder-free part-prototypes. ProtoPDebug is a promising tool for trustworthy interactive learning in critical applications, as suggested by a preliminary evaluation on a medical decision making task.
    A Comprehensive Survey on Enterprise Financial Risk Analysis: Problems, Methods, Spotlights and Applications. (arXiv:2211.14997v2 [q-fin.RM] UPDATED)
    Enterprise financial risk analysis aims at predicting the enterprises' future financial risk. Due to its wide application, enterprise financial risk analysis has always been a core research issue in finance. Although there are already some valuable and impressive surveys on risk management, these surveys introduce approaches in a relatively isolated way and lack the recent advances in enterprise financial risk analysis. Due to the rapid expansion of enterprise financial risk analysis, especially from the computer science and big data perspective, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing enterprise financial risk research, as well as to summarize and interpret the mechanisms and the strategies of enterprise financial risk analysis in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. This paper provides a systematic literature review of over 300 articles published on enterprise risk analysis modelling over a 50-year period, 1968 to 2022. We first introduce the formal definition of enterprise risk as well as the related concepts. Then, we categorize the representative works in terms of risk type and summarize the three aspects of risk analysis. Finally, we compare the analysis methods used to model the enterprise financial risk. Our goal is to clarify current cutting-edge research and its possible future directions to model enterprise risk, aiming to fully understand the mechanisms of enterprise risk communication and influence and its application on corporate governance, financial institutions and government regulation.
    On Investigating the Conservative Property of Score-Based Generative Models. (arXiv:2209.12753v2 [cs.LG] UPDATED)
    Existing Score-based Generative Models (SGMs) can be categorized into constrained SGMs (CSGMs) or unconstrained SGMs (USGMs) according to their parameterization approaches. CSGMs model probability density functions as Boltzmann distributions, and assign their predictions as the negative gradients of some scalar-valued energy functions. On the other hand, USGMs employ flexible architectures capable of directly estimating scores without the need to explicitly model energy functions. In this paper, we demonstrate that the architectural constraints of CSGMs may limit their modeling ability. In addition, we show that USGMs' inability to preserve the property of conservativeness may lead to degraded sampling performance in practice. To address the above issues, we propose Quasi-Conservative Score-based Generative Models (QCSGMs) for keeping the advantages of both CSGMs and USGMs. Our theoretical derivations demonstrate that the training objective of QCSGMs can be efficiently integrated into the training processes by leveraging the Hutchinson trace estimator. In addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and SVHN datasets validate the effectiveness of QCSGMs. Finally, we justify the advantage of QCSGMs using an example of a one-layered autoencoder.
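    The abstract above relies on the Hutchinson trace estimator without describing it. The following is a minimal sketch of the generic estimator only, tr(A) ≈ mean of vᵀAv over random Rademacher (±1) vectors v, which needs matrix-vector products rather than the full matrix. It is not the QCSGM training objective, and the function names and example matrices are illustrative assumptions.

```python
import random


def make_matvec(matrix):
    """Wrap a dense list-of-lists matrix as a matrix-vector product (illustrative helper)."""
    def matvec(v):
        return [sum(a * x for a, x in zip(row, v)) for row in matrix]
    return matvec


def hutchinson_trace(matvec, dim, num_samples=100, seed=0):
    """Estimate tr(A) given only v -> A v, using Rademacher probe vectors.

    E[v^T A v] = tr(A) when the entries of v are independent with zero mean
    and unit variance; Rademacher entries (+1 or -1) satisfy this.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))  # v^T (A v)
    return total / num_samples
```

    For a diagonal matrix every probe gives the exact trace, because each squared entry of v equals 1. For a general matrix the off-diagonal terms only cancel on average, so the estimate carries sampling noise that shrinks as `num_samples` grows.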
    Learning from Long-Tailed Noisy Data with Sample Selection and Balanced Loss. (arXiv:2211.10906v2 [cs.LG] UPDATED)
    The success of deep learning depends on large-scale and well-curated training data, while data in real-world applications are commonly long-tailed and noisy. Many methods have been proposed to deal with long-tailed data or noisy data, while few methods have been developed to tackle long-tailed noisy data. To solve this, we propose a robust method for learning from long-tailed noisy data with sample selection and balanced loss. Specifically, we separate the noisy training data into a clean labeled set and an unlabeled set with sample selection, and train the deep neural network in a semi-supervised manner with a balanced loss based on model bias. Extensive experiments on benchmarks demonstrate that our method outperforms existing state-of-the-art methods.
    FairGBM: Gradient Boosting with Fairness Constraints. (arXiv:2209.07850v3 [cs.LG] UPDATED)
    Tabular data is prevalent in many high-stakes domains, such as financial services or public policy. Gradient boosted decision trees (GBDT) are popular in these settings due to performance guarantees and low cost. However, in consequential decision-making, fairness is a foremost concern. Despite GBDT's popularity, existing in-processing Fair ML methods are either inapplicable to GBDT, or incur significant training-time overhead, or are inadequate for problems with high class imbalance -- a typical issue in these domains. We present FairGBM, a dual ascent learning framework for training GBDT under fairness constraints, with little to no impact on predictive performance when compared to unconstrained GBDT. Since observational fairness metrics are non-differentiable, we have to employ a "proxy-Lagrangian" formulation using smooth convex error rate proxies to enable gradient-based optimization. Our implementation shows an order of magnitude speedup in training time when compared with related work, a pivotal aspect to foster the widespread adoption of FairGBM by real-world practitioners.
    STaSy: Score-based Tabular data Synthesis. (arXiv:2210.04018v2 [cs.LG] UPDATED)
    Tabular data synthesis is a long-standing research topic in machine learning. Many different methods have been proposed over the past decades, ranging from statistical methods to deep generative methods. However, these methods have not always been successful due to the complicated nature of real-world tabular data. In this paper, we present a new model named Score-based Tabular data Synthesis (STaSy) and its training strategy based on the paradigm of score-based generative modeling. Despite the fact that score-based generative models have resolved many issues in generative models, there still exists room for improvement in tabular data synthesis. Our proposed training strategy includes a self-paced learning technique and a fine-tuning strategy, which further increases the sampling quality and diversity by stabilizing the denoising score matching training. Furthermore, we also conduct rigorous experimental studies in terms of the generative task trilemma: sampling quality, diversity, and time. In our experiments with 15 benchmark tabular datasets and 7 baselines, our method outperforms existing methods in terms of task-dependent evaluations and diversity.
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity of low-sample tabular data to synthetically generated data with a larger number of samples than the original. This process is also known as data augmentation. However, significance levels obtained from non-parametric tests are suspect when the sample size is small. Our method uses a combination of geometry, topology, and robust statistics for hypothesis testing in order to assess the validity of generated data. We also compare the results with common global metric methods available in the literature for large-sample-size data.
    On the power of foundation models. (arXiv:2211.16327v2 [cs.AI] UPDATED)
    With infinitely many high-quality data points, infinite computational power, an infinitely large foundation model with a perfect training algorithm and guaranteed zero generalization error on the pretext task, can the model be used for everything? This question cannot be answered by the existing theory of representation, optimization or generalization, because the issues they mainly investigate are assumed to be nonexistent here. In this paper, we show that category theory provides powerful machinery to answer this question. We have proved three results. The first one limits the power of prompt-based learning, saying that the model can solve a downstream task with prompts if and only if the task is representable. The second one says fine tuning does not have this limit, as a foundation model with the minimum required power (up to symmetry) can theoretically solve downstream tasks with fine tuning and enough resources. Our final result can be seen as a new type of generalization theorem, showing that the foundation model can generate unseen objects from the target category (e.g., images) using the structural information from the source category (e.g., texts). Along the way, we provide a categorical framework for supervised and self-supervised learning, which might be of independent interest.
    SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. (arXiv:2210.05861v2 [cs.CV] UPDATED)
    Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
    Shortest Path Networks for Graph Property Prediction. (arXiv:2206.01003v4 [cs.LG] UPDATED)
    Most graph neural network models rely on a particular message passing paradigm, where the idea is to iteratively propagate node representations of a graph to each node in the direct neighborhood. While very prominent, this paradigm leads to information propagation bottlenecks, as information is repeatedly compressed at intermediary node representations, which causes loss of information, making it practically impossible to gather meaningful signals from distant nodes. To address this, we propose shortest path message passing neural networks, where the node representations of a graph are propagated to each node in the shortest path neighborhoods. In this setting, nodes can directly communicate with each other even if they are not neighbors, breaking the information bottleneck and hence leading to more adequately learned representations. Our framework generalizes message passing neural networks, resulting in a class of more expressive models, including some recent state-of-the-art models. We verify the capacity of a basic model of this framework on dedicated synthetic experiments, and on real-world graph classification and regression benchmarks, and obtain state-of-the-art results.
    Optimism in Face of a Context: Regret Guarantees for Stochastic Contextual MDP. (arXiv:2207.11126v2 [cs.LG] UPDATED)
    We present regret minimization algorithms for stochastic contextual MDPs under a minimum reachability assumption, using access to an offline least squares regression oracle. We analyze three different settings: where the dynamics is known, where the dynamics is unknown but independent of the context, and the most challenging setting where the dynamics is unknown and context-dependent. For the last, our algorithm obtains a regret bound of $\widetilde{O}( (H+{1}/{p_{min}})H|S|^{3/2}\sqrt{|A|T\log(\max\{|\mathcal{G}|,|\mathcal{P}|\}/\delta)})$ with probability $1-\delta$, where $\mathcal{P}$ and $\mathcal{G}$ are finite and realizable function classes used to approximate the dynamics and rewards respectively, $p_{min}$ is the minimum reachability parameter, $S$ is the set of states, $A$ the set of actions, $H$ the horizon, and $T$ the number of episodes. To our knowledge, our approach is the first optimistic approach applied to contextual MDPs with general function approximation (i.e., without additional knowledge regarding the function class, such as it being linear). We present a lower bound of $\Omega(\sqrt{T H |S| |A| \ln(|\mathcal{G}|)/\ln(|A|)})$ on the expected regret, which holds even in the case of known dynamics. Lastly, we discuss an extension of our results to CMDPs without minimum reachability that obtains $\widetilde{O}(T^{3/4})$ regret.
    DIVISION: Memory Efficient Training via Dual Activation Precision. (arXiv:2208.04187v3 [cs.LG] UPDATED)
    Existing work on activation compressed training (ACT) relies on searching for optimal bit-width during DNN training to reduce the quantization noise, which makes the procedure complicated and less transparent. To this end, we propose a simple and effective method to compress DNN training. Our method is motivated by an instructive observation: DNN backward propagation mainly utilizes the low-frequency component (LFC) of the activation maps, while the majority of memory is for caching the high-frequency component (HFC) during the training. This indicates the HFC of activation maps is highly redundant and compressible during DNN training, which inspires our proposed Dual Activation Precision (DIVISION). During the training, DIVISION preserves the high-precision copy of LFC and compresses the HFC into a light-weight copy with low numerical precision. This can significantly reduce the memory cost without negatively affecting the precision of backward propagation such that DIVISION maintains competitive model accuracy. Experimental results show DIVISION achieves over 10x compression of activation maps, and significantly higher training throughput than state-of-the-art ACT methods, without loss of model accuracy.
    Explainable Image Quality Assessments in Teledermatological Photography. (arXiv:2209.04699v2 [cs.CV] UPDATED)
    Image quality is a crucial factor in the effectiveness and efficiency of teledermatological consultations. However, up to 50% of images sent by patients have quality issues, thus increasing the time to diagnosis and treatment. An automated, easily deployable, explainable method for assessing image quality is necessary to improve the current teledermatological consultation flow. We introduce ImageQX, a convolutional neural network for image quality assessment with a learning mechanism for identifying the most common poor image quality explanations: bad framing, bad lighting, blur, low resolution, and distance issues. ImageQX was trained on 26,635 photographs and validated on 9,874 photographs, each annotated with image quality labels and poor image quality explanations by up to 12 board-certified dermatologists. The photographic images were taken between 2017 and 2019 using a mobile skin disease tracking application accessible worldwide. Our method achieves expert-level performance for both image quality assessment and poor image quality explanation. For image quality assessment, ImageQX obtains a macro F1-score of 0.73 +- 0.01, which places it within standard deviation of the pairwise inter-rater F1-score of 0.77 +- 0.07. For poor image quality explanations, our method obtains F1-scores of between 0.37 +- 0.01 and 0.70 +- 0.01, similar to the inter-rater pairwise F1-score of between 0.24 +- 0.15 and 0.83 +- 0.06. Moreover, with a size of only 15 MB, ImageQX is easily deployable on mobile devices. With an image quality detection performance similar to that of dermatologists, incorporating ImageQX into the teledermatology flow can enable a better, faster flow for remote consultations.
    Pseudo-Hamiltonian Neural Networks with State-Dependent External Forces. (arXiv:2206.02660v4 [cs.LG] UPDATED)
    Hybrid machine learning based on Hamiltonian formulations has recently been successfully demonstrated for simple mechanical systems, both energy conserving and not energy conserving. We introduce a pseudo-Hamiltonian formulation that is a generalization of the Hamiltonian formulation via the port-Hamiltonian formulation, and show that pseudo-Hamiltonian neural network models can be used to learn external forces acting on a system. We argue that this property is particularly useful when the external forces are state dependent, in which case it is the pseudo-Hamiltonian structure that facilitates the separation of internal and external forces. Numerical results are provided for a forced and damped mass-spring system and a tank system of higher complexity, and a symmetric fourth-order integration scheme is introduced for improved training on sparse and noisy data.
    Overfitting in quantum machine learning and entangling dropout. (arXiv:2205.11446v2 [quant-ph] UPDATED)
    The ultimate goal in machine learning is to construct a model function that generalizes to unseen data, based on a given training dataset. If the model function has too much expressive power, it may overfit to the training data and as a result lose its generalization capability. To avoid such overfitting, several techniques have been developed in the classical machine learning regime, and dropout is one such effective method. This paper proposes a straightforward analogue of this technique in the quantum machine learning regime, the entangling dropout, meaning that some entangling gates in a given parametrized quantum circuit are randomly removed during the training process to reduce the expressibility of the circuit. Some simple case studies are given to show that this technique actually suppresses overfitting.
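The gate-removal idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the circuit is modeled as a plain list of tuples, and the gate names in `ENTANGLING`, the function `entangling_dropout`, and the parameter `drop_prob` are all hypothetical names chosen for illustration.

```python
import random

# Hypothetical gate names treated as entangling (an assumption for this sketch).
ENTANGLING = {"cx", "cz"}

def entangling_dropout(circuit, drop_prob, rng):
    """Return a copy of the circuit in which each entangling gate
    is removed independently with probability drop_prob."""
    kept = []
    for gate in circuit:
        name = gate[0]
        if name in ENTANGLING and rng.random() < drop_prob:
            continue  # drop this entangling gate for the current training step
        kept.append(gate)  # single-qubit gates are always kept
    return kept

# A toy circuit: (gate name, qubit indices...)
circuit = [("ry", 0), ("ry", 1), ("cx", 0, 1), ("ry", 0), ("cz", 1, 2), ("ry", 2)]
rng = random.Random(0)
thinned = entangling_dropout(circuit, drop_prob=0.5, rng=rng)
```

In a training loop one would presumably draw a fresh thinned circuit at each step and evaluate the full circuit at test time, mirroring classical dropout; that schedule is an assumption here, not something the abstract specifies.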
    On A Mallows-type Model For (Ranked) Choices. (arXiv:2207.01783v2 [cs.LG] UPDATED)
    We consider a preference learning setting where every participant chooses an ordered list of $k$ most preferred items among a displayed set of candidates. (The set can be different for every participant.) We identify a distance-based ranking model for the population's preferences and their (ranked) choice behavior. The ranking model resembles the Mallows model but uses a new distance function called Reverse Major Index (RMJ). We find that despite the need to sum over all permutations, the RMJ-based ranking distribution aggregates into (ranked) choice probabilities with a simple closed-form expression. We develop effective methods to estimate the model parameters and showcase their generalization power using real data, especially when there is a limited variety of display sets.
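To make the "sum over all permutations" concrete, the sketch below computes ranked-choice probabilities by brute force under a classical Mallows model. It uses the Kendall tau distance as a stand-in, not the paper's RMJ distance (whose definition is not given in the abstract), and all function names and the parameter `theta` are hypothetical.

```python
import itertools
import math

def kendall_tau(perm, center):
    """Number of item pairs ordered differently in perm and center."""
    pos = {item: i for i, item in enumerate(center)}
    ranks = [pos[item] for item in perm]
    n = len(ranks)
    return sum(1 for i in range(n) for j in range(i + 1, n) if ranks[i] > ranks[j])

def mallows_distribution(center, theta):
    """P(perm) proportional to exp(-theta * d(perm, center)), by enumeration."""
    perms = list(itertools.permutations(center))
    weights = [math.exp(-theta * kendall_tau(p, center)) for p in perms]
    z = sum(weights)
    return {p: w / z for p, w in zip(perms, weights)}

def ranked_choice_probability(dist, display_set, ranked_choice):
    """Probability that the top-k displayed items, in order, equal ranked_choice."""
    k = len(ranked_choice)
    total = 0.0
    for perm, prob in dist.items():
        shown = [item for item in perm if item in display_set]
        if tuple(shown[:k]) == tuple(ranked_choice):
            total += prob
    return total

dist = mallows_distribution(("a", "b", "c", "d"), theta=1.0)
p_ac = ranked_choice_probability(dist, {"a", "c", "d"}, ("a", "c"))
```

The enumeration costs n! terms, which is what a closed-form expression for the choice probabilities would avoid.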
    Predicting highway lane-changing maneuvers: A benchmark analysis of machine and ensemble learning algorithms. (arXiv:2204.10807v3 [cs.LG] UPDATED)
    Understanding and predicting highway lane-change maneuvers is essential for driving modeling and its automation. The development of data-based lane-changing decision-making algorithms is currently expanding rapidly. In this article, we empirically compare different machine and ensemble learning classification techniques to the MOBIL rule-based model using trajectory data of European two-lane highways. The analysis relies on instantaneous measurements of up to twenty-four spatial-temporal variables with the four neighboring vehicles on current and adjacent lanes. Preliminary descriptive investigations by principal component and logistic analyses allow identifying the main variables that lead a driver to change lanes. We predict two types of discretionary lane-change maneuvers: overtaking (from the slow to the fast lane) and fold-down (from the fast to the slow lane). The prediction accuracy is quantified using total, lane-changing and lane-keeping errors and associated receiver operating characteristic curves. The benchmark analysis includes logistic model, linear discriminant, decision tree, naïve Bayes classifier, support vector machine, neural network machine learning algorithms, and up to ten bagging and stacking ensemble learning meta-heuristics. While the rule-based model provides limited prediction accuracy, the data-based algorithms, devoid of modeling bias, allow significant prediction improvements. Cross validations show that selected neural networks and stacking algorithms allow predicting from a single observation both fold-down and overtaking maneuvers up to four seconds in advance with high accuracy.
    EvenNet: Ignoring Odd-Hop Neighbors Improves Robustness of Graph Neural Networks. (arXiv:2205.13892v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have received extensive research attention for their promising performance in graph machine learning. Despite their extraordinary predictive accuracy, existing approaches, such as GCN and GPRGNN, are not robust in the face of homophily changes on test graphs, rendering these models vulnerable to graph structural attacks and with limited capacity in generalizing to graphs of varied homophily levels. Although many methods have been proposed to improve the robustness of GNN models, most of these techniques are restricted to the spatial domain and employ complicated defense mechanisms, such as learning new graph structures or calculating edge attentions. In this paper, we study the problem of designing simple and robust GNN models in the spectral domain. We propose EvenNet, a spectral GNN corresponding to an even-polynomial graph filter. Based on our theoretical analysis in both spatial and spectral domains, we demonstrate that EvenNet outperforms full-order models in generalizing across homophilic and heterophilic graphs, implying that ignoring odd-hop neighbors improves the robustness of GNNs. We conduct experiments on both synthetic and real-world datasets to demonstrate the effectiveness of EvenNet. Notably, EvenNet outperforms existing defense models against structural attacks without introducing additional computational costs and maintains competitiveness in traditional node classification tasks on homophilic and heterophilic graphs.
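An even-polynomial graph filter of the kind named in this abstract can be sketched as follows. The sketch assumes a symmetrically normalized adjacency matrix without self-loops as the propagation matrix and uses fixed, hypothetical coefficients in `weights`; the paper's actual parameterization may differ.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (no self-loops added)."""
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg, dtype=float)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def even_filter(adj, features, weights):
    """Compute sum_k weights[k] * P^(2k) @ features, so only even hops contribute."""
    p = normalized_adjacency(np.asarray(adj, dtype=float))
    p2 = p @ p
    term = np.asarray(features, dtype=float)
    out = np.zeros_like(term)
    for w in weights:
        out += w * term
        term = p2 @ term  # advance by two hops at a time
    return out

# Path graph 0 - 1 - 2: node 1 is an odd-hop neighbor of nodes 0 and 2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = np.array([[1.0], [10.0], [100.0]])
y = even_filter(adj, x, weights=[0.5, 0.3, 0.2])
```

On this path graph the outputs at nodes 0 and 2 do not depend on the feature of node 1, which is the "ignoring odd-hop neighbors" behavior in miniature.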
    Contrastive Learning for Unsupervised Domain Adaptation of Time Series. (arXiv:2206.06243v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims at learning a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts. In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via a custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representation for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.
    An Explainable-AI approach for Diagnosis of COVID-19 using MALDI-ToF Mass Spectrometry. (arXiv:2109.14099v2 [cs.LG] UPDATED)
    The severe acute respiratory syndrome coronavirus type-2 (SARS-CoV-2) caused a global pandemic and imposed immense effects on the global economy. Accurate, cost-effective, and quick tests have proven essential for identifying infected people and mitigating the spread. Recently, multiple alternative platforms for testing coronavirus disease 2019 (COVID-19) have been published that show high agreement with current gold standard real-time polymerase chain reaction (RT-PCR) results. These new methods do away with nasopharyngeal (NP) swabs, eliminate the need for complicated reagents, and reduce the burden on RT-PCR test reagent supply. In the present work, we have designed an artificial intelligence-based (AI) testing method to provide confidence in the results. Current AI applications to COVID-19 studies often lack a biological foundation in the decision-making process, and our AI approach is one of the earliest to leverage explainable-AI (X-AI) algorithms for COVID-19 diagnosis using mass spectrometry. Here, we have employed X-AI to explain the decision-making process on a local (per-sample) and global (all samples) basis underscored by biologically relevant features. We evaluated our technique with data extracted from human gargle samples and achieved a testing accuracy of 94.44%. Such techniques would strengthen the relationship between AI and clinical diagnostics by providing biomedical researchers and healthcare workers with trustworthy and, most importantly, explainable test results.
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items having Gaussian reward distributions is well understood when the variances are known. Despite its practical relevance for many applications, few works have studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.
    A Survey on Distributed Online Optimization and Game. (arXiv:2205.00473v2 [cs.LG] UPDATED)
    Distributed online optimization and games have been increasingly researched in the last decade, mostly motivated by their wide applications in sensor networks, robotics (e.g., distributed target tracking and formation control), smart grids, deep learning, and so forth. In these problems, there is a network of agents who may be cooperative (i.e., distributed online optimization) or noncooperative (i.e., online game) through local information exchanges. The local cost function of each agent is often time-varying in dynamic and even adversarial environments. At each time, a decision must be made by each agent based on historical information at hand without knowing future information on cost functions. For these problems, a comprehensive survey is still lacking. This paper aims to provide a thorough overview of distributed online optimization and games from the perspective of problem settings, communication, computation, algorithms, and performances. In addition, some potential future directions are also discussed.
    Stochastic Second-Order Methods Improve Best-Known Sample Complexity of SGD for Gradient-Dominated Function. (arXiv:2205.12856v2 [cs.LG] UPDATED)
    We study the performance of Stochastic Cubic Regularized Newton (SCRN) on a class of functions satisfying the gradient dominance property with $1\le\alpha\le2$, which holds in a wide range of applications in machine learning and signal processing. This condition ensures that any first-order stationary point is a global optimum. We prove that the total sample complexity of SCRN in achieving $\epsilon$-global optimum is $\mathcal{O}(\epsilon^{-7/(2\alpha)+1})$ for $1\le\alpha< 3/2$ and $\tilde{\mathcal{O}}(\epsilon^{-2/\alpha})$ for $3/2\le\alpha\le 2$. SCRN improves the best-known sample complexity of stochastic gradient descent. Even under a weak version of the gradient dominance property, which is applicable to policy-based reinforcement learning (RL), SCRN achieves the same improvement over stochastic policy gradient methods. Additionally, we show that the average sample complexity of SCRN can be reduced to ${\mathcal{O}}(\epsilon^{-2})$ for $\alpha=1$ using a variance reduction method with time-varying batch sizes. Experimental results in various RL settings showcase the remarkable performance of SCRN compared to first-order methods.
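As a reading aid, the block below writes out the quantities this abstract refers to. The gradient dominance condition and the cubic-regularized step are given in their commonly used forms, which is an assumption about the paper's exact definitions; the symbols $\tau_f$, $M$, $g_t$, and $H_t$ are not in the abstract and are introduced here for illustration. The exponent checks follow directly from the two stated rates.

```latex
% Gradient dominance with exponent \alpha (common form; \tau_f > 0 is an assumed constant):
f(x) - \min_{x'} f(x') \;\le\; \tau_f \,\lVert \nabla f(x) \rVert^{\alpha},
\qquad 1 \le \alpha \le 2 .

% Cubic-regularized Newton step with stochastic gradient g_t and Hessian H_t
% (standard form; M > 0 is an assumed regularization constant):
\Delta_t \in \arg\min_{\Delta}\;
  \langle g_t, \Delta \rangle
  + \tfrac{1}{2} \langle \Delta, H_t \Delta \rangle
  + \tfrac{M}{6} \lVert \Delta \rVert^{3},
\qquad x_{t+1} = x_t + \Delta_t .

% Exponent checks for the stated sample complexities:
\alpha = 1:\quad   \epsilon^{-7/(2\alpha)+1} = \epsilon^{-5/2}, \\
\alpha = 3/2:\quad \epsilon^{-7/(2\alpha)+1} = \epsilon^{-7/3+1} = \epsilon^{-4/3}
                   \;=\; \epsilon^{-2/\alpha}, \\
\alpha = 2:\quad   \epsilon^{-2/\alpha} = \epsilon^{-1} .
```

The two rates agree at $\alpha = 3/2$, so the stated complexity is continuous (up to logarithmic factors) across the two regimes.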
    Sustaining Fairness via Incremental Learning. (arXiv:2208.12212v2 [cs.LG] UPDATED)
    Machine learning systems are often deployed for making critical decisions like credit lending, hiring, etc. While making decisions, such systems often encode the user's demographic information (like gender, age) in their intermediate representations. This can lead to decisions that are biased towards specific demographics. Prior work has focused on debiasing intermediate representations to ensure fair decisions. However, these approaches fail to remain fair with changes in the task or demographic distribution. To ensure fairness in the wild, it is important for a system to adapt to such changes as it accesses new data in an incremental fashion. In this work, we propose to address this issue by introducing the problem of learning fair representations in an incremental learning setting. To this end, we present Fairness-aware Incremental Representation Learning (FaIRL), a representation learning system that can sustain fairness while incrementally learning new tasks. FaIRL is able to achieve fairness and learn new tasks by controlling the rate-distortion function of the learned representations. Our empirical evaluations show that FaIRL is able to make fair decisions while achieving high performance on the target task, outperforming several baselines.
    Generating Diverse Teammates to Train Robust Agents For Ad Hoc Teamwork. (arXiv:2207.14138v2 [cs.LG] UPDATED)
    Ad hoc teamwork (AHT) is the challenge of designing a learner that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. However, optimising information-theoretic diversity metrics may generate teammates with superficially different behaviours, which does not necessarily result in a robust learner that can effectively collaborate with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
    Do Gradient Inversion Attacks Make Federated Learning Unsafe?. (arXiv:2202.06924v2 [cs.LG] UPDATED)
    Federated learning (FL) allows the collaborative training of AI models without needing to share raw data. This capability makes it especially interesting for healthcare applications where patient and data privacy is of utmost concern. However, recent works on the inversion of deep neural networks from model gradients raised concerns about the security of FL in preventing the leakage of training data. In this work, we show that these attacks presented in the literature are impractical in FL use-cases where the clients' training involves updating the Batch Normalization (BN) statistics and provide a new baseline attack that works for such scenarios. Furthermore, we present new ways to measure and visualize potential data leakage in FL. Our work is a step towards establishing reproducible methods of measuring data leakage in FL and could help determine the optimal tradeoffs between privacy-preserving techniques, such as differential privacy, and model accuracy based on quantifiable metrics. Code is available at https://nvidia.github.io/NVFlare/research/quantifying-data-leakage.
    GANs and Closures: Micro-Macro Consistency in Multiscale Modeling. (arXiv:2208.10715v2 [cs.LG] UPDATED)
    Sampling the phase space of molecular systems -- and, more generally, of complex systems effectively modeled by stochastic differential equations -- is a crucial modeling step in many fields, from protein folding to materials discovery. These problems are often multiscale in nature: they can be described in terms of low-dimensional effective free energy surfaces parametrized by a small number of "slow" reaction coordinates; the remaining "fast" degrees of freedom populate an equilibrium measure on the reaction coordinate values. Sampling procedures for such problems are used to estimate effective free energy differences as well as ensemble averages with respect to the conditional equilibrium distributions; these latter averages lead to closures for effective reduced dynamic models. Over the years, enhanced sampling techniques coupled with molecular simulation have been developed. An intriguing analogy arises with the field of Machine Learning (ML), where Generative Adversarial Networks can produce high dimensional samples from low dimensional probability distributions. This sample generation returns plausible high dimensional space realizations of a model state, from information about its low-dimensional representation. In this work, we present an approach that couples physics-based simulations and biasing methods for sampling conditional distributions with ML-based conditional generative adversarial networks for the same task. The "coarse descriptors" on which we condition the fine scale realizations can either be known a priori, or learned through nonlinear dimensionality reduction. We suggest that this may bring out the best features of both approaches: we demonstrate that a framework that couples cGANs with physics-based enhanced sampling techniques can improve multiscale SDE dynamical systems sampling, and even shows promise for systems of increasing complexity.
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.
    Analyzing Data-Centric Properties for Graph Contrastive Learning. (arXiv:2208.02810v3 [cs.LG] UPDATED)
    Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    Neal and Hinton (1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and Hölder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number, its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure that is compatible with this upper probability displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection between divergence risk measures and rearrangement invariant Banach norms.
    Predictive Model for Gross Community Production Rate of Coral Reefs using Ensemble Learning Methodologies. (arXiv:2111.04003v2 [cs.LG] UPDATED)
    Coral reefs play a vital role in maintaining the ecological balance of the marine ecosystem. Various marine organisms depend on coral reefs for their existence and their natural processes. Coral reefs provide the necessary habitat for reproduction and growth for various exotic species of the marine ecosystem. In this article, we discuss the most important parameters that influence the lifecycle of coral and coral reefs, such as ocean acidification, deoxygenation, and physical parameters such as flow rate and surface area. Ocean acidification depends on the amount of dissolved carbon dioxide (CO2). This is due to the release of H+ ions upon the reaction of the dissolved CO2 gases with the calcium carbonate compounds in the ocean. Deoxygenation is another problem; it leads to hypoxia, which is characterized by less dissolved oxygen in the water than marine organisms need to survive. We highlight the importance of physical parameters such as flow rate, which influence gas exchange, heat dissipation, bleaching sensitivity, nutrient supply, feeding, waste and sediment removal, growth, and reproduction. We bring these parameters together and propose an ensemble machine learning-based model that analyzes them and provides better rates, which can help us understand and suitably improve the ocean composition and, in turn, the sustainability of the marine ecosystem, mainly the coral reefs.
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.
    Adversarial Examples in Random Neural Networks with General Activations. (arXiv:2203.17209v2 [cs.LG] UPDATED)
    A substantial body of empirical work documents the lack of robustness in deep learning models to adversarial examples. Recent theoretical work proved that adversarial examples are ubiquitous in two-layer networks with sub-exponential width and ReLU or smooth activations, and multi-layer ReLU networks with sub-exponential width. We present a result of the same type, with no restriction on width and for general locally Lipschitz continuous activations. More precisely, given a neural network $f(\,\cdot\,;{\boldsymbol \theta})$ with random weights ${\boldsymbol \theta}$, and feature vector ${\boldsymbol x}$, we show that an adversarial example ${\boldsymbol x}'$ can be found with high probability along the direction of the gradient $\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$. Our proof is based on a Gaussian conditioning technique. Instead of proving that $f$ is approximately linear in a neighborhood of ${\boldsymbol x}$, we characterize the joint distribution of $f({\boldsymbol x};{\boldsymbol \theta})$ and $f({\boldsymbol x}';{\boldsymbol \theta})$ for ${\boldsymbol x}' = {\boldsymbol x}-s({\boldsymbol x})\nabla_{{\boldsymbol x}}f({\boldsymbol x};{\boldsymbol \theta})$.
    MetaQA: Combining Expert Agents for Multi-Skill Question Answering. (arXiv:2112.01922v3 [cs.CL] UPDATED)
    The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems on https://github.com/UKPLab/MetaQA.
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. To counteract these problems, we introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model: empirical results show that it achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.
    Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering. (arXiv:2204.04581v3 [cs.CL] UPDATED)
    Retrieval-augmented language models have recently become the standard for knowledge-intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory to encode an index of knowledge for the model to retrieve over. Most prior work has employed text passages as the unit of knowledge, which has high coverage at the cost of interpretability, controllability, and efficiency. The opposite properties arise in other methods which have instead relied on knowledge base (KB) facts. At the same time, more recent work has demonstrated the effectiveness of storing and retrieving from an index of Q-A pairs derived from text (Lewis et al., 2021). This approach yields a high-coverage knowledge representation that maintains KB-like properties because its representations are more atomic units of information. In this work we push this line of research further by proposing a question-answer augmented encoder-decoder model and accompanying pretraining strategy. This yields an end-to-end system that not only outperforms prior QA retrieval methods on single-hop QA tasks but also enables compositional reasoning, as demonstrated by strong performance on two multi-hop QA datasets. Together, these methods improve the ability to interpret and control the model while narrowing the performance gap with passage retrieval systems.
    Learning Regionally Decentralized AC Optimal Power Flows with ADMM. (arXiv:2205.03787v3 [eess.SY] UPDATED)
    One potential future for the next generation of smart grids is the use of decentralized optimization algorithms and secured communications for coordinating renewable generation (e.g., wind/solar), dispatchable devices (e.g., coal/gas/nuclear generations), demand response, battery & storage facilities, and topology optimization. The Alternating Direction Method of Multipliers (ADMM) has been widely used in the community to address such decentralized optimization problems and, in particular, the AC Optimal Power Flow (AC-OPF). This paper studies how machine learning may help in speeding up the convergence of ADMM for solving AC-OPF. It proposes a novel decentralized machine-learning approach, namely ML-ADMM, where each agent uses deep learning to learn the consensus parameters on the coupling branches. The paper also explores the idea of learning only from ADMM runs that exhibit high-quality convergence properties, and proposes filtering mechanisms to select these runs. Experimental results on test cases based on the French system demonstrate the potential of the approach in speeding up the convergence of ADMM significantly.
    Short Blocklength Wiretap Channel Codes via Deep Learning: Design and Performance Evaluation. (arXiv:2206.03477v2 [cs.IT] UPDATED)
    We design short blocklength codes for the Gaussian wiretap channel under information-theoretic security guarantees. Our approach consists in decoupling the reliability and secrecy constraints in our code design. Specifically, we handle the reliability constraint via an autoencoder, and handle the secrecy constraint with hash functions. For blocklengths smaller than or equal to 128, we evaluate through simulations the probability of error at the legitimate receiver and the leakage at the eavesdropper for our code construction. This leakage is defined as the mutual information between the confidential message and the eavesdropper's channel observations, and is empirically measured via a neural network-based mutual information estimator. Our simulation results provide examples of codes with positive secrecy rates that outperform the best known achievable secrecy rates obtained non-constructively for the Gaussian wiretap channel. Additionally, we show that our code design is suitable for the compound and arbitrarily varying Gaussian wiretap channels, for which the channel statistics are not perfectly known but only known to belong to a pre-specified uncertainty set. These models not only capture uncertainty related to channel statistics estimation, but also scenarios where the eavesdropper jams the legitimate transmission or influences its own channel statistics by changing its location.
    Improving Spectral Clustering Using Spectrum-Preserving Node Aggregation. (arXiv:2110.12328v6 [cs.LG] UPDATED)
    Spectral clustering is one of the most popular clustering methods. However, the high computational cost of the eigen-decomposition procedure it involves can hinder its application to large-scale tasks. In this paper we use spectrum-preserving node reduction to accelerate eigen-decomposition and generate concise representations of data sets. Specifically, we create a small number of pseudo-nodes based on spectral similarity. Then, a standard spectral clustering algorithm is performed on the smaller node set. Finally, each data point in the original data set is assigned to the same cluster as its representative pseudo-node. The proposed framework runs in nearly-linear time. Meanwhile, the clustering accuracy can be significantly improved by mining concise representations. The experimental results show dramatically improved clustering performance when compared with state-of-the-art methods.
    Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback. (arXiv:2201.13172v2 [cs.LG] UPDATED)
    The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. However, in practice feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d^k$, where the delay $d^k$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^K d^k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
    Explainable deep learning for insights in El Niño and river flows. (arXiv:2201.02596v3 [physics.ao-ph] UPDATED)
    The El Niño Southern Oscillation (ENSO) is a semi-periodic fluctuation in sea surface temperature (SST) over the tropical central and eastern Pacific Ocean that influences interannual variability in regional hydrology across the world through long-range dependence or teleconnections. Recent research has demonstrated the value of Deep Learning (DL) methods for improving ENSO prediction as well as Complex Networks (CN) for understanding teleconnections. However, gaps in predictive understanding of ENSO-driven river flows include the black box nature of DL, the use of simple ENSO indices to describe a complex phenomenon and translating DL-based ENSO predictions to river flow predictions. Here we show that eXplainable DL (XDL) methods, based on saliency maps, can extract interpretable predictive information contained in global SST and discover SST information regions and dependence structures relevant for river flows which, in tandem with climate network constructions, enable improved predictive understanding. Our results reveal additional information content in global SST beyond ENSO indices, develop understanding of how SSTs influence river flows, and generate improved river flow prediction, including uncertainty estimation. Observations, reanalysis data, and earth system model simulations are used to demonstrate the value of the XDL-CN based methods for future interannual and decadal scale climate projections.
    A Context-Integrated Transformer-Based Neural Network for Auction Design. (arXiv:2201.12489v3 [cs.GT] UPDATED)
    One of the central problems in auction design is developing an incentive-compatible mechanism that maximizes the auctioneer's expected revenue. While theoretical approaches have encountered bottlenecks in multi-item auctions, recently, there has been much progress on finding the optimal mechanism through deep learning. However, these works either focus on a fixed set of bidders and items, or restrict the auction to be symmetric. In this work, we overcome such limitations by factoring public contextual information of bidders and items into the auction learning framework. We propose CITransNet, a context-integrated transformer-based neural network for optimal auction design, which maintains permutation-equivariance over bids and contexts while being able to find asymmetric solutions. We show by extensive experiments that CITransNet can recover the known optimal solutions in single-item settings, outperform strong baselines in multi-item auctions, and generalize well to cases other than those in training.
    Linear Connectivity Reveals Generalization Strategies. (arXiv:2205.12411v5 [cs.LG] UPDATED)
    It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster -- models that occupy separate basins on the surface. By measuring performance on specially-crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag of words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
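    The barrier measurement described in this abstract can be illustrated with a minimal sketch. It assumes one common definition of the barrier (the largest excess of the loss on the linear path over the straight line joining the endpoint losses); the paper may use a different one. The function names and the toy double-well loss are hypothetical and stand in for real finetuned models and test loss.

```python
import numpy as np

def linear_path_losses(theta_a, theta_b, loss_fn, num_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment theta_a -> theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    losses = np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def loss_barrier(theta_a, theta_b, loss_fn, num_points=11):
    """Assumed definition: max over the path of (loss - linear interpolation
    of the two endpoint losses). A value near zero suggests the two
    parameter vectors sit in one linearly connected basin."""
    alphas, losses = linear_path_losses(theta_a, theta_b, loss_fn, num_points)
    baseline = (1.0 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

def toy_loss(theta):
    """Hypothetical double-well loss with minima at theta = -1 and theta = +1."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))

def convex_loss(theta):
    """Hypothetical single-basin loss, for contrast."""
    return float(np.sum(theta ** 2))

theta_a = np.array([-1.0])
theta_b = np.array([1.0])
barrier_two_basins = loss_barrier(theta_a, theta_b, toy_loss)
barrier_one_basin = loss_barrier(theta_a, theta_b, convex_loss)
print(barrier_two_basins, barrier_one_basin)
```

    In this toy setting the two minima of the double-well loss are separated by a positive barrier, while the convex loss shows none; with real models, `loss_fn` would load the interpolated weights and evaluate on held-out data.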
    Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks. (arXiv:2204.07261v3 [cs.LG] UPDATED)
    We prove linear convergence of gradient descent to a global optimum for the training of deep residual networks with constant layer width and smooth activation function. We show that if the trained weights, as a function of the layer index, admit a scaling limit as the depth increases, then the limit has finite $p$-variation with $p=2$. Proofs are based on non-asymptotic estimates for the loss function and for norms of the network weights along the gradient descent path. We illustrate the relevance of our theoretical results to practical settings using detailed numerical experiments on supervised learning problems.
    A Unification Framework for Euclidean and Hyperbolic Graph Neural Networks. (arXiv:2206.04285v2 [cs.LG] UPDATED)
    Hyperbolic neural networks are able to capture the inherent hierarchy of graph datasets, and are consequently a powerful choice of GNN. However, they entangle multiple incongruent (gyro-)vector spaces within a layer, which limits their generalization and scalability. In this work, we propose to use the Poincaré disk model as our search space, and apply all approximations on the disk (as if the disk is a tangent space derived from the origin), thus eliminating all inter-space transformations. Such an approach enables us to propose a hyperbolic normalization layer, and to further simplify the entire hyperbolic model to a Euclidean model cascaded with our hyperbolic normalization layer. We applied our proposed nonlinear hyperbolic normalization to the current state-of-the-art homogeneous and multi-relational graph networks. We demonstrate that the model not only leverages the power of Euclidean networks, such as interpretability and efficient execution of various model components, but also outperforms both Euclidean and hyperbolic counterparts in our benchmarks.
    From Kepler to Newton: Explainable AI for Science. (arXiv:2111.12210v7 [cs.AI] UPDATED)
    The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific research, it has sometimes been very difficult to manually analyze the data and propose new hypotheses to drive the cycle for scientific discovery. In this paper, we discuss the role of Explainable AI in the scientific discovery process by demonstrating an Explainable AI-based paradigm for science discovery. The key is to use Explainable AI to help derive data or model interpretations, hypotheses, as well as scientific discoveries or insights. We show how computational and data-intensive methodology -- together with experimental and theoretical methodology -- can be seamlessly integrated for scientific research. To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, whose work led the scientific revolution of the 16th and 17th centuries, we show how Kepler's laws of planetary motion and Newton's law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe's astronomical observation data. This work also highlights the important role of Explainable AI (as compared to Blackbox AI) in science discovery to help humans prevent or better prepare for the possible technological singularity that may happen in the future, since science is not only about the know-how, but also the know-why. Presentation of the work is available at https://slideslive.com/38986142/from-kepler-to-newton-explainable-ai-for-science-discovery.
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.
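    The basic idea of injecting noise before computing a gradient step can be sketched as follows. This is only an illustration of the plain variant, in which the gradient is evaluated at a perturbed copy of the weights; it does not implement the independent layer-wise perturbations the abstract proposes. The function names, the isotropic Gaussian noise, and the quadratic toy objective are assumptions made for the example.

```python
import numpy as np

def noisy_gd_step(w, grad_fn, lr, sigma, rng):
    """One gradient step where the gradient is taken at w + sigma * eps,
    with eps drawn from a standard normal (an assumed noise model)."""
    eps = rng.standard_normal(w.shape)
    return w - lr * grad_fn(w + sigma * eps)

def quad_grad(w):
    """Gradient of the hypothetical toy objective 0.5 * ||w||^2."""
    return w

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])
for _ in range(200):
    w = noisy_gd_step(w, quad_grad, lr=0.1, sigma=0.01, rng=rng)
print(w)
```

    With `sigma=0` the update reduces to vanilla gradient descent, which makes the two procedures easy to compare on the same objective.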
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) an $\epsilon$ fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.
    A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. (arXiv:1809.02362v2 [math.NA] UPDATED)
    Artificial neural networks (ANNs) have very successfully been used in numerical simulations for a series of computational problems ranging from image classification/image recognition, speech recognition, time series analysis, game intelligence, and computational advertising to numerical approximations of partial differential equations (PDEs). Such numerical simulations suggest that ANNs have the capacity to very efficiently approximate high-dimensional functions and, especially, indicate that ANNs seem to admit the fundamental power to overcome the curse of dimensionality when approximating the high-dimensional functions appearing in the above-named computational problems. There is a series of rigorous mathematical approximation results for ANNs in the scientific literature. Some of them prove convergence without convergence rates and some even rigorously establish convergence rates, but there are only a few special cases where mathematical results can rigorously explain the empirical success of ANNs when approximating high-dimensional functions. The key contribution of this article is to disclose that ANNs can efficiently approximate high-dimensional functions in the case of numerical approximations of Black-Scholes PDEs. More precisely, this work reveals that the number of required parameters of an ANN to approximate the solution of the Black-Scholes PDE grows at most polynomially in both the reciprocal of the prescribed approximation accuracy $\varepsilon > 0$ and the PDE dimension $d \in \mathbb{N}$. We thereby prove, for the first time, that ANNs do indeed overcome the curse of dimensionality in the numerical approximation of Black-Scholes PDEs.
    SUPER-Net: Trustworthy Medical Image Segmentation with Uncertainty Propagation in Encoder-Decoder Networks. (arXiv:2111.05978v3 [eess.IV] UPDATED)
    Deep Learning (DL) holds great promise in reshaping the healthcare industry owing to its precision, efficiency, and objectivity. However, the brittleness of DL models to noisy and out-of-distribution inputs is hindering their deployment in the clinic. Most models produce point estimates without further information about model uncertainty or confidence. This paper introduces a new Bayesian DL framework for uncertainty quantification in segmentation neural networks: SUPER-Net: trustworthy medical image Segmentation with Uncertainty Propagation in Encoder-decodeR Networks. SUPER-Net analytically propagates, using Taylor series approximations, the first two moments (mean and covariance) of the posterior distribution of the model parameters across the nonlinear layers. In particular, SUPER-Net simultaneously learns the mean and covariance without expensive post-hoc Monte Carlo sampling or model ensembling. The output consists of two simultaneous maps: the segmented image and its pixelwise uncertainty map, which corresponds to the covariance matrix of the predictive distribution. We conduct an extensive evaluation of SUPER-Net on medical image segmentation of Magnetic Resonance Imaging and Computed Tomography scans under various noisy and adversarial conditions. Our experiments on multiple benchmark datasets demonstrate that SUPER-Net is more robust to noise and adversarial attacks than state-of-the-art segmentation models. Moreover, the uncertainty map of the proposed SUPER-Net associates low confidence (or equivalently high uncertainty) to patches in the test input images that are corrupted with noise, artifacts, or adversarial attacks. Perhaps more importantly, the model exhibits the ability of self-assessment of its segmentation decisions, notably when making erroneous predictions due to noise or adversarial examples.
    Toward Foundation Models for Earth Monitoring: Generalizable Deep Learning Models for Natural Hazard Segmentation. (arXiv:2301.09318v1 [cs.CV])
    Climate change results in an increased probability of extreme weather events that put societies and businesses at risk on a global scale. Therefore, near real-time mapping of natural hazards is an emerging priority for the support of natural disaster relief, risk management, and informing governmental policy decisions. Recent methods to achieve near real-time mapping increasingly leverage deep learning (DL). However, DL-based approaches are designed for one specific task in a single geographic region based on specific frequency bands of satellite data. Therefore, DL models used to map specific natural hazards struggle with their generalization to other types of natural hazards in unseen regions. In this work, we propose a methodology to significantly improve the generalizability of DL natural hazards mappers based on pre-training on a suitable pre-task. Without access to any data from the target domain, we demonstrate this improved generalizability across four U-Net architectures for the segmentation of unseen natural hazards. Importantly, our method is invariant to geographic differences and differences in the type of frequency bands of satellite data. By leveraging characteristics of unlabeled images from the target domain that are publicly available, our approach is able to further improve the generalization behavior without fine-tuning. Thereby, our approach supports the development of foundation models for earth monitoring with the objective of directly segmenting unseen natural hazards across novel geographic regions given different sources of satellite imagery.
    Synthesis of Compositional Animations from Textual Descriptions. (arXiv:2103.14675v6 [cs.CV] UPDATED)
    "How can we animate 3D-characters from a movie script or move robots by simply telling them what we would like them to do?" "How unstructured and complex can we make a sentence and still generate plausible movements from it?" These are questions that need to be answered in the long-run, as the field is still in its infancy. Inspired by these problems, we present a new technique for generating compositional actions, which handles complex input sentences. Our output is a 3D pose sequence depicting the actions in the input sentence. We propose a hierarchical two-stream sequential model to explore a finer joint-level mapping between natural language sentences and 3D pose sequences corresponding to the given motion. We learn two manifold representations of the motion -- one each for the upper body and the lower body movements. Our model can generate plausible pose sequences for short sentences describing single actions as well as long compositional sentences describing multiple sequential and superimposed actions. We evaluate our proposed model on the publicly available KIT Motion-Language Dataset containing 3D pose data with human-annotated sentences. Experimental results show that our model advances the state-of-the-art on text-based motion synthesis in objective evaluations by a margin of 50%. Qualitative evaluations based on a user study indicate that our synthesized motions are perceived to be the closest to the ground-truth motion captures for both short and compositional sentences.
    Rig Inversion by Training a Differentiable Rig Function. (arXiv:2301.09567v1 [cs.GR])
    Rig inversion is the problem of creating a method that can find the rig parameter vector that best approximates a given input mesh. In this paper, we propose to solve this problem by first training a multilayer perceptron to approximate the rig function, which yields a differentiable rig function. This differentiable rig function can then be used to train a deep learning model of rig inversion.
    Dataset Structural Index: Leveraging a machine's perspective towards visual data. (arXiv:2110.04070v3 [cs.CV] UPDATED)
    With advances in vision and perception architectures, we have realized that working with data is as crucial as the algorithms, if not more so. Until now, we have trained machines based on our knowledge and perspective of the world. The entire concept of the Dataset Structural Index (DSI) revolves around understanding a machine's perspective of the dataset. With DSI, I show two meta values with which we can get more information about a visual dataset and use it to optimize data, create better architectures, and guess which model would work best. These two values are the Variety contribution ratio and the Similarity matrix. In the paper, I show many applications of DSI, one of which is how the same level of accuracy can be achieved with the same model architectures trained on less data.
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects is the randomized controlled trial; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We evaluate DeepACE in an extensive set of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.
    WDC Products: A Multi-Dimensional Entity Matching Benchmark. (arXiv:2301.09521v1 [cs.LG])
    The difficulty of an entity matching task depends on a combination of multiple factors such as the number of corner-case pairs, the fraction of entities in the test set that have not been seen during training, and the size of the development set. Current entity matching benchmarks usually represent single points in the space along such dimensions or they provide for the evaluation of matching methods along a single dimension, for instance the amount of training data. This paper presents WDC Products, an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are (i) number of corner cases, (ii) generalization to unseen entities, and (iii) development set size. Generalization to unseen entities is a dimension not yet covered by any of the existing benchmarks but is crucial for evaluating the robustness of entity matching systems. WDC Products is based on heterogeneous product data from thousands of e-shops which mark up product offers using schema.org annotations. Instead of learning how to match entity pairs, entity matching can also be formulated as a multi-class classification task that requires the matcher to recognize individual entities. WDC Products is the first benchmark that provides a pair-wise and a multi-class formulation of the same tasks and thus allows the two alternatives to be compared directly. We evaluate WDC Products using several state-of-the-art matching systems, including Ditto, HierGAT, and R-SupCon. The evaluation shows that all matching systems struggle with unseen entities to varying degrees. It also shows that some systems are more training data efficient than others.
    A Time Series Approach to Parkinson's Disease Classification from EEG. (arXiv:2301.09568v1 [q-bio.NC])
    Firstly, we present a novel representation for EEG data, a 7-variate series of band power coefficients, which enables the use of (previously inaccessible) time series classification methods. Specifically, we implement the multi-resolution representation-based time series classification method MrSQL. This is deployed on a challenging early-stage Parkinson's dataset that includes wakeful and sleep EEG. Initial results are promising with over 90% accuracy achieved on all EEG data types used. Secondly, we present a framework that enables high-importance data types and brain regions for classification to be identified. Using our framework, we find that, across different EEG data types, it is the Prefrontal brain region that has the most predictive power for the presence of Parkinson's Disease. This outperformance was statistically significant versus ten of the twelve other brain regions (not significant versus adjacent Left Frontal and Right Frontal regions). The Prefrontal region of the brain is important for higher-order cognitive processes and our results align with studies that have shown neural dysfunction in the prefrontal cortex in Parkinson's Disease.
    BayBFed: Bayesian Backdoor Defense for Federated Learning. (arXiv:2301.09508v1 [cs.LG])
    Federated learning (FL) allows participants to jointly train a machine learning model without sharing their private data with others. However, FL is vulnerable to poisoning attacks such as backdoor attacks. Consequently, a variety of defenses have recently been proposed, which have primarily utilized intermediary states of the global model (i.e., logits) or distance of the local models (i.e., L2-norm) from the global model to detect malicious backdoors. However, as these approaches directly operate on client updates, their effectiveness depends on factors such as clients' data distribution or the adversary's attack strategies. In this paper, we introduce a novel and more generic backdoor defense framework, called BayBFed, which utilizes probability distributions over client updates to detect malicious updates in FL: it computes a probabilistic measure over the clients' updates to keep track of any adjustments made in the updates, and uses a novel detection algorithm that can leverage this probabilistic measure to efficiently detect and filter out malicious updates. Thus, it overcomes the shortcomings of previous approaches that arise from the direct usage of client updates, as our probabilistic measure includes all aspects of the local client training strategies. BayBFed utilizes two Bayesian Non-Parametric extensions: (i) a Hierarchical Beta-Bernoulli process to draw a probabilistic measure given the clients' updates, and (ii) an adaptation of the Chinese Restaurant Process (CRP), which we refer to as CRP-Jensen, which leverages this probabilistic measure to detect and filter out malicious updates. We extensively evaluate our defense approach on five benchmark datasets: CIFAR10, Reddit, IoT intrusion detection, MNIST, and FMNIST, and show that it can effectively detect and eliminate malicious updates in FL without deteriorating the benign performance of the global model.
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.
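    The stagnation effect mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: it uses the standard unbiased stochastic rounding scheme (the paper studies biased variants), a toy objective f(x) = 0.5 x^2 that satisfies the Polyak-Lojasiewicz inequality, and hypothetical helper names (`stochastic_round`, `round_to_nearest`, `gradient_descent`).

```python
import math
import random

def stochastic_round(x, frac_bits, rng):
    """Round x to a fixed-point grid with spacing 2**-frac_bits.

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the rounding error has zero mean (unbiased).
    """
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # lies in [0, 1)
    return (lower + (1 if rng.random() < p_up else 0)) / scale

def round_to_nearest(x, frac_bits):
    """Deterministic rounding to the nearest grid point."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale

def gradient_descent(x0, lr, steps, rounder):
    """Minimize f(x) = 0.5 * x**2, rounding the iterate after every update."""
    x = x0
    for _ in range(steps):
        grad = x  # f'(x) = x
        x = rounder(x - lr * grad)
    return x

rng = random.Random(0)
frac_bits = 4  # grid spacing 1/16 (an arbitrary choice for illustration)
x_rn = gradient_descent(1.0, 0.01, 2000, lambda v: round_to_nearest(v, frac_bits))
x_sr = gradient_descent(1.0, 0.01, 2000, lambda v: stochastic_round(v, frac_bits, rng))
print("round-to-nearest:   ", x_rn)
print("stochastic rounding:", x_sr)
```

    With round-to-nearest, each update is smaller than half the grid spacing and is rounded away, so the iterate never leaves its starting point. Stochastic rounding lets small updates through with a probability proportional to their size, so the iterate keeps making progress on average.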
    FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows. (arXiv:2301.09266v1 [cs.CV])
    Invertible convolutions have been an essential element for building expressive normalizing flow-based generative models since their introduction in Glow. Several attempts have been made to design invertible $k \times k$ convolutions that are efficient in training and sampling passes. Though these attempts have improved the expressivity and sampling efficiency, they severely lagged behind Glow, which used only $1 \times 1$ convolutions, in terms of sampling time. Also, many of the approaches mask a large number of parameters of the underlying convolution, resulting in lower expressivity on a fixed run-time budget. We propose a $k \times k$ convolutional layer and Deep Normalizing Flow architecture which (i) has a fast parallel inversion algorithm with running time $O(n k^2)$ ($n$ is the height and width of the input image and $k$ is the kernel size), (ii) masks the minimal number of learnable parameters in a layer, and (iii) gives better forward pass and sampling times comparable to other $k \times k$ convolution-based models on real-world benchmarks. We provide an implementation of the proposed parallel algorithm for sampling using our invertible convolutions on GPUs. Benchmarks on CIFAR-10, ImageNet, and CelebA datasets show comparable performance to previous works regarding bits per dimension while significantly improving the sampling time.
    Estimating the energy requirements for long term memory formation. (arXiv:2301.09565v1 [q-bio.NC])
    Brains consume metabolic energy to process information, but also to store memories. The energy required for memory formation can be substantial; for instance, in fruit flies, memory formation leads to a shorter lifespan upon subsequent starvation (Mery and Kawecki, 2005). Here we estimate that the energy required corresponds to about 10 mJ/bit and compare this to biophysical estimates as well as energy requirements in computer hardware. We conclude that biological memory storage is expensive, but the reason behind it is not known.
    Time-Conditioned Generative Modeling of Object-Centric Representations for Video Decomposition and Prediction. (arXiv:2301.08951v1 [cs.CV])
    When perceiving the world from multiple viewpoints, humans have the ability to reason about complete objects in a compositional manner even when an object is completely occluded from partial viewpoints. Meanwhile, humans can imagine novel views after observing multiple viewpoints. The remarkable recent advance in multi-view object-centric learning leaves some problems open: 1) the partially or completely occluded shape of objects cannot be well reconstructed; 2) novel viewpoint prediction depends on expensive viewpoint annotations rather than implicit view rules. This makes the agent fail to perform like humans. In this paper, we introduce a time-conditioned generative model for videos. To reconstruct the complete shape of the object accurately, we enhance the disentanglement between different latent representations: view latent representations are jointly inferred based on the Transformer and then cooperate with the sequential extension of Slot Attention to learn object-centric representations. The model also gains a new ability: Gaussian processes are employed as priors of view latent variables for generation and novel-view prediction without viewpoint annotations. Experiments on multiple specifically designed synthetic datasets have shown that the proposed model can 1) decompose videos, 2) reconstruct the complete shapes of objects, and 3) predict novel viewpoints without viewpoint annotations.
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches based on the Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound on the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real-world applications.
    The Entoptic Field Camera as Metaphor-Driven Research-through-Design with AI Technologies. (arXiv:2301.09545v1 [cs.HC])
    Artificial intelligence (AI) technologies are widely deployed in smartphone photography, and prompt-based image synthesis models have rapidly become commonplace. In this paper, we describe a Research-through-Design (RtD) project which explores this shift in the means and modes of image production via the creation and use of the Entoptic Field Camera. Entoptic phenomena usually refer to perceptions of floaters or bright blue dots stemming from the physiological interplay of the eye and brain. We use the term entoptic as a metaphor to investigate how the material interplay of data and models in AI technologies shapes human experiences of reality. Through our case study using first-person design and a field study, we offer implications for critical, reflective, more-than-human and ludic design to engage AI technologies; conceptualise an RtD research space which contributes to AI literacy discourses; and outline a research trajectory concerning materiality and design affordances of AI technologies.
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and why neural networks can perform well in high-dimensional problems.
    ECGAN: Self-supervised generative adversarial network for electrocardiography. (arXiv:2301.09496v1 [cs.LG])
    High-quality synthetic data can support the development of effective predictive models for biomedical tasks, especially in rare diseases or when subject to compelling privacy constraints. These limitations, for instance, negatively impact open access to electrocardiography datasets about arrhythmias. This work introduces a self-supervised approach to the generation of synthetic electrocardiography time series which is shown to promote morphological plausibility. Our model (ECGAN) allows conditioning the generative process on specific rhythm abnormalities, enhancing synchronization and diversity across samples compared with models from the literature. A dedicated sample quality assessment framework is also defined, leveraging arrhythmia classifiers. The empirical results highlight a substantial improvement over state-of-the-art generative models for sequences and audio synthesis.
    Sampling-based Nyström Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nyström approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nyström approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nyström approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.
    DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion. (arXiv:2301.09474v1 [cs.LG])
    Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures for learning desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information by their interactions. The diffusion process is constrained by descent criteria w.r.t. a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed DIFFormer (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitive instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.
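    The claim that biconnectivity can be computed in linear time can be made concrete with a short sketch. The code below is the classical depth-first-search low-link method for cut vertices and bridges, not anything taken from the paper; the function name `cut_vertices_and_bridges` and the example graph are illustrative assumptions, and the input is assumed to be a simple undirected graph given as an adjacency mapping.

```python
def cut_vertices_and_bridges(adj):
    """Return (cut_vertices, bridges) of an undirected simple graph.

    adj maps each vertex to an iterable of its neighbours. A single DFS
    computes discovery times and low-links, so the cost is O(|V| + |E|).
    Recursion is used for brevity; very deep graphs would need an
    iterative DFS instead.
    """
    disc, low = {}, {}
    cut_vertices, bridges = set(), set()
    counter = 0

    def dfs(u, parent):
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue  # skip the tree edge back to the parent
            if v in disc:
                low[u] = min(low[u], disc[v])  # back edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot reach above u: u separates it
                if parent is not None and low[v] >= disc[u]:
                    cut_vertices.add(u)
                # v's subtree cannot even reach u: edge (u, v) is a bridge
                if low[v] > disc[u]:
                    bridges.add(frozenset((u, v)))
        # a DFS root is a cut vertex iff it has more than one DFS child
        if parent is None and children > 1:
            cut_vertices.add(u)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return cut_vertices, bridges

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cuts, bridges = cut_vertices_and_bridges(graph)
print(sorted(cuts), [sorted(b) for b in bridges])
```

    In this example graph, vertices 2 and 3 are the cut vertices and the edge between them is the only bridge; these are the kinds of structural facts the biconnectivity metrics ask a GNN to distinguish.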
    Digital Twins for Marine Operations: A Brief Review on Their Implementation. (arXiv:2301.09574v1 [eess.SY])
    While the concept of a digital twin to support maritime operations is gaining attention for predictive maintenance, real-time monitoring, control, and overall process optimization, clarity on its implementation is missing in the literature. Therefore, in this review we show how different authors implemented their digital twins, discuss our findings, and finally give insights on future research directions.
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.
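    For readers unfamiliar with the Skellam distribution: it is the distribution of the difference of two independent Poisson counts, which is why it suits signed integer link weights. The sketch below evaluates its probability mass function by a truncated sum; the names `skellam_pmf`, `lam_pos`, `lam_neg` and the truncation `terms` are illustrative assumptions, and how SLIM ties the two rates to latent positions is not shown here.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam), lam > 0, computed in log space."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def skellam_pmf(k, lam_pos, lam_neg, terms=200):
    """P(N1 - N2 = k) for independent N1 ~ Poisson(lam_pos), N2 ~ Poisson(lam_neg).

    Sums over N2 = j and N1 = j + k; `terms` truncates the infinite sum,
    which is adequate for small rates such as those used below.
    """
    start = max(0, -k)  # ensures j + k >= 0
    return sum(
        poisson_pmf(j + k, lam_pos) * poisson_pmf(j, lam_neg)
        for j in range(start, start + terms)
    )

# A link whose positive rate exceeds its negative rate leans towards "like".
for weight in (-2, -1, 0, 1, 2):
    print(weight, round(skellam_pmf(weight, 2.0, 1.0), 4))
```

    The mean of the distribution is the difference of the two rates, so a pair with a larger positive rate is more likely to show a positive (friendly) link weight, and vice versa.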
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolf Kalman's pioneering work in the 1960s. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.
    M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion. (arXiv:2301.09269v1 [cs.LG])
    In federated learning (FL), the communication constraint between the remote learners and the Parameter Server (PS) is a crucial bottleneck. For this reason, model updates must be compressed so as to minimize the loss in accuracy resulting from the communication constraint. This paper proposes the "$\mathbf{M}$-magnitude weighted $L_{\mathbf{2}}$ distortion + $\mathbf{2}$ degrees of freedom" (M22) algorithm, a rate-distortion inspired approach to gradient compression for federated training of deep neural networks (DNNs). In particular, we propose a family of distortion measures between the original gradient and the reconstruction, which we refer to as "$M$-magnitude weighted $L_2$" distortion, and we assume that gradient updates follow an i.i.d. distribution, either generalized normal or Weibull, each of which has two degrees of freedom. The distortion measure and the gradient distribution each have one free parameter that can be fitted as a function of the iteration number. Given a choice of gradient distribution and distortion measure, we design the quantizer minimizing the expected distortion in gradient reconstruction. To measure the gradient compression performance under a communication constraint, we define the per-bit accuracy as the optimal improvement in accuracy that one bit of communication brings to the centralized model over the training period. Using this performance measure, we systematically benchmark the choice of gradient distribution and distortion measure. We provide substantial insights on the role of these choices and argue that significant performance improvements can be attained using such a rate-distortion inspired compressor.
    Explaining the effects of non-convergent sampling in the training of Energy-Based Models. (arXiv:2301.09428v1 [cs.LG])
    In this paper, we quantify the impact of using non-convergent Markov chains to train Energy-Based models (EBMs). In particular, we show analytically that EBMs trained with non-persistent short runs to estimate the gradient can perfectly reproduce a set of empirical statistics of the data, not at the level of the equilibrium measure, but through a precise dynamical process. Our results provide a first-principles explanation for the observations of recent works proposing the strategy of using short runs starting from random initial conditions as an efficient way to generate high-quality samples in EBMs, and lay the groundwork for using EBMs as diffusion models. After explaining this effect in generic EBMs, we analyze two solvable models in which the effect of the non-convergent sampling in the trained parameters can be described in detail. Finally, we test these predictions numerically on the Boltzmann machine.
    StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. (arXiv:2301.09515v1 [cs.LG])
    Text-to-image synthesis has recently seen significant progress thanks to large pretrained language models, large-scale training data, and the introduction of scalable model families such as diffusion and autoregressive models. However, the best-performing models require iterative evaluation to generate a single sample. In contrast, generative adversarial networks (GANs) only need a single forward pass. They are thus much faster, but they currently remain far behind the state of the art in large-scale text-to-image synthesis. This paper aims to identify the necessary steps to regain competitiveness. Our proposed model, StyleGAN-T, addresses the specific requirements of large-scale text-to-image synthesis, such as large capacity, stable training on diverse datasets, strong text alignment, and controllable variation vs. text alignment tradeoff. StyleGAN-T significantly improves over previous GANs and outperforms distilled diffusion models, the previous state of the art in fast text-to-image synthesis, in terms of sample quality and speed.
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representation (VC-INR) and show improved performance given the same representational capacity before quantisation, while also outperforming previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.
    DASTSiam: Spatio-Temporal Fusion and Discriminative Augmentation for Improved Siamese Tracking. (arXiv:2301.09063v1 [cs.CV])
    Tracking tasks based on deep neural networks have greatly improved with the emergence of Siamese trackers. However, the appearance of targets often changes during tracking, which can reduce the robustness of the tracker when facing challenges such as aspect ratio change, occlusion, and scale variation. In addition, cluttered backgrounds can lead to multiple high response points in the response map, leading to incorrect target positioning. In this paper, we introduce DASTSiam, which adds two transformer-based modules to improve Siamese tracking: the spatio-temporal (ST) fusion module and the Discriminative Augmentation (DA) module. The ST module uses cross-attention based accumulation of historical cues to improve robustness against object appearance changes, while the DA module associates semantic information between the template and search region to improve target discrimination. Moreover, modifying the label assignment of anchors also improves the reliability of the object location. Our modules can be used with all Siamese trackers and show improved performance on several public datasets through comparative and ablation experiments.
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on the NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary amount of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.
    A Framework for Evaluating the Impact of Food Security Scenarios. (arXiv:2301.09320v1 [cs.LG])
    This study proposes an approach for predicting the impacts of scenarios on food security and demonstrates its application in a case study. The approach involves two main steps: (1) scenario definition, in which the end user specifies the assumptions and impacts of the scenario using a scenario template, and (2) scenario evaluation, in which a Vector Autoregression (VAR) model is used in combination with Monte Carlo simulation to generate predictions for the impacts of the scenario based on the defined assumptions and impacts. The case study is based on a proprietary time series food security database created using data from the Food and Agriculture Organization of the United Nations (FAOSTAT), the World Bank, and the United States Department of Agriculture (USDA). The database contains a wide range of data on various indicators of food security, such as production, trade, consumption, prices, availability, access, and nutritional value. The results show that the proposed approach can be used to predict the potential impacts of scenarios on food security and that the proprietary time series food security database can be used to support this approach. The study provides specific insights on how this approach can inform decision-making processes related to food security such as food prices and availability in the case study region.
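The scenario-evaluation step above (a VAR model combined with Monte Carlo simulation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: a two-variable VAR(1), made-up coefficients `A` and `c`, and a "scenario" modelled simply as a shifted intercept. The abstract does not specify the model order, the indicators, or how scenarios enter the model.

```python
import random

def simulate_var1(A, c, y0, horizon, noise_std, rng):
    """Simulate one path of y_t = c + A y_{t-1} + e_t with Gaussian shocks."""
    y = list(y0)
    path = []
    for _ in range(horizon):
        # The comprehension reads the previous y before it is reassigned.
        y = [
            c[i] + sum(A[i][j] * y[j] for j in range(len(y))) + rng.gauss(0.0, noise_std)
            for i in range(len(y))
        ]
        path.append(y)
    return path

def monte_carlo_forecast(A, c, y0, horizon, noise_std, n_paths, seed=0):
    """Average the final state over many simulated paths."""
    rng = random.Random(seed)
    finals = [simulate_var1(A, c, y0, horizon, noise_std, rng)[-1] for _ in range(n_paths)]
    return [sum(f[i] for f in finals) / n_paths for i in range(len(y0))]

# Hypothetical coefficients for two indicators (say, a price and a supply index).
A = [[0.5, 0.1], [0.0, 0.8]]
c_baseline = [1.0, 0.5]
c_scenario = [1.5, 0.5]  # assumed scenario: a persistent upward shift in the first indicator

baseline = monte_carlo_forecast(A, c_baseline, [2.0, 2.5], horizon=12, noise_std=0.1, n_paths=500)
scenario = monte_carlo_forecast(A, c_scenario, [2.0, 2.5], horizon=12, noise_std=0.1, n_paths=500)
impact = [s - b for s, b in zip(scenario, baseline)]
```

Comparing `scenario` with `baseline` gives a simulated impact estimate; a fuller treatment would also report quantiles across paths rather than only the mean.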
    An iterative multi-fidelity approach for model order reduction of multi-dimensional input parametric PDE systems. (arXiv:2301.09483v1 [math.NA])
    We propose a parametric sampling strategy for the reduction of large-scale PDE systems with multidimensional input parametric spaces by leveraging models of different fidelity. The design of this methodology allows a user to adaptively sample points ad hoc from a discrete training set with no prior requirement of error estimators. It is achieved by exploiting low-fidelity models throughout the parametric space to sample points using an efficient sampling strategy, and at the sampled parametric points, high-fidelity models are evaluated to recover the reduced basis functions. The low-fidelity models are then adapted with the reduced order models (ROMs) built by projection onto the subspace spanned by the recovered basis functions. The process continues until the low-fidelity model can represent the high-fidelity model adequately for all the parameters in the parametric space. Since the proposed methodology leverages the use of low-fidelity models to assimilate the solution database, it significantly reduces the computational cost in the offline stage. The highlight of this article is to present the construction of the initial low-fidelity model, and a sampling strategy based on the discrete empirical interpolation method (DEIM). We test this approach on a 2D steady-state heat conduction problem for two different input parameters and make a qualitative comparison with the classical greedy reduced basis method (RBM), and further test on a 9-dimensional parametric non-coercive elliptic problem and analyze the computational performance based on different tuning of greedy selection of points.
    Speeding Up BatchBALD: A k-BALD Family of Approximations for Active Learning. (arXiv:2301.09490v1 [cs.LG])
    Active learning is a powerful method for training machine learning models with limited labeled data. One commonly used technique for active learning is BatchBALD, which uses Bayesian neural networks to find the most informative points to label in a pool set. However, BatchBALD can be very slow to compute, especially for larger datasets. In this paper, we propose a new approximation, k-BALD, which uses k-wise mutual information terms to approximate BatchBALD, making it much less expensive to compute. Results on the MNIST dataset show that k-BALD is significantly faster than BatchBALD while maintaining similar performance. We also propose a dynamic approach for choosing k based on the quality of the approximation, making it more efficient for larger datasets.
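As background for the mutual information terms mentioned above, here is a minimal sketch of the standard single-point BALD score: the entropy of the averaged prediction minus the average entropy of the individual predictions. This is only the basic building block, not the k-BALD approximation itself, whose exact form the abstract does not give. The input layout (one class-probability vector per posterior sample) is an assumption.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def bald_score(sampled_probs):
    """BALD score for one pool point.

    sampled_probs: list of class-probability vectors, one per posterior
    sample (for example, one per stochastic forward pass).
    """
    n = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    mean_p = [sum(p[c] for p in sampled_probs) / n for c in range(n_classes)]
    # Mutual information between the label and the model parameters.
    return entropy(mean_p) - sum(entropy(p) for p in sampled_probs) / n

# Posterior samples that agree carry no information about the parameters...
agree = [[0.9, 0.1], [0.9, 0.1]]
# ...while confident but contradictory samples give the maximal score, log(2).
disagree = [[1.0, 0.0], [0.0, 1.0]]
```

Points are then ranked by this score; the batch variants differ in how they account for redundancy between the points selected together.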
    A Simple Recipe for Competitive Low-compute Self supervised Vision Models. (arXiv:2301.09451v1 [cs.CV])
    Self-supervised methods in vision have been mostly focused on large architectures as they seem to suffer from a significant performance drop for smaller architectures. In this paper, we propose a simple self-supervised distillation technique that can train high performance low-compute neural networks. Our main insight is that existing joint-embedding based SSL methods can be repurposed for knowledge distillation from a large self-supervised teacher to a small student model. Thus, we call our method Replace one Branch (RoB) as it simply replaces one branch of the joint-embedding training with a large teacher model. RoB is widely applicable to a number of architectures such as small ResNets, MobileNets and ViT, and pretrained models such as DINO, SwAV or iBOT. When pretraining on the ImageNet dataset, RoB yields models that compete with supervised knowledge distillation. When applied to MSN, RoB produces students with strong semi-supervised capabilities. Finally, our best ViT-Tiny models improve over prior SSL state-of-the-art on ImageNet by $2.3\%$ and are on par or better than a supervised distilled DeiT on five downstream transfer tasks (iNaturalist, CIFAR, Clevr/Count, Clevr/Dist and Places). We hope RoB enables practical self-supervision at smaller scale.
    New Insights into Multi-Calibration. (arXiv:2301.08837v1 [cs.LG])
    We identify a novel connection between the recent literature on multi-group fairness for prediction algorithms and well-established notions of graph regularity from extremal graph theory. We frame our investigation using new, statistical distance-based variants of multi-calibration that are closely related to the concept of outcome indistinguishability. Adopting this perspective leads us naturally not only to our graph theoretic results, but also to new multi-calibration algorithms with improved complexity in certain parameter regimes, and to a generalization of a state-of-the-art result on omniprediction. Along the way, we also unify several prior algorithms for achieving multi-group fairness, as well as their analyses, through the lens of no-regret learning.
    LSTM and CNN application for core-collapse supernova search in gravitational wave real data. (arXiv:2301.09387v1 [astro-ph.IM])
    $Context.$ Core-collapse supernovae (CCSNe) are expected to emit gravitational wave signals that could be detected by current and future generation interferometers within the Milky Way and nearby galaxies. The stochastic nature of the signal arising from CCSNe requires alternative detection methods to matched filtering. $Aims.$ We aim to show the potential of machine learning (ML) for multi-label classification of different CCSNe simulated signals and noise transients using real data. We compared the performance of 1D and 2D convolutional neural networks (CNNs) on single and multiple detector data. For the first time, we also tested multi-label classification with long short-term memory (LSTM) networks. $Methods.$ We applied a search and classification procedure for CCSNe signals, using an event trigger generator, the Wavelet Detection Filter (WDF), coupled with ML. We used time series and time-frequency representations of the data as inputs to the ML models. To compute classification accuracies, we simultaneously injected, at a detectable distance of 1\,kpc, CCSN waveforms, obtained from recent hydrodynamical simulations of neutrino-driven core-collapse, onto interferometer noise from the O2 LIGO and Virgo science run. $Results.$ We compared the performance of the three models on single detector data. We then merged the output of the models for single detector classification of noise and astrophysical transients, obtaining overall accuracies of $\sim99\%$ for LIGO and $\sim80\%$ for Virgo. We extended our analysis to the multi-detector case using triggers coincident among the three ITFs and achieved an accuracy of $\sim98\%$.
    Ordinal Regression for Difficulty Estimation of StepMania Levels. (arXiv:2301.09485v1 [cs.LG])
    StepMania is a popular open-source clone of a rhythm-based video game. As is common in popular games, there is a large number of community-designed levels. It is often difficult for players and level authors to determine the difficulty level of such community contributions. In this work, we formalize and analyze the difficulty prediction task on StepMania levels as an ordinal regression (OR) task. We standardize a more extensive and diverse selection of this data resulting in five data sets, two of which are extensions of previous work. We evaluate many competitive OR and non-OR models, demonstrating that neural network-based models significantly outperform the state of the art and that StepMania-level data makes for an excellent test bed for deep OR models. We conclude with a user experiment showing our trained models' superiority over human labeling.
    Hierarchically branched diffusion models for efficient and interpretable multi-class conditional generation. (arXiv:2212.10777v2 [cs.LG] UPDATED)
    Diffusion models have achieved justifiable popularity by attaining state-of-the-art performance in generating realistic objects, including when conditioning generation on labels. Current diffusion models are universally linear in nature, modeling diffusion identically for objects of all classes. For the multi-class conditional generation problem, we propose a novel, structurally unique framework of diffusion models which are hierarchically branched according to the inherent relationships between classes. In this work, we showcase several advantages of branched diffusion models. We demonstrate that branched models generate samples more efficiently, and are more easily extended to novel classes in a continual-learning setting. We also show that branched models enjoy a unique interpretability that offers insight into the modeled data distribution. Branched diffusion models represent an alternative paradigm to their traditional linear counterparts, and can have large impacts in how we use diffusion models for efficient generation, online learning, and scientific discovery.
    LF-checker: Machine Learning Acceleration of Bounded Model Checking for Concurrency Verification (Competition Contribution). (arXiv:2301.09142v1 [cs.LG])
    We describe and evaluate LF-checker, a metaverifier tool based on machine learning. It extracts multiple features of the program under test and predicts the optimal configuration (flags) of a bounded model checker with a decision tree. Our current work is specialised in concurrency verification and employs ESBMC as a back-end verification engine. In the paper, we demonstrate that LF-checker achieves better results than the default configuration of the underlying verification engine.
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    Estimating the performance of a given model architecture without training it is of great importance in Neural Architecture Search (NAS), as evaluating an architecture can take a lot of time. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements to the GP-NAS model so that it shares the advantages of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.
    SMDDH: Singleton Mention detection using Deep Learning in Hindi Text. (arXiv:2301.09361v1 [cs.CL])
    Mention detection is an important component of a coreference resolution system, where mentions such as names, nominals, and pronominals are identified. These mentions can be purely coreferential mentions or singleton mentions (non-coreferential mentions). Coreferential mentions are those mentions in a text that refer to the same entities in the real world. Singleton mentions, by contrast, are mentioned only once in the text and do not participate in coreference, as they are not mentioned again in the following text. Filtering out these singleton mentions can substantially improve the performance of a coreference resolution process. This paper proposes a singleton mention detection module based on a fully connected network and a convolutional neural network for Hindi text. This model utilizes a few hand-crafted features, context information, and word embeddings. A coreference-annotated Hindi dataset comprising 3.6K sentences and 78K tokens is used for the task. In terms of Precision, Recall, and F-measure, the experimental findings obtained are excellent.
    A Comprehensive Survey on Heart Sound Analysis in the Deep Learning Era. (arXiv:2301.09362v1 [cs.SD])
    Heart sound auscultation has been demonstrated to be beneficial in clinical usage for early screening of cardiovascular diseases. Because auscultation requires well-trained professionals, automatic auscultation based on signal processing and machine learning can support auxiliary diagnosis and reduce the burden of training professional clinicians. Nevertheless, classic machine learning offers limited performance improvement in the era of big data. Deep learning has achieved better performance than classic machine learning in many research fields, as it employs more complex model architectures with stronger capability of extracting effective representations. Deep learning has been successfully applied to heart sound analysis in the past years. As most review works about heart sound analysis were given before 2017, the present survey is the first comprehensive overview summarising papers on heart sound analysis with deep learning in the past six years, 2017--2022. We introduce both classic machine learning and deep learning for comparison, and further offer insights about the advances and future research directions in deep learning for heart sound analysis.
    MATT: Multimodal Attention Level Estimation for e-learning Platforms. (arXiv:2301.09174v1 [cs.CV])
    This work presents a new multimodal system for remote attention level estimation based on multimodal face analysis. Our multimodal approach uses different parameters and signals obtained from the behavior and physiological processes that have been related to modeling cognitive load such as facial gestures (e.g., blink rate, facial action units) and user actions (e.g., head pose, distance to the camera). The multimodal system uses the following modules based on Convolutional Neural Networks (CNNs): eye blink detection, head pose estimation, facial landmark detection, and facial expression features. First, we individually evaluate the proposed modules in the task of estimating the student's attention level captured during online e-learning sessions. For that we trained binary classifiers (high or low attention) based on Support Vector Machines (SVM) for each module. Secondly, we find out to what extent multimodal score level fusion improves the attention level estimation. The mEBAL database is used in the experimental framework, a public multi-modal database for attention level estimation obtained in an e-learning environment that contains data from 38 users while conducting several e-learning tasks of variable difficulty (creating changes in student cognitive loads).
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.
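The short-run Langevin sampling described above can be sketched on a toy problem. The update rule is the standard Langevin step; the target here is a one-dimensional standard normal, which is an assumption standing in for the intractable posterior over latent variables. The step size, number of steps, and starting point are likewise made up for illustration.

```python
import random

def short_run_langevin(z0, grad_log_p, step_size, n_steps, rng):
    """Run a fixed, small number of Langevin steps starting from z0.

    Update: z <- z + (s^2 / 2) * grad log p(z) + s * N(0, 1)
    """
    z = z0
    for _ in range(n_steps):
        z = z + 0.5 * step_size ** 2 * grad_log_p(z) + step_size * rng.gauss(0.0, 1.0)
    return z

def grad_log_standard_normal(z):
    """Gradient of log N(z; 0, 1), a toy stand-in for the posterior score."""
    return -z

rng = random.Random(0)
samples = [
    short_run_langevin(5.0, grad_log_standard_normal, 0.3, 20, rng)
    for _ in range(1000)
]
mean_after_short_run = sum(samples) / len(samples)
```

With these settings each step shrinks the mean by a factor of 0.955, so after 20 steps the chains are centred near 5 x 0.955^20, roughly 2, rather than at the target mean of 0. This illustrates the non-convergent regime the abstract refers to: the finite-step sampler acts as an approximate inference model of its own rather than an exact posterior sampler.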
    Enabling Hard Constraints in Differentiable Neural Network and Accelerator Co-Exploration. (arXiv:2301.09312v1 [cs.LG])
    Co-exploration of an optimal neural architecture and its hardware accelerator is an approach of rising interest which addresses the computational cost problem, especially in low-profile systems. The large co-exploration space is often handled by adopting the idea of differentiable neural architecture search. However, despite the superior search efficiency of the differentiable co-exploration, it faces a critical challenge of not being able to systematically satisfy hard constraints such as frame rate. To handle the hard constraint problem of differentiable co-exploration, we propose HDX, which searches for hard-constrained solutions without compromising the global design objectives. By manipulating the gradients in the interest of the given hard constraint, high-quality solutions satisfying the constraint can be obtained.
    On Multi-Agent Deep Deterministic Policy Gradients and their Explainability for SMARTS Environment. (arXiv:2301.09420v1 [cs.LG])
    Multi-Agent RL (MARL) is one of the complex problems in the Autonomous Driving literature that hampers the release of fully autonomous vehicles today. Several simulators have been iterated on since their inception to mitigate the problem of complex scenarios with multiple agents in Autonomous Driving. One such simulator, SMARTS, emphasizes the importance of cooperative multi-agent learning. For this problem, we discuss two approaches, MAPPO and MADDPG, which are based on on-policy and off-policy RL, respectively. We compare our results with the state-of-the-art results for this challenge and discuss the potential areas of improvement while discussing the explainability of these approaches in conjunction with waypoints in the SMARTS environment.
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in the form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.
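To make the Shapley-value idea above concrete, here is a minimal sketch of the classical exact computation: each player's attribution is its marginal contribution averaged over all orderings. The "gates" and the value function `toy_value` are hypothetical; in the paper's setting the value would come from evaluating a circuit with only a subset of gates active, which this sketch does not attempt, and it is not the adapted quantum variant the abstract describes.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_value(active):
    """Hypothetical score for a subset of active gates (made-up numbers)."""
    score = 0.0
    if "H" in active:
        score += 0.2
    if "H" in active and "CNOT" in active:
        score += 0.6  # these two gates only pay off together
    return score      # "RZ" never changes the score in this toy

gates = ["H", "CNOT", "RZ"]
phi = shapley_values(gates, toy_value)
```

The attributions sum to the value of the full gate set, and a gate that never changes the score receives zero. Exact enumeration grows factorially with the number of players, so larger circuits would need a sampling-based estimate.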
    Prompt Federated Learning for Weather Forecasting: Toward Foundation Models on Meteorological Data. (arXiv:2301.09152v1 [cs.LG])
    To tackle the global climate challenge, there is an urgent need to develop a collaborative platform for comprehensive weather forecasting on large-scale meteorological data. Despite this urgency, heterogeneous meteorological sensors across countries and regions, which inevitably cause multivariate heterogeneity and data exposure, have become the main barrier. This paper develops a foundation model across regions capable of understanding complex meteorological data and providing weather forecasting. To relieve the data exposure concern across regions, a novel federated learning approach has been proposed to collaboratively learn a brand-new spatio-temporal Transformer-based foundation model across participants with heterogeneous meteorological data. Moreover, a novel prompt learning mechanism has been adopted to satisfy low-resourced sensors' communication and computational constraints. The effectiveness of the proposed method has been demonstrated on classical weather forecasting tasks using three meteorological datasets with multivariate time series.
    Lower Bounds on Learning Pauli Channels. (arXiv:2301.09192v1 [quant-ph])
    Understanding the noise affecting a quantum device is of fundamental importance for scaling quantum technologies. A particularly important class of noise models is that of Pauli channels, as randomized compiling techniques can effectively bring any quantum channel to this form and Pauli channels are significantly more structured than general quantum channels. In this paper, we show fundamental lower bounds on the sample complexity for learning Pauli channels in diamond norm with unentangled measurements. We consider both adaptive and non-adaptive strategies. In the non-adaptive setting, we show a lower bound of $\Omega(2^{3n}\epsilon^{-2})$ to learn an $n$-qubit Pauli channel. In particular, this shows that the recently introduced learning procedure by Flammia and Wallman is essentially optimal. In the adaptive setting, we show a lower bound of $\Omega(2^{2.5n}\epsilon^{-2})$ for $\epsilon=\mathcal{O}(2^{-n})$, and a lower bound of $\Omega(2^{2n}\epsilon^{-2})$ for any $\epsilon > 0$. This last lower bound even applies for arbitrarily many sequential uses of the channel, as long as they are only interspersed with other unital operations.
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo
    Abstracting Imperfect Information Away from Two-Player Zero-Sum Games. (arXiv:2301.09159v1 [cs.GT])
    In their seminal work, Nayyar et al. (2013) showed that imperfect information can be abstracted away from common-payoff games by having players publicly announce their policies as they play. This insight underpins sound solvers and decision-time planning algorithms for common-payoff games. Unfortunately, a naive application of the same insight to two-player zero-sum games fails because Nash equilibria of the game with public policy announcements may not correspond to Nash equilibria of the original game. As a consequence, existing sound decision-time planning algorithms require complicated additional mechanisms that have unappealing properties. The main contribution of this work is showing that certain regularized equilibria do not possess the aforementioned non-correspondence problem -- thus, computing them can be treated as perfect information problems. Because these regularized equilibria can be made arbitrarily close to Nash equilibria, our result opens the door to a new perspective on solving two-player zero-sum games and, in particular, yields a simplified framework for decision-time planning in two-player zero-sum games, void of the unappealing properties that plague existing decision-time planning approaches.
    Optimising complexity of CNN models for resource constrained devices: QRS detection case study. (arXiv:2301.09232v1 [cs.LG])
    Traditional DL models are complex and resource hungry, and thus care needs to be taken in designing Internet of (medical) things (IoT, or IoMT) applications that balance the efficiency-complexity trade-off. Recent IoT solutions tend to avoid using deep-learning methods due to such complexities, and classical filter-based methods are commonly used instead. We hypothesize that a shallow CNN model can offer a satisfactory level of performance when combined with other essential solution components, such as post-processing that is suitable for resource-constrained environments. In an IoMT application context, with QRS-detection and R-peak localisation from ECG signals as a case study, the complexities of CNN models and post-processing were varied to identify a set of combinations suitable for a range of target resource-limited environments. To the best of our knowledge, finding a deployable configuration by incrementally increasing the CNN model complexity, as required to match the target's resource capacity, and leveraging the strength of post-processing, is the first of its kind. The results show that a shallow 2-layer CNN with suitable post-processing can achieve $>$90\% F1-score, and the scores continue to improve for 8-32 layer CNNs, which can be used to profile the target constrained environment. The outcome shows that it is possible to design an optimal DL solution with known target performance characteristics and resource (computing capacity and memory) constraints.
    A Survey on Actionable Knowledge. (arXiv:2301.09317v1 [cs.LG])
    Actionable Knowledge Discovery (AKD) is a crucial aspect of data mining that is gaining popularity and being applied in a wide range of domains. This is because AKD can extract valuable insights and information, also known as knowledge, from large datasets. The goal of this paper is to examine different research studies that focus on various domains and have different objectives. The paper will review and discuss the methods used in these studies in detail. AKD is a process of identifying and extracting actionable insights from data, which can be used to make informed decisions and improve business outcomes. It is a powerful tool for uncovering patterns and trends in data that can be used for various applications such as customer relationship management, marketing, and fraud detection. The research studies reviewed in this paper explore different techniques and approaches for AKD in different domains, such as healthcare, finance, and telecommunications. The paper will provide a thorough analysis of the current state of AKD in the field and will review the main methods used by various research studies. Additionally, the paper will evaluate the advantages and disadvantages of each method and will discuss any novel solutions presented in the field. Overall, this paper aims to provide a comprehensive overview of the methods and techniques used in AKD and the impact they have on different domains.
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.
    Differentially Private Natural Language Models: Recent Advances and Future Directions. (arXiv:2301.09112v1 [cs.CL])
    Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, achieving good performance while also protecting the privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances on DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with standard DP deep learning. Then we investigate some existing work on DP-NLP and present its recent developments from two aspects: gradient perturbation based methods and embedding vector perturbation based methods. We also discuss some challenges and future directions of this topic.
    Learning to Reject with a Fixed Predictor: Application to Decontextualization. (arXiv:2301.09044v1 [cs.LG])
    We study the problem of classification with a reject option for a fixed predictor, applicable in natural language processing. We introduce a new problem formulation for this scenario, and an algorithm minimizing a new surrogate loss function. We provide a complete theoretical analysis of the surrogate loss function with a strong $H$-consistency guarantee. For evaluation, we choose the decontextualization task, and provide a manually-labelled dataset of 2,000 examples. Our algorithm significantly outperforms the baselines considered, with a ~25% improvement in coverage when halving the error rate, which is only ~3% away from the theoretical limit.
    Provable Unrestricted Adversarial Training without Compromise with Generalizability. (arXiv:2301.09069v1 [cs.LG])
    Adversarial training (AT) is widely considered as the most promising strategy to defend against adversarial attacks and has drawn increasing interest from researchers. However, the existing AT methods still suffer from two challenges. First, they are unable to handle unrestricted adversarial examples (UAEs), which are built from scratch, as opposed to restricted adversarial examples (RAEs), which are created by adding perturbations bound by an $l_p$ norm to observed examples. Second, the existing AT methods often achieve adversarial robustness at the expense of standard generalizability (i.e., the accuracy on natural examples) because they make a tradeoff between them. To overcome these challenges, we propose a unique viewpoint that understands UAEs as imperceptibly perturbed unobserved examples. Also, we find that the tradeoff results from the separation of the distributions of adversarial examples and natural examples. Based on these ideas, we propose a novel AT approach called Provable Unrestricted Adversarial Training (PUAT), which can provide a target classifier with comprehensive adversarial robustness against both UAE and RAE, and simultaneously improve its standard generalizability. Particularly, PUAT utilizes partially labeled data to achieve effective UAE generation by accurately capturing the natural data distribution through a novel augmented triple-GAN. At the same time, PUAT extends the traditional AT by introducing the supervised loss of the target classifier into the adversarial loss and achieves the alignment between the UAE distribution, the natural data distribution, and the distribution learned by the classifier, with the collaboration of the augmented triple-GAN. Finally, the solid theoretical analysis and extensive experiments conducted on widely-used benchmarks demonstrate the superiority of PUAT.
    Relaxed Models for Adversarial Streaming: The Advice Model and the Bounded Interruptions Model. (arXiv:2301.09203v1 [cs.DS])
    Streaming algorithms are typically analyzed in the oblivious setting, where we assume that the input stream is fixed in advance. Recently, there has been growing interest in designing adversarially robust streaming algorithms that must maintain utility even when the input stream is chosen adaptively and adversarially as the execution progresses. While several fascinating results are known for the adversarial setting, in general, robustness comes at a very high cost in terms of the required space. Motivated by this, in this work we set out to explore intermediate models that allow us to interpolate between the oblivious and the adversarial models. Specifically, we put forward the following two models: (1) *The advice model*, in which the streaming algorithm may occasionally ask for one bit of advice. (2) *The bounded interruptions model*, in which we assume that the adversary is only partially adaptive. We present both positive and negative results for each of these two models. In particular, we present generic reductions from each of these models to the oblivious model. This allows us to design robust algorithms with significantly improved space complexity compared to what is known in the plain adversarial model.
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.
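The dependence of an arm's reward on its recent plays can be made concrete with a toy environment. This is a sketch under stated assumptions, not the model analyzed in the paper: the class name `CongestedArms`, the linear `penalty` per recent play, and the deterministic rewards are all hypothetical choices for illustration. Only the idea of a sliding window over the past $\Delta$ timesteps comes from the abstract.

```python
from collections import deque

class CongestedArms:
    """Toy environment: an arm pays less the more it was played recently."""

    def __init__(self, base_rewards, delta, penalty):
        self.base_rewards = base_rewards
        self.penalty = penalty  # hypothetical reward loss per recent play
        # Keeps only the most recent `delta` actions (delta must be >= 1).
        self.history = deque(maxlen=delta)

    def congestion(self, arm):
        """Number of times `arm` was played in the last `delta` timesteps."""
        return sum(1 for past in self.history if past == arm)

    def pull(self, arm):
        """Return the congestion-adjusted reward, then record the action."""
        reward = self.base_rewards[arm] - self.penalty * self.congestion(arm)
        self.history.append(arm)
        return reward
```

Because each pull changes the window seen by later pulls, greedily playing the arm with the best base reward is not necessarily optimal, which is why the abstract stresses planning and policy regret.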
    Energy Prediction using Federated Learning. (arXiv:2301.09165v1 [cs.LG])
    In this work, we demonstrate the viability of using federated learning to successfully predict energy consumption as well as solar production for all households within a certain network using low-power and low-space consuming embedded devices. We also demonstrate our prediction performance improving over time without the need for sharing private consumer energy data. We simulate a system with four nodes using data for one year to show this.
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call "doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessitates communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback or to bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets.
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.
    Towards NeuroAI: Introducing Neuronal Diversity into Artificial Neural Networks. (arXiv:2301.09245v1 [cs.NE])
    Throughout history, the development of artificial intelligence, particularly artificial neural networks, has been open to and constantly inspired by an increasingly deep understanding of the brain; one example is the neocognitron, the pioneering work behind convolutional neural networks. In line with the motives of the emerging field of NeuroAI, a great amount of neuroscience knowledge can help catalyze the next generation of AI by endowing a network with more powerful capabilities. As we know, the human brain has numerous morphologically and functionally different neurons, while artificial neural networks are almost exclusively built on a single neuron type. In the human brain, neuronal diversity is an enabling factor for all kinds of biological intelligent behaviors. Since an artificial network is a miniature of the human brain, introducing neuronal diversity should be valuable in terms of addressing essential problems of artificial networks such as efficiency, interpretability, and memory. In this Primer, we first discuss the preliminaries of biological neuronal diversity and the characteristics of information transmission and processing in a biological neuron. Then, we review studies of designing new neurons for artificial networks. Next, we discuss what gains neuronal diversity can bring to artificial networks and exemplary applications in several important fields. Lastly, we discuss the challenges and future directions of neuronal diversity to explore the potential of NeuroAI.
    MEMO : Accelerating Transformers with Memoization on Big Memory Systems. (arXiv:2301.09262v1 [cs.PF])
    Transformers have gained popularity because of their superior prediction accuracy and inference throughput. However, the transformer is computation-intensive, causing a long inference time. Existing work to accelerate transformer inference is limited because it requires changes to transformer architectures or specialized hardware. In this paper, we identify opportunities for using memoization to accelerate the attention mechanism in transformers without the above limitations. Building upon the observation that there is rich similarity in attention computation across inference sequences, we build an attention database upon the emerging big memory system. We introduce an embedding technique to find semantically similar inputs and thereby identify computation similarity. We also introduce a series of techniques such as memory mapping and selective memoization to avoid memory copy and unnecessary overhead. We enable a 21% performance improvement on average (up to 68%) with the TB-scale attention database and with negligible loss in inference accuracy.
    Practical Adversarial Attacks Against AI-Driven Power Allocation in a Distributed MIMO Network. (arXiv:2301.09305v1 [eess.SP])
    In distributed multiple-input multiple-output (D-MIMO) networks, power control is crucial to optimize the spectral efficiencies of users, and max-min fairness (MMF) power control is a commonly used strategy as it provides uniform quality of service to all users. The optimal solution of MMF power control requires high-complexity operations, and hence deep neural network based artificial intelligence (AI) solutions have been proposed to decrease the complexity. Although quite accurate models can be achieved by using AI, these models have some intrinsic vulnerabilities to adversarial attacks, where carefully crafted perturbations are applied to the input of the AI model. In this work, we show that threats against the target AI model, which might originate from malicious users or radio units, can substantially decrease the network performance by applying a successful adversarial sample, even in the most constrained circumstances. We also demonstrate that the risk associated with these kinds of adversarial attacks is higher than that of conventional attack threats. Detailed simulations reveal the effectiveness of adversarial attacks and the necessity of smart defense techniques.
    BallGAN: 3D-aware Image Synthesis with a Spherical Background. (arXiv:2301.09091v1 [cs.CV])
    3D-aware GANs aim to synthesize realistic 3D scenes such that they can be rendered in arbitrary perspectives to produce images. Although previous methods produce realistic images, they suffer from unstable training or degenerate solutions where the 3D geometry is unnatural. We hypothesize that the 3D geometry is underdetermined due to insufficient constraints, i.e., being classified as a real image by the discriminator is not enough. To solve this problem, we propose to approximate the background as a spherical surface and represent a scene as a union of the foreground placed in the sphere and the thin spherical background. This reduces the degrees of freedom in the background field. Accordingly, we modify the volume rendering equation and incorporate dedicated constraints to design a novel 3D-aware GAN framework named BallGAN. BallGAN has multiple advantages as follows. 1) It produces more reasonable 3D geometry; the images of a scene across different viewpoints have better photometric consistency and fidelity than the state-of-the-art methods. 2) The training becomes much more stable. 3) The foreground can be separately rendered on top of different arbitrary backgrounds.
    Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference. (arXiv:2301.09254v1 [cs.CV])
    The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual efforts in identifying the same. Based on this sensitivity, we then present SENet, a three-stage training method that for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer's activation map, and trains a model with significantly fewer ReLUs to potentially yield latency and communication efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ~2x fewer ReLUs while yielding similar accuracy. For a similar ReLU budget SENet can yield models with ~2.32% improved classification accuracy, evaluated on CIFAR-100.
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    Condition monitoring and anomaly detection in cyber-physical systems. (arXiv:2301.09030v1 [cs.LG])
    The modern industrial environment is being equipped with myriad smart manufacturing machines where the state of each device can be monitored continuously. Such monitoring can help identify possible future failures and develop a cost-effective maintenance plan. However, it is a daunting task to perform early detection with low false positives and negatives from the huge volume of collected data. This requires developing a holistic machine learning framework to address the issues in condition monitoring of high-priority components and developing efficient anomaly detection techniques that can detect and possibly localize the faulty components. This paper presents a comparative analysis of recent machine learning approaches for robust, cost-effective anomaly detection in cyber-physical systems. While detection has been extensively studied, very few researchers have analyzed the localization of the anomalies. We show that supervised learning outperforms unsupervised algorithms. For supervised cases, we achieve near-perfect accuracy of 98 percent (specifically for tree-based algorithms). In contrast, the best-case accuracy in the unsupervised cases was 63 percent. The area under the receiver operating characteristic curve (AUC), used as an additional metric, exhibits similar outcomes.
    Combined Use of Federated Learning and Image Encryption for Privacy-Preserving Image Classification with Vision Transformer. (arXiv:2301.09255v1 [cs.CV])
    In recent years, privacy-preserving methods for deep learning have become an urgent problem. Accordingly, we propose the combined use of federated learning (FL) and encrypted images for privacy-preserving image classification with the vision transformer (ViT). The proposed method allows us not only to train models over multiple participants without directly sharing their raw data but also, for the first time, to protect the privacy of test (query) images. In addition, it can maintain the same accuracy as normally trained models. In an experiment, the proposed method was demonstrated to work well without any performance degradation on the CIFAR-10 and CIFAR-100 datasets.
    Statistically Optimal Robust Mean and Covariance Estimation for Anisotropic Gaussians. (arXiv:2301.09024v1 [math.ST])
    Assume that $X_{1}, \ldots, X_{N}$ is an $\varepsilon$-contaminated sample of $N$ independent Gaussian vectors in $\mathbb{R}^d$ with mean $\mu$ and covariance $\Sigma$. In the strong $\varepsilon$-contamination model we assume that the adversary replaced an $\varepsilon$ fraction of vectors in the original Gaussian sample by any other vectors. We show that there is an estimator $\widehat \mu$ of the mean satisfying, with probability at least $1 - \delta$, a bound of the form \[ \|\widehat{\mu} - \mu\|_2 \le c\left(\sqrt{\frac{\operatorname{Tr}(\Sigma)}{N}} + \sqrt{\frac{\|\Sigma\|\log(1/\delta)}{N}} + \varepsilon\sqrt{\|\Sigma\|}\right), \] where $c > 0$ is an absolute constant and $\|\Sigma\|$ denotes the operator norm of $\Sigma$. In the same contaminated Gaussian setup, we construct an estimator $\widehat \Sigma$ of the covariance matrix $\Sigma$ that satisfies, with probability at least $1 - \delta$, \[ \left\|\widehat{\Sigma} - \Sigma\right\| \le c\left(\sqrt{\frac{\|\Sigma\|\operatorname{Tr}(\Sigma)}{N}} + \|\Sigma\|\sqrt{\frac{\log(1/\delta)}{N}} + \varepsilon\|\Sigma\|\right). \] Both results are optimal up to multiplicative constant factors. Despite the recent significant interest in robust statistics, achieving both dimension-free bounds in the canonical Gaussian case remained open. In fact, several previously known results were either dimension-dependent and required $\Sigma$ to be close to identity, or had a sub-optimal dependence on the contamination level $\varepsilon$. As a part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions.
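To see how the three terms of the mean-estimation bound trade off, the right-hand side can be evaluated numerically. The sketch below is only an illustration of the stated formula: the absolute constant $c$ is not given in the abstract, so `c=1.0` is a placeholder and the output is meaningful only for comparing the relative sizes of the terms. The function name and the choice to pass the eigenvalues of $\Sigma$ (so that the trace is their sum and the operator norm is their maximum) are assumptions made for the example.

```python
import math

def robust_mean_error_bound(eigenvalues, n_samples, delta, epsilon, c=1.0):
    """Evaluate c * (sqrt(Tr/N) + sqrt(||S|| log(1/delta)/N) + eps * sqrt(||S||)).

    eigenvalues: eigenvalues of the covariance matrix (nonnegative).
    c: placeholder for the unspecified absolute constant in the bound.
    """
    trace = sum(eigenvalues)      # Tr(Sigma)
    op_norm = max(eigenvalues)    # ||Sigma|| for a covariance matrix
    sampling_term = math.sqrt(trace / n_samples)
    confidence_term = math.sqrt(op_norm * math.log(1.0 / delta) / n_samples)
    contamination_term = epsilon * math.sqrt(op_norm)
    return c * (sampling_term + confidence_term + contamination_term)
```

Note that the dimension $d$ never appears explicitly: it enters only through $\operatorname{Tr}(\Sigma)$, which is what makes the bound dimension-free for anisotropic covariances whose trace is much smaller than $d\,\|\Sigma\|$.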
    Self Reward Design with Fine-grained Interpretability. (arXiv:2112.15034v3 [cs.LG] UPDATED)
    The black-box nature of deep neural networks (DNN) has brought to attention the issues of transparency and fairness. Deep Reinforcement Learning (Deep RL or DRL), which uses DNN to learn its policy, value functions, etc., is thus also subject to similar concerns. This paper proposes a way to circumvent the issues through the bottom-up design of neural networks with detailed interpretability, where each neuron or layer has its own meaning and utility that corresponds to a humanly understandable concept. The framework introduced in this paper is called the Self Reward Design (SRD), inspired by the Inverse Reward Design, and this interpretable design can (1) solve the problem by pure design (although imperfectly) and (2) be optimized like a standard DNN. With deliberate human designs, we show that some RL problems such as lavaland and MuJoCo can be solved using a model constructed with standard NN components with few parameters. Furthermore, with our fish sale auction example, we demonstrate how SRD is used to address situations that will not make sense if black-box models are used, where humanly understandable, semantic-based decisions are required.
    Unifying Synergies between Self-supervised Learning and Dynamic Computation. (arXiv:2301.09164v1 [cs.LG])
    Self-supervised learning (SSL) approaches have made major strides forward by emulating the performance of their supervised counterparts on several computer vision benchmarks. This, however, comes at a cost of substantially larger model sizes and computationally expensive training strategies, which eventually lead to larger inference times, making them impractical for resource-constrained industrial settings. Techniques like knowledge distillation (KD), dynamic computation (DC), and pruning are often used to obtain a lightweight sub-network, which usually involves multiple epochs of fine-tuning of a large pre-trained model, making it more computationally challenging. In this work we propose a novel perspective on the interplay between SSL and DC paradigms that can be leveraged to simultaneously learn a dense and gated (sparse/lightweight) sub-network from scratch, offering a good accuracy-efficiency trade-off and therefore yielding a generic and multi-purpose architecture for application-specific industrial settings. Our study overall conveys a constructive message: exhaustive experiments on several image classification benchmarks (CIFAR-10, STL-10, CIFAR-100, and ImageNet-100) demonstrate that the proposed training strategy provides a dense and corresponding sparse sub-network that achieves performance on par with the vanilla self-supervised setting, but at a significant reduction in computation in terms of FLOPs under a range of target budgets.
    Raw or Cooked? Object Detection on RAW Images. (arXiv:2301.08965v1 [cs.CV])
    Images fed to a deep neural network have in general undergone several handcrafted image signal processing (ISP) operations, all of which have been optimized to produce visually pleasing images. In this work, we investigate the hypothesis that the intermediate representation of visually pleasing images is sub-optimal for downstream computer vision tasks compared to the RAW image representation. We suggest that the operations of the ISP instead should be optimized towards the end task, by learning the parameters of the operations jointly during training. We extend previous works on this topic and propose a new learnable operation that enables an object detector to achieve superior performance when compared to both previous works and traditional RGB images. In experiments on the open PASCALRAW dataset, we empirically confirm our hypothesis.
    Self-Supervised Image Representation Learning: Transcending Masking with Paired Image Overlay. (arXiv:2301.09299v1 [cs.CV])
    Self-supervised learning has become a popular approach in recent years for its ability to learn meaningful representations without the need for data annotation. This paper proposes a novel image augmentation technique, overlaying images, which has not been widely applied in self-supervised learning. This method is designed to provide better guidance for the model to understand underlying information, resulting in more useful representations. The proposed method is evaluated using contrastive learning, a widely used self-supervised learning method that has shown solid performance in downstream tasks. The results demonstrate the effectiveness of the proposed augmentation technique in improving the performance of self-supervised models.
    Improving Deep Neural Network Classification Confidence using Heatmap-based eXplainable AI. (arXiv:2201.00009v3 [cs.LG] UPDATED)
    This paper quantifies the quality of heatmap-based eXplainable AI (XAI) methods with respect to the image classification problem. Here, a heatmap is considered desirable if it improves the probability of predicting the correct classes. Different XAI heatmap-based methods are empirically shown to improve classification confidence to different extents depending on the datasets, e.g. Saliency works best on ImageNet and Deconvolution on the Chest X-Ray Pneumonia dataset. The novelty includes a new gap distribution that shows a stark difference between correct and wrong predictions. Finally, the generative augmentative explanation is introduced, a method to generate heatmaps capable of improving predictive confidence to a high level.
    Adapting a Language Model While Preserving its General Knowledge. (arXiv:2301.08986v1 [cs.CL])
    Domain-adaptive pre-training (or DA-training for short), also known as post-training, aims to train a pre-trained general-purpose language model (LM) using an unlabeled corpus of a particular domain to adapt the LM so that end-tasks in the domain can give improved performances. However, existing DA-training methods are in some sense blind as they do not explicitly identify what knowledge in the LM should be preserved and what should be changed by the domain corpus. This paper shows that the existing methods are suboptimal and proposes a novel method to perform a more informed adaptation of the knowledge in the LM by (1) soft-masking the attention heads based on their importance to best preserve the general knowledge in the LM and (2) contrasting the representations of the general and the full (both general and domain knowledge) to learn an integrated representation with both general and domain-specific knowledge. Experimental results will demonstrate the effectiveness of the proposed approach.
    The Shape of Explanations: A Topological Account of Rule-Based Explanations in Machine Learning. (arXiv:2301.09042v1 [cs.LG])
    Rule-based explanations provide simple reasons explaining the behavior of machine learning classifiers at given points in the feature space. Several recent methods (Anchors, LORE, etc.) purport to generate rule-based explanations for arbitrary or black-box classifiers. But what makes these methods work in general? We introduce a topological framework for rule-based explanation methods and provide a characterization of explainability in terms of the definability of a classifier relative to an explanation scheme. We employ this framework to consider various explanation schemes and argue that the preferred scheme depends on how much the user knows about the domain and the probability measure over the feature space.
    Efficient Training Under Limited Resources. (arXiv:2301.09264v1 [cs.LG])
    Training time budget and size of the dataset are among the factors affecting the performance of a Deep Neural Network (DNN). This paper shows that Neural Architecture Search (NAS), Hyper Parameters Optimization (HPO), and Data Augmentation help DNNs perform much better while these two factors are limited. However, searching for an optimal architecture and the best hyperparameter values besides a good combination of data augmentation techniques under low resources requires many experiments. We present our approach to achieving such a goal in three steps: reducing training epoch time by compressing the model while maintaining the performance compared to the original model, preventing model overfitting when the dataset is small, and performing the hyperparameter tuning. We used NOMAD, which is a blackbox optimization software based on a derivative-free algorithm to do NAS and HPO. Our work achieved an accuracy of 86.0% on a tiny subset of Mini-ImageNet at the ICLR 2021 Hardware Aware Efficient Training (HAET) Challenge and won second place in the competition. The competition results can be found at haet2021.github.io/challenge and our source code can be found at github.com/DouniaLakhmiri/ICLR_HAET2021.
    Pre-text Representation Transfer for Deep Learning with Limited Imbalanced Data : Application to CT-based COVID-19 Detection. (arXiv:2301.08888v1 [eess.IV])
    Annotating medical images for disease detection is often tedious and expensive. Moreover, the available training samples for a given task are generally scarce and imbalanced. These conditions are not conducive to learning effective deep neural models. Hence, it is common to 'transfer' neural networks trained on natural images to the medical image domain. However, this paradigm underperforms due to the large domain gap between the natural and medical image data. To address that, we propose a novel concept of Pre-text Representation Transfer (PRT). In contrast to conventional transfer learning, which fine-tunes a source model after replacing its classification layers, PRT retains the original classification layers and updates the representation layers through an unsupervised pre-text task. The task is performed with (original, not synthetic) medical images, without utilizing any annotations. This enables representation transfer with a large amount of training data. This high-fidelity representation transfer allows us to use the resulting model as a more effective feature extractor. Moreover, we can also subsequently perform traditional transfer learning with this model. We devise a collaborative representation based classification layer for the case when we leverage the model as a feature extractor. We fuse the output of this layer with the predictions of a model induced with traditional transfer learning performed over our pre-text transferred model. The utility of our technique for the limited and imbalanced data classification problem is demonstrated with an extensive five-fold evaluation for three large-scale models, tested for five different class-imbalance ratios for CT-based COVID-19 detection. Our results show a consistent gain over conventional transfer learning with the proposed method.
    Spatial Attention Kinetic Networks with E(n)-Equivariance. (arXiv:2301.08893v1 [cs.LG])
    Neural networks that are equivariant to rotations, translations, reflections, and permutations on n-dimensional geometric space have shown promise in physical modeling for tasks ranging from accurately but inexpensively modeling complex potential energy surfaces to guiding the sampling of complex dynamical systems or forecasting their time evolution. Current state-of-the-art methods employ spherical harmonics to encode higher-order interactions among particles, which are computationally expensive. In this paper, we propose a simple alternative functional form that uses neurally parametrized linear combinations of edge vectors to achieve equivariance while still universally approximating node environments. Incorporating this insight, we design spatial attention kinetic networks with E(n)-equivariance, or SAKE, which are competitive in many-body system modeling tasks while being significantly faster.
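    The idea that linear combinations of edge vectors are equivariant can be checked numerically. The sketch below is not the SAKE architecture: `toy_weight` is a hypothetical fixed function standing in for a learned network, and `edge_vector_update` and `rotate2d` are illustrative names. It only shows that weighting each edge vector by a function of the (invariant) distance gives an output that rotates and translates with the input.

```python
import math


def toy_weight(dist):
    # Hypothetical stand-in for a learned, distance-dependent weight.
    return math.exp(-dist)


def edge_vector_update(points, weight_fn):
    """For each particle i, return sum over j != i of w(|x_i - x_j|) * (x_i - x_j)."""
    updates = []
    for i, xi in enumerate(points):
        acc = [0.0] * len(xi)
        for j, xj in enumerate(points):
            if i == j:
                continue
            edge = [a - b for a, b in zip(xi, xj)]          # edge vector x_i - x_j
            dist = math.sqrt(sum(e * e for e in edge))      # rotation-invariant scalar
            w = weight_fn(dist)
            acc = [s + w * e for s, e in zip(acc, edge)]
        updates.append(acc)
    return updates


def rotate2d(p, theta):
    """Rotate a 2D point counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1]]
```

    Rotating the inputs and then applying `edge_vector_update` gives the same result as applying it first and rotating the outputs; translating all inputs leaves the outputs unchanged because only differences of positions are used.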
    The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation. (arXiv:2301.08968v1 [cs.LG])
    Heterogeneity of data distributed across clients limits the performance of global models trained through federated learning, especially in settings with highly imbalanced class distributions of local datasets. In recent years, personalized federated learning (pFL) has emerged as a potential solution to the challenges presented by heterogeneous data. However, existing pFL methods typically enhance performance of local models at the expense of the global model's accuracy. We propose FedHKD (Federated Hyper-Knowledge Distillation), a novel FL algorithm in which clients rely on knowledge distillation (KD) to train local models. In particular, each client extracts and sends to the server the means of local data representations and the corresponding soft predictions -- information that we refer to as "hyper-knowledge". The server aggregates this information and broadcasts it to the clients in support of local training. Notably, unlike other KD-based pFL methods, FedHKD neither relies on a public dataset nor deploys a generative model at the server. We analyze convergence of FedHKD and conduct extensive experiments on visual datasets in a variety of scenarios, demonstrating that FedHKD provides significant improvement in both personalized and global model performance compared to state-of-the-art FL methods designed for heterogeneous data settings.
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much greater flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely the time series clustering and the uncertainty-guided exploration for sequential decision making.
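    For orientation, here is a minimal sketch of the classic (unconditional) CS divergence, D_CS(p; q) = -log((∫pq)^2 / (∫p^2 ∫q^2)), estimated from 1D samples with a Gaussian kernel density estimator. It does not implement the conditional extension proposed in the paper. The function names and the fixed bandwidth `sigma` are assumptions made for illustration; the closed form uses the fact that the integral of a product of two Gaussian kernels of width sigma is a Gaussian of variance 2*sigma^2 evaluated at the difference of their centers.

```python
import math


def gaussian(d, var):
    """Gaussian density with variance `var`, evaluated at offset `d`."""
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def information_potential(xs, ys, sigma):
    """Estimate the integral of p(x) q(x) from samples xs ~ p and ys ~ q via KDE."""
    var = 2.0 * sigma * sigma  # convolution of two width-sigma kernels
    total = sum(gaussian(x - y, var) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def cs_divergence(xs, ys, sigma=1.0):
    """Classic CS divergence between the KDEs of two 1D samples (assumed sketch)."""
    pq = information_potential(xs, ys, sigma)
    pp = information_potential(xs, xs, sigma)
    qq = information_potential(ys, ys, sigma)
    return -2.0 * math.log(pq) + math.log(pp) + math.log(qq)
```

    By the Cauchy-Schwarz inequality the value is non-negative and is zero when both samples yield the same density estimate. For samples that are very far apart relative to `sigma`, the cross term can underflow to zero, so a practical version would work in log space.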
    Classification of Luminal Subtypes in Full Mammogram Images Using Transfer Learning. (arXiv:2301.09282v1 [eess.IV])
    Automatic identification of patients with luminal and non-luminal subtypes during a routine mammography screening can support clinicians in streamlining breast cancer therapy planning. Recent machine learning techniques have shown promising results in molecular subtype classification in mammography; however, they are highly dependent on pixel-level annotations, handcrafted, and radiomic features. In this work, we provide initial insights into the luminal subtype classification in full mammogram images trained using only image-level labels. Transfer learning is applied from a breast abnormality classification task, to finetune a ResNet-18-based luminal versus non-luminal subtype classification task. We present and compare our results on the publicly available CMMD dataset and show that our approach significantly outperforms the baseline classifier by achieving a mean AUC score of 0.6688 and a mean F1 score of 0.6693 on the test dataset. The improvement over baseline is statistically significant, with a p-value of p<0.0001.
    A Semantic Modular Framework for Events Topic Modeling in Social Media. (arXiv:2301.09009v1 [cs.LG])
    The advancement of social media contributes to the growing amount of content that users share. These platforms provide a place for people to report various real-life events. Detecting these events with the help of natural language processing has received researchers' attention, and various algorithms have been developed for this goal. In this paper, we propose a Semantic Modular Model (SMM) consisting of 5 different modules, namely Distributional Denoising Autoencoder, Incremental Clustering, Semantic Denoising, Defragmentation, and Ranking and Processing. The proposed model aims to (1) cluster various documents and ignore the documents that might not contribute to the identification of events, and (2) identify more important and descriptive keywords. Compared to the state-of-the-art methods, the results show that the proposed model has a higher performance in identifying events with lower ranks and extracting keywords for more important events in three English Twitter datasets: FACup, SuperTuesday, and USElection. The proposed method outperformed the best reported results in the mean keyword-precision metric by 7.9%.
    Leveraging Speaker Embeddings with Adversarial Multi-task Learning for Age Group Classification. (arXiv:2301.09058v1 [eess.AS])
    Recently, researchers have utilized neural network-based speaker embedding techniques in speaker-recognition tasks to identify speakers accurately. However, speaker-discriminative embeddings do not always represent speech features such as age group well. In an embedding model that has been highly trained to capture speaker traits, the task of age group classification is closer to speech information leakage. Hence, to improve age group classification performance, we consider the use of speaker-discriminative embeddings derived from adversarial multi-task learning to align features and reduce the domain discrepancy in age subgroups. In addition, we investigated different types of speaker embeddings to learn and generalize the domain-invariant representations for age groups. Experimental results on the VoxCeleb Enrichment dataset verify the effectiveness of our proposed adaptive adversarial network in multi-objective scenarios and leveraging speaker embeddings for the domain adaptation task.
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.
    Is Nash Equilibrium Approximator Learnable?. (arXiv:2108.07472v5 [cs.GT] UPDATED)
    In this paper, we investigate the learnability of the function approximator that approximates Nash equilibrium (NE) for games generated from a distribution. First, we offer a generalization bound using the Probably Approximately Correct (PAC) learning model. The bound describes the gap between the expected loss and empirical loss of the NE approximator. Afterward, we prove the agnostic PAC learnability of the Nash approximator. In addition to theoretical analysis, we demonstrate an application of NE approximator in experiments. The trained NE approximator can be used to warm-start and accelerate classical NE solvers. Together, our results show the practicability of approximating NE through function approximation.
    Regeneration Learning: A Learning Paradigm for Data Generation. (arXiv:2301.08846v1 [cs.LG])
    Machine learning methods for conditional data generation usually build a mapping from source conditional data X to target data Y. The target Y (e.g., text, speech, music, image, video) is usually high-dimensional and complex, and contains information that does not exist in source data, which hinders effective and efficient learning on the source-target mapping. In this paper, we present a learning paradigm called regeneration learning for data generation, which first generates Y' (an abstraction/representation of Y) from X and then generates Y from Y'. During training, Y' is obtained from Y through either handcrafted rules or self-supervised learning and is used to learn X-->Y' and Y'-->Y. Regeneration learning extends the concept of representation learning to data generation tasks, and can be regarded as a counterpart of traditional representation learning, since 1) regeneration learning handles the abstraction (Y') of the target data Y for data generation while traditional representation learning handles the abstraction (X') of source data X for data understanding; 2) both the processes of Y'-->Y in regeneration learning and X-->X' in representation learning can be learned in a self-supervised way (e.g., pre-training); 3) both the mappings from X to Y' in regeneration learning and from X' to Y in representation learning are simpler than the direct mapping from X to Y. We show that regeneration learning can be a widely-used paradigm for data generation (e.g., text generation, speech recognition, speech synthesis, music composition, image generation, and video generation) and can provide valuable insights into developing data generation methods.
    Debiasing the Cloze Task in Sequential Recommendation with Bidirectional Transformers. (arXiv:2301.09210v1 [cs.LG])
    Bidirectional Transformer architectures are state-of-the-art sequential recommendation models that use a bi-directional representation capacity based on the Cloze task, a.k.a. Masked Language Modeling. The latter aims to predict randomly masked items within the sequence. Because they assume that the true interacted item is the most relevant one, an exposure bias results, where non-interacted items with low exposure propensities are assumed to be irrelevant. The most common approach to mitigating exposure bias in recommendation has been Inverse Propensity Scoring (IPS), which consists of down-weighting the interacted predictions in the loss function in proportion to their propensities of exposure, yielding a theoretically unbiased learning. In this work, we argue and prove that IPS does not extend to sequential recommendation because it fails to account for the temporal nature of the problem. We then propose a novel propensity scoring mechanism, which can theoretically debias the Cloze task in sequential recommendation. Finally we empirically demonstrate the debiasing capabilities of our proposed approach and its robustness to the severity of exposure bias.
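    As background for the argument above, standard Inverse Propensity Scoring reweights each observed interaction's loss by the inverse of its exposure propensity. The sketch below shows only this conventional, non-sequential IPS estimator, not the propensity mechanism the paper proposes. The function name and the `clip` floor on propensities are illustrative assumptions.

```python
def ips_weighted_loss(losses, propensities, clip=1e-3):
    """Average of per-interaction losses, each divided by its exposure propensity.

    `clip` is an assumed lower bound on propensities that keeps the
    weights from exploding when an item was almost never exposed.
    """
    if len(losses) != len(propensities):
        raise ValueError("losses and propensities must have the same length")
    if not losses:
        raise ValueError("at least one interaction is required")
    total = 0.0
    for loss, p in zip(losses, propensities):
        total += loss / max(p, clip)  # rarely exposed items get larger weight
    return total / len(losses)
```

    With equal raw losses, an interaction with propensity 0.25 contributes twice as much as one with propensity 0.5, which is how IPS compensates for items that users rarely had the chance to see.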
    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.
    DeepFEL: Deep Fastfood Ensemble Learning for Histopathology Image Analysis. (arXiv:2301.09525v1 [eess.IV])
    Computational pathology tasks have some unique characteristics such as multi-gigapixel images, tedious and frequently uncertain annotations, and unavailability of a large number of cases [13]. To address some of these issues, we present Deep Fastfood Ensembles - a simple, fast and yet effective method for combining deep features pooled from popular CNN models pre-trained on totally different source domains (e.g., natural image objects) and projected onto diverse dimensions using random projections, the so-called Fastfood [11]. The final ensemble output is obtained by a consensus of simple individual classifiers, each of which is trained on a different collection of random basis vectors. This offers an extremely fast and yet effective solution, especially when training times and domain labels are of the essence. We demonstrate the effectiveness of the proposed deep fastfood ensemble learning as compared to the state-of-the-art methods for three different tasks in histopathology image analysis.
    Accelerating Fair Federated Learning: Adaptive Federated Adam. (arXiv:2301.09357v1 [cs.LG])
    Federated learning is a distributed and privacy-preserving approach to train a statistical model collaboratively from decentralized data of different parties. However, when datasets of participants are not independent and identically distributed (non-IID), models trained by naive federated algorithms may be biased towards certain participants, and model performance across participants is non-uniform. This is known as the fairness problem in federated learning. In this paper, we formulate fairness-controlled federated learning as a dynamical multi-objective optimization problem to ensure fair performance across all participants. To solve the problem efficiently, we study the convergence and bias of Adam as the server optimizer in federated learning, and propose Adaptive Federated Adam (AdaFedAdam) to accelerate fair federated learning with alleviated bias. We validated the effectiveness, Pareto optimality and robustness of AdaFedAdam in numerical experiments and show that AdaFedAdam outperforms existing algorithms, providing better convergence and fairness properties of the federated scheme.
    A Structural Approach to the Design of Domain Specific Neural Network Architectures. (arXiv:2301.09381v1 [cs.LG])
    This is a master's thesis concerning the theoretical ideas of geometric deep learning. Geometric deep learning aims to provide a structured characterization of neural network architectures, specifically focused on the ideas of invariance and equivariance of data with respect to given transformations. This thesis aims to provide a theoretical evaluation of geometric deep learning, compiling theoretical results that characterize the properties of invariant neural networks with respect to learning performance.
    HALOC: Hardware-Aware Automatic Low-Rank Compression for Compact Neural Networks. (arXiv:2301.09422v1 [cs.LG])
    Low-rank compression is an important model compression strategy for obtaining compact neural network models. In general, because the rank values directly determine the model complexity and model accuracy, proper selection of layer-wise rank is critical. To date, though many low-rank compression approaches, either selecting the ranks in a manual or automatic way, have been proposed, they suffer from costly manual trials or unsatisfactory compression performance. In addition, none of the existing works is designed in a hardware-aware way, limiting the practical performance of the compressed models on real-world hardware platforms. To address these challenges, in this paper we propose HALOC, a hardware-aware automatic low-rank compression framework. By interpreting automatic rank selection from an architecture search perspective, we develop an end-to-end solution to determine the suitable layer-wise ranks in a differentiable and hardware-aware way. We further propose design principles and a mitigation strategy to efficiently explore the rank space and reduce the potential interference problem. Experimental results on different datasets and hardware platforms demonstrate the effectiveness of our proposed approach. On the CIFAR-10 dataset, HALOC enables 0.07% and 0.38% accuracy increases over the uncompressed ResNet-20 and VGG-16 models with 72.20% and 86.44% fewer FLOPs, respectively. On the ImageNet dataset, HALOC achieves 0.9% higher top-1 accuracy than the original ResNet-18 model with 66.16% fewer FLOPs. HALOC also shows 0.66% higher top-1 accuracy than the state-of-the-art automatic low-rank compression solution with fewer computational and memory costs. In addition, HALOC demonstrates practical speedups on different hardware platforms, verified by the measurement results on desktop GPU, embedded GPU and ASIC accelerator.
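    To make concrete why the rank determines model complexity, the sketch below counts parameters for a single fully connected layer of shape m x n before and after replacing its weight matrix with a rank-r product of an m x r and an r x n matrix. This is generic low-rank arithmetic, not HALOC's rank-selection procedure, and it counts parameters rather than the FLOPs reported in the abstract. All function names are illustrative.

```python
def dense_params(m, n):
    """Parameters in a dense m x n weight matrix."""
    return m * n


def low_rank_params(m, n, r):
    """Parameters after factorizing into (m x r) and (r x n) matrices."""
    return r * (m + n)


def break_even_rank(m, n):
    """Rank below which the factorized layer is smaller than the dense one."""
    return (m * n) / (m + n)


def param_reduction(m, n, r):
    """Fraction of parameters removed by a rank-r factorization."""
    return 1.0 - low_rank_params(m, n, r) / dense_params(m, n)
```

    For a 512 x 512 layer, rank 64 keeps 65,536 of 262,144 parameters (a 75% reduction), and any rank at or above 256 gives no saving at all. This is why per-layer rank choice trades size against accuracy.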
    StockEmotions: Discover Investor Emotions for Financial Sentiment Analysis and Multivariate Time Series. (arXiv:2301.09279v1 [cs.CL])
    There has been growing interest in applying NLP techniques in the financial domain; however, resources are extremely limited. This paper introduces StockEmotions, a new dataset for detecting emotions in the stock market that consists of 10,000 English comments collected from StockTwits, a financial social media platform. Inspired by behavioral finance, it proposes 12 fine-grained emotion classes that span the roller coaster of investor emotion. Unlike existing financial sentiment datasets, StockEmotions presents granular features such as investor sentiment classes, fine-grained emotions, emojis, and time series data. To demonstrate the usability of the dataset, we perform a dataset analysis and conduct experimental downstream tasks. For financial sentiment/emotion classification tasks, DistilBERT outperforms other baselines, and for multivariate time series forecasting, a Temporal Attention LSTM model combining price index, text, and emotion features achieves better performance than using a single feature.
    Logical Message Passing Networks with One-hop Inference on Atomic Formulas. (arXiv:2301.08859v1 [cs.LG])
    Complex Query Answering (CQA) over Knowledge Graphs (KGs) has attracted a lot of attention to potentially support many applications. Given that KGs are usually incomplete, neural models are proposed to answer logical queries by parameterizing set operators with complex neural networks. However, such methods usually train neural set operators with a large number of entity and relation embeddings from zero, where whether and how the embeddings or the neural set operators contribute to the performance remains unclear. In this paper, we propose a simple framework for complex query answering that decomposes the KG embeddings from neural set operators. We propose to represent the complex queries in the query graph. On top of the query graph, we propose the Logical Message Passing Neural Network (LMPNN) that connects the local one-hop inferences on atomic formulas to the global logical reasoning for complex query answering. We leverage existing effective KG embeddings to conduct one-hop inferences on atomic formulas, the results of which are regarded as the messages passed in LMPNN. The reasoning process over the overall logical formulas is turned into the forward pass of LMPNN that incrementally aggregates local information to predict the answers' embeddings finally. The complex logical inference across different types of queries will then be learned from training examples based on the LMPNN architecture. Theoretically, our query-graph representation is more general than the prevailing operator-tree formulation, so our approach applies to a broader range of complex KG queries. Empirically, our approach yields a new state-of-the-art neural CQA model. Our research bridges the gap between complex KG query answering tasks and the long-standing achievements of knowledge graph representation learning.  ( 2 min )
    Impact of PCA-based preprocessing and different CNN structures on deformable registration of sonograms. (arXiv:2301.08802v1 [cs.CV])
    Central venous catheters (CVC) are commonly inserted into the large veins of the neck, e.g. the internal jugular vein (IJV). CVC insertion may cause serious complications like misplacement into an artery or perforation of cervical vessels. Placing a CVC under sonographic guidance is an appropriate method to reduce such adverse events, if anatomical landmarks like venous and arterial vessels can be detected reliably. This task shall be solved by registration of patient individual images vs. an anatomically labelled reference image. In this work, a linear, affine transformation is performed on cervical sonograms, followed by a non-linear transformation to achieve a more precise registration. Voxelmorph (VM), a learning-based library for deformable image registration using a convolutional neural network (CNN) with U-Net structure was used for non-linear transformation. The impact of principal component analysis (PCA)-based pre-denoising of patient individual images, as well as the impact of modified net structures with differing complexities on registration results were examined visually and quantitatively, the latter using metrics for deformation and image similarity. Using the PCA-approximated cervical sonograms resulted in decreased mean deformation lengths between 18% and 66% compared to their original image counterparts, depending on net structure. In addition, reducing the number of convolutional layers led to improved image similarity with PCA images, while worsening in original images. Despite a large reduction of network parameters, no overall decrease in registration quality was observed, leading to the conclusion that the original net structure is oversized for the task at hand.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: Two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic, and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
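    The restriction to conditioning sets of size at most $k$ can be illustrated by enumerating the conditional independence statements such an algorithm is allowed to test. The sketch below only lists candidate tests; it does not implement the $k$-PC algorithm, its pruning of adjacency sets, or any statistical test. The function name is hypothetical.

```python
from itertools import combinations


def bounded_ci_tests(variables, k):
    """Yield (x, y, conditioning_set) for every unordered pair of variables
    and every conditioning set of size at most k drawn from the rest."""
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        for size in range(min(k, len(others)) + 1):
            for cond in combinations(others, size):
                yield x, y, cond
```

    With four variables there are 6 pairs: `k=0` gives 6 marginal tests, `k=1` gives 18 tests, and `k=2` (the unrestricted case for four variables) gives 24. Capping `k` avoids exactly the largest conditioning sets, where tests lose the most statistical power.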
    Fast likelihood-based change point detection. (arXiv:2301.08892v1 [cs.LG])
    Change point detection plays a fundamental role in many real-world applications, where the goal is to analyze and monitor the behaviour of a data stream. In this paper, we study change detection in binary streams. To this end, we use a likelihood ratio between two models as a measure for indicating change. The first model is a single Bernoulli variable while the second model divides the stored data in two segments, and models each segment with its own Bernoulli variable. Finding the optimal split can be done in $O(n)$ time, where $n$ is the number of entries since the last change point. This is too expensive for large $n$. To combat this we propose an approximation scheme that yields $(1 - \epsilon)$ approximation in $O(\epsilon^{-1} \log^2 n)$ time. The speed-up consists of several steps: First we reduce the number of possible candidates by adopting a known result from segmentation problems. We then show that for fixed Bernoulli parameters we can find the optimal change point in logarithmic time. Finally, we show how to construct a candidate list of size $O(\epsilon^{-1} \log n)$ for model parameters. We demonstrate empirically the approximation quality and the running time of our algorithm, showing that we can gain a significant speed-up with a minimal average loss in optimality.  ( 2 min )
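    The exact $O(n)$ baseline described above can be sketched as a single scan over split points with a running count of ones. This is only the straightforward linear-time search, not the paper's approximation scheme. It assumes each segment is scored at its maximum-likelihood Bernoulli parameter, and the function names are illustrative.

```python
import math


def bernoulli_loglik(ones, total):
    """Log-likelihood of a segment at its maximum-likelihood parameter ones/total."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0  # convention: 0 * log(0) = 0
    p = ones / total
    return ones * math.log(p) + (total - ones) * math.log(1.0 - p)


def best_split(bits):
    """Return (index, gain): the split maximizing the log-likelihood ratio of the
    two-segment model over the single-segment model. Index i means the segments
    are bits[:i] and bits[i:]; index is None if no split improves the fit."""
    n = len(bits)
    total_ones = sum(bits)
    single = bernoulli_loglik(total_ones, n)
    best_idx, best_gain = None, 0.0
    left_ones = 0
    for i in range(1, n):
        left_ones += bits[i - 1]  # running count of ones in bits[:i]
        two = (bernoulli_loglik(left_ones, i)
               + bernoulli_loglik(total_ones - left_ones, n - i))
        gain = two - single
        if gain > best_gain:
            best_idx, best_gain = i, gain
    return best_idx, best_gain
```

    For a stream of ten zeros followed by ten ones, the scan returns index 10 with a gain of 20 log 2, since each half is fit perfectly while the single model must use p = 0.5.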
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite support stochastic policies, e.g., Gaussian policy, may assign riskily high dosages and harm patients seriously. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real-world dose-suggestion application on the Ohio Type 1 diabetes dataset.  ( 2 min )
    Improving Deep Regression with Ordinal Entropy. (arXiv:2301.08915v1 [cs.CV])
    In computer vision, it is often observed that formulating regression problems as a classification task yields better performance. We investigate this curious phenomenon and provide a derivation to show that classification, with the cross-entropy loss, outperforms regression with a mean squared error loss in its ability to learn high-entropy feature representations. Based on the analysis, we propose an ordinal entropy loss to encourage higher-entropy feature spaces while maintaining ordinal relationships to improve the performance of regression tasks. Experiments on synthetic and real-world regression tasks demonstrate the importance and benefits of increasing entropy for regression.  ( 2 min )
    Developing Hybrid Machine Learning Models to Assign Health Score to Railcar Fleets for Optimal Decision Making. (arXiv:2301.08877v1 [cs.LG])
    A large amount of data is generated during the operation of a railcar fleet, which can easily lead to the curse of dimensionality and reduce the resiliency of the railcar network. To solve these issues and offer predictive maintenance, this research introduces a hybrid fault diagnosis expert system method that combines density-based spatial clustering of applications with noise (DBSCAN) and principal component analysis (PCA). Firstly, the DBSCAN method is used to cluster categorical data that are similar to one another within the same group. Secondly, the PCA algorithm is applied to reduce the dimensionality of the data and eliminate redundancy in order to improve the accuracy of fault diagnosis. Finally, we explain the engineered features and evaluate the selected models by using the Gain Chart and Area Under Curve (AUC) metrics. We use the hybrid expert system model to enhance maintenance planning decisions by assigning a health score to the railcar system of the North American Railcar Owner (NARO). According to the experimental results, our expert model can detect 96.4% of failures within 50% of the sample. This suggests that our method is effective at diagnosing failures in railcar fleets.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
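    The step of adding random noise to each low-dimensional marginal can be sketched with the Laplace mechanism on a single one-way histogram. This is a generic illustration, not the specific algorithms analyzed in the paper. It assumes a sensitivity of 1 (one record added or removed changes one count by one), and the clipping and renormalization post-processing is one common choice among several. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential variables."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(counts, epsilon, rng=None, sensitivity=1.0):
    """Perturb histogram counts with Laplace noise of scale sensitivity/epsilon,
    then clip negatives and renormalize into a probability vector."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    noisy = [max(c + laplace_noise(scale, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        # Assumed fallback: all mass was clipped away, so return a uniform marginal.
        return [1.0 / len(counts)] * len(counts)
    return [v / total for v in noisy]
```

    Smaller `epsilon` means a larger noise scale, so the released marginal drifts further from the true one. That privacy/accuracy trade-off is what bounds in total variation or $L^2$ distance quantify.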
    A Communication-Efficient Adaptive Algorithm for Federated Learning under Cumulative Regret. (arXiv:2301.08869v1 [cs.LG])
    We consider the problem of online stochastic optimization in a distributed setting with $M$ clients connected through a central server. We develop a distributed online learning algorithm that achieves order-optimal cumulative regret with low communication cost measured in the total number of bits transmitted over the entire learning horizon. This is in contrast to existing studies which focus on the offline measure of simple regret for learning efficiency. The holistic measure for communication cost also departs from the prevailing approach that separately tackles the communication frequency and the number of bits in each communication round.  ( 2 min )
    Rationalization for Explainable NLP: A Survey. (arXiv:2301.08912v1 [cs.CL])
    Recent advances in deep learning have improved the performance of many Natural Language Processing (NLP) tasks such as translation, question-answering, and text classification. However, this improvement comes at the expense of model explainability. Black-box models make it difficult to understand the internals of a system and the process it takes to arrive at an output. Numerical (LIME, Shapley) and visualization (saliency heatmap) explainability techniques are helpful; however, they are insufficient because they require specialized knowledge. These factors led rationalization to emerge as a more accessible explainable technique in NLP. Rationalization justifies a model's output by providing a natural language explanation (rationale). Recent improvements in natural language generation have made rationalization an attractive technique because it is intuitive, human-comprehensible, and accessible to non-technical users. Since rationalization is a relatively new field, it is disorganized. As the first survey of the field, this work analyzes rationalization literature in NLP from 2007 to 2022. This survey presents available methods, explainable evaluations, code, and datasets used across various NLP tasks that use rationalization. Further, a new subfield in Explainable AI (XAI), namely, Rational AI (RAI), is introduced to advance the current state of rationalization. A discussion on observed insights, challenges, and future directions is provided to point to promising research opportunities.  ( 2 min )
    Federated Recommendation with Additive Personalization. (arXiv:2301.09109v1 [cs.LG])
    With rising concerns about privacy, developing recommendation systems in a federated setting has become a new paradigm for building next-generation Internet service architecture. However, existing approaches are usually derived from a distributed recommendation framework with an additional mechanism for privacy protection, so most of them fail to fully exploit personalization in the new context of federated recommendation settings. In this paper, we propose a novel approach called Federated Recommendation with Additive Personalization (FedRAP) to enhance recommendation by learning user embedding and the user's personal view of item embeddings. Specifically, the proposed additive personalization adds a personalized item embedding to a sparse global item embedding aggregated from all users. Moreover, a curriculum learning mechanism is applied for additive personalization on item embeddings by gradually increasing regularization weights to mitigate the performance degradation caused by large variances among client-specific item embeddings. A unified formulation is proposed with a sparse regularization of global item embeddings for reducing communication overhead. Experimental results on four real-world recommendation datasets demonstrate the effectiveness of FedRAP.  ( 2 min )
    Slice Transformer and Self-supervised Learning for 6DoF Localization in 3D Point Cloud Maps. (arXiv:2301.08957v1 [cs.CV])
    Precise localization is critical for autonomous vehicles. We present a self-supervised learning method that employs Transformers for the first time for the task of outdoor localization using LiDAR data. We propose a pretext task that reorganizes the slices of a $360^\circ$ LiDAR scan to leverage its axial properties. Our model, called Slice Transformer, employs multi-head attention while systematically processing the slices. To the best of our knowledge, this is the first instance of leveraging multi-head attention for outdoor point clouds. We additionally introduce the Perth-WA dataset, which provides a large-scale LiDAR map of Perth city in Western Australia, covering an area of $\sim$4km$^2$. Localization annotations are provided for Perth-WA. The proposed localization method is thoroughly evaluated on the Perth-WA and Apollo-SouthBay datasets. We also establish the efficacy of our self-supervised learning approach for the common downstream task of object classification using ModelNet40 and ScanNN datasets. The code and Perth-WA data will be publicly released.  ( 2 min )
    Estimation of Sea State Parameters from Ship Motion Responses Using Attention-based Neural Networks. (arXiv:2301.08949v1 [cs.LG])
    On-site estimation of sea state parameters is crucial for ship navigation systems' accuracy, stability, and efficiency. Extensive research has been conducted on model-based estimating methods utilizing only ship motion responses. Model-free approaches based on machine learning (ML) have recently gained popularity, and estimation from time-series of ship motion responses using deep learning (DL) methods has given promising results. Accordingly, in this study, we apply a novel attention-based neural network (AT-NN) for estimating sea state parameters (wave height, zero-crossing period, and relative wave direction) from raw time-series data of ship pitch, heave, and roll motions. Despite using reduced input data, the proposed modifications of state-of-the-art techniques (based on convolutional neural networks (CNN) for regression, multivariate long short-term memory CNN, and sliding puzzle neural network) reduced estimation MSE by 23% and MAE by 16% compared to the original methods. Furthermore, the proposed technique based on AT-NN outperformed all tested methods (original and enhanced), reducing estimation MSE by up to 94% and MAE by up to 70%. Finally, we also propose a novel approach for interpreting the uncertainty estimation of neural network outputs based on the Monte-Carlo dropout method to enhance the model's trustworthiness.  ( 2 min )
    Soft Sensing Regression Model: from Sensor to Wafer Metrology Forecasting. (arXiv:2301.08974v1 [cs.LG])
    The semiconductor industry is one of the most rapidly evolving and capital-intensive market sectors. Effective inspection and metrology are necessary to improve product yield, increase product quality and reduce costs. In recent years, much semiconductor manufacturing equipment has been fitted with sensors to facilitate real-time monitoring of the production process. These production-state and equipment-state sensor data provide an opportunity to apply machine-learning technologies in various domains, such as anomaly/fault detection, maintenance scheduling, quality prediction, etc. In this work, we focus on the task of soft sensing regression, which uses sensor data to predict impending inspection measurements that used to be measured in wafer inspection and metrology systems. We propose an LSTM-based regressor and design two loss functions for model training. Although engineers may look at our prediction errors in a subjective manner, a new piece-wise evaluation metric is proposed for assessing model accuracy in a mathematical way. The experimental results demonstrate that the proposed model can achieve accurate and early prediction of various types of inspections in complicated manufacturing processes.  ( 2 min )
    Bayesian Hierarchical Models for Counterfactual Estimation. (arXiv:2301.08833v1 [cs.LG])
    Counterfactual explanations utilize feature perturbations to analyze the outcome of an original decision and recommend an actionable recourse. We argue that it is beneficial to provide several alternative explanations rather than a single point solution and propose a probabilistic paradigm to estimate a diverse set of counterfactuals. Specifically, we treat the perturbations as random variables endowed with prior distribution functions. This allows sampling multiple counterfactuals from the posterior density, with the added benefit of incorporating inductive biases, preserving domain specific constraints and quantifying uncertainty in estimates. More importantly, we leverage Bayesian hierarchical modeling to share information across different subgroups of a population, which can both improve robustness and measure fairness. A gradient based sampler with superior convergence characteristics efficiently computes the posterior samples. Experiments across several datasets demonstrate that the counterfactuals estimated using our approach are valid, sparse, diverse and feasible.  ( 2 min )
    Dense RGB SLAM with Neural Implicit Maps. (arXiv:2301.08930v1 [cs.CV])
    There is an emerging trend of using neural implicit functions for map representation in Simultaneous Localization and Mapping (SLAM). Some pioneering works have achieved encouraging results on RGB-D SLAM. In this paper, we present a dense RGB SLAM method with neural implicit map representation. To reach this challenging goal without depth input, we introduce a hierarchical feature volume to facilitate the implicit map decoder. This design effectively fuses shape cues across different scales to facilitate map reconstruction. Our method simultaneously solves the camera motion and the neural implicit map by matching the rendered and input video frames. To facilitate optimization, we further propose a photometric warping loss in the spirit of multi-view stereo to better constrain the camera pose and scene geometry. We evaluate our method on commonly used benchmarks and compare it with modern RGB and RGB-D SLAM systems. Our method achieves more favorable results than previous methods and even surpasses some recent RGB-D SLAM methods. Our source code will be publicly available.  ( 2 min )
    SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis. (arXiv:2301.09201v1 [cs.IR])
    5G is the 5th generation cellular network protocol. It is the state-of-the-art global wireless standard that enables an advanced kind of network designed to connect virtually everyone and everything with increased speed and reduced latency. Therefore, its development, analysis, and security are critical. However, all approaches to 5G protocol development and security analysis, e.g., property extraction, protocol summarization, and semantic analysis of the protocol specifications and implementations, are completely manual. To reduce such manual effort, in this paper, we curate SPEC5G, the first-ever public 5G dataset for NLP research. The dataset contains 3,547,586 sentences with 134M words, from 13,094 cellular network specifications and 13 online websites. By leveraging large-scale pre-trained language models that have achieved state-of-the-art results on NLP tasks, we use this dataset for security-related text classification and summarization. Security-related text classification can be used to extract relevant security-related properties for protocol testing. On the other hand, summarization can help developers and practitioners understand the high level of the protocol, which is itself a daunting task. Our results show the value of our 5G-centric dataset in 5G protocol analysis automation. We believe that SPEC5G will enable a new research direction into automatic analyses for the 5G cellular network protocol and numerous related downstream tasks. Our data and code are publicly available.
    Is Signed Message Essential for Graph Neural Networks?. (arXiv:2301.08918v1 [cs.LG])
    Message-passing Graph Neural Networks (GNNs), which collect information from adjacent nodes, achieve satisfying results on homophilic graphs. However, their performance is dismal on heterophilous graphs, and many researchers have proposed a plethora of schemes to solve this problem. In particular, flipping the sign of edges is rooted in a strong theoretical foundation and attains significant performance enhancements. Nonetheless, previous analyses assume a binary class scenario, which may limit their applicability. This paper extends the prior understandings to multi-class scenarios and points out two drawbacks: (1) the sign of multi-hop neighbors depends on the message propagation paths and may incur inconsistency, (2) it also increases the prediction uncertainty (e.g., conflicting evidence), which can impede the stability of the algorithm. Based on the theoretical understanding, we introduce a novel strategy that is applicable to multi-class graphs. The proposed scheme combines confidence calibration to secure robustness while reducing uncertainty. We show the efficacy of our theorem through extensive experiments on six benchmark graph datasets.  ( 2 min )
    On the Algebraic Properties of Flame Graphs. (arXiv:2301.08941v1 [cs.SE])
    Flame graphs are a popular way of representing profiling data. In this paper we propose a possible mathematical definition of flame graphs. In doing so, we gain some interesting algebraic properties almost for free, which in turn allow us to define operations that enable an in-depth performance regression analysis. The typical documented use of a flame graph is via its graphical representation, whereby one scans the picture for the largest plateaux. Whilst this method is effective at finding the main sources of performance issues, it leaves quite a large amount of data potentially unused. By combining a mathematically precise definition of flame graphs with some statistical methods, we show how to generalise this visual procedure and make the best of the full set of collected profiling data.  ( 2 min )
    ScaDLES: Scalable Deep Learning over Streaming data at the Edge. (arXiv:2301.08897v1 [cs.DC])
    Distributed deep learning (DDL) training systems are designed for cloud and data-center environments that assume homogeneous compute resources, high network bandwidth, sufficient memory and storage, as well as independent and identically distributed (IID) data across all nodes. However, these assumptions don't necessarily apply on the edge, especially when training neural networks on streaming data in an online manner. Computing on the edge suffers from both systems and statistical heterogeneity. Systems heterogeneity is attributed to differences in compute resources and bandwidth specific to each device, while statistical heterogeneity comes from unbalanced and skewed data on the edge. Different streaming rates among devices can be another source of heterogeneity when dealing with streaming data. If the streaming rate is lower than the training batch size, a device needs to wait until enough samples have streamed in before performing a single iteration of stochastic gradient descent (SGD). Thus, low-volume streams act like stragglers, slowing down devices with high-volume streams in synchronous training. On the other hand, data can accumulate quickly in the buffer if the streaming rate is too high and the devices can't train at line-rate. In this paper, we introduce ScaDLES to efficiently train on streaming data at the edge in an online fashion, while also addressing the challenges of limited bandwidth and training with non-IID data. We empirically show that ScaDLES converges up to 3.29 times faster compared to conventional distributed SGD.
    Cellular Network Speech Enhancement: Removing Background and Transmission Noise. (arXiv:2301.09027v1 [cs.SD])
    The primary objective of speech enhancement is to reduce background noise while preserving the target's speech. A common dilemma occurs when a speaker is confined to a noisy environment and receives a call with high background and transmission noise. To address this problem, the Deep Noise Suppression (DNS) Challenge focuses on removing the background noise with next-generation deep learning models to enhance the target's speech; however, researchers fail to consider Voice Over IP (VoIP) applications and their transmission noise. Focusing on Google Meet and its cellular application, our work achieves state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS Challenge. This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
    Ti-MAE: Self-Supervised Masked Time Series Autoencoders. (arXiv:2301.08871v1 [cs.LG])
    Multivariate Time Series forecasting has been an increasingly popular topic in various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance in many long-term series forecasting tasks. However, there are still several issues in existing methods. First, the training paradigm of contrastive learning and downstream prediction tasks are inconsistent, leading to inaccurate prediction results. Second, existing Transformer-based models which resort to similar patterns in historical time series data for predicting future values generally induce severe distribution shift problems, and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrate distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point-level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the difference between upstream and downstream forecasting tasks while maintaining the utilization of original time series data. Experiments on several public real-world datasets demonstrate that our framework of masked autoencoding could learn strong representations directly from the raw data, yielding better performance in time series forecasting and classification tasks.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.
    Versatile Neural Processes for Learning Implicit Neural Representations. (arXiv:2301.08883v1 [cs.LG])
    Representing a signal as a continuous function parameterized by a neural network (a.k.a. Implicit Neural Representations, INRs) has attracted increasing attention in recent years. Neural Processes (NPs), which model the distributions over functions conditioned on partial observations (context set), provide a practical solution for fast inference of continuous functions. However, existing NP architectures suffer from inferior modeling capability for complex signals. In this paper, we propose an efficient NP framework dubbed Versatile Neural Processes (VNP), which largely increases the capability of approximating functions. Specifically, we introduce a bottleneck encoder that produces fewer but more informative context tokens, relieving the high computational cost while providing high modeling capability. On the decoder side, we hierarchically learn multiple global latent variables that jointly model the global structure and the uncertainty of a function, enabling our model to capture the distribution of complex signals. We demonstrate the effectiveness of the proposed VNP on a variety of tasks involving 1D, 2D and 3D signals. In particular, our method shows promise in learning accurate INRs w.r.t. a 3D scene without further finetuning.  ( 2 min )
    Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions. (arXiv:2203.12235v3 [cs.CL] UPDATED)
    The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar's category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions aimed at exploiting the distinctive structure of a supertagger's output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state of the art scores. Code will be made available at https://github.com/konstantinosKokos/dynamic-graph-supertagging  ( 2 min )
    Comparing different subgradient methods for solving convex optimization problems with functional constraints. (arXiv:2101.01045v2 [math.OC] UPDATED)
    We consider the problem of minimizing a convex, nonsmooth function subject to a closed convex constraint domain. The methods that we propose are reforms of subgradient methods based on Metel--Takeda's paper [Optimization Letters 15.4 (2021): 1491-1504] and Boyd's works [Lecture notes of EE364b, Stanford University, Spring 2013-14, pp. 1-39]. While the former has complexity $\mathcal{O}(\varepsilon^{-2r})$ for all $r> 1$, the complexity of the latter is $\mathcal{O}(\varepsilon^{-2})$. We perform some comparisons between these two methods using several test examples.  ( 2 min )
    Problem-dependent attention and effort in neural networks with application to image resolution and model selection. (arXiv:2201.01415v3 [cs.CV] UPDATED)
    This paper introduces a new ensemble-based approach to reduce the data and computation costs of accurate classification. When faced with a new test case, a low cost classifier is used first, only moving to a higher cost approach if the initial classifier does not have a high degree of confidence in its projection. This multi-stage strategy can be used with any set of classifiers and does not require additional training. The approach is first applied to reduce the amount of data required to classify test images; it is found to be effective for problems in which at least some fraction of cases can be correctly classified based upon coarser data than are typically used. For neural networks performing digit recognition, for example, the proposed approach reduces the number of bytes of data read by 60% to 85% with less than 5% reduction in accuracy. For the ImageNet data, the number of bytes read by the typical network is reduced by 20% with less than 5% reduction in accuracy -- and in some cases, the resource savings reach 40%. The second application is to reduce computational complexity, with simpler neural networks used for test cases that are easier to classify and complex networks used for more difficult cases. For classification both of digits and of ImageNet images, computation cost is reduced by as much as 82% to 89% with less than 5% reduction in accuracy. The results also show that, for situations in which computational cost is not a concern, calculating multiple models' projections and selecting the one from the most confident classifier can increase classification accuracy on ImageNet by as much as two percent over the best standalone classifier considered here.  ( 3 min )
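The multi-stage strategy described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `cascade_predict`, the use of the top class probability as the confidence score, and the 0.9 threshold are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

# A classifier maps an input to a sequence of class probabilities (assumed interface).
Classifier = Callable[[Sequence[float]], Sequence[float]]


def cascade_predict(
    classifiers: Sequence[Classifier],
    x: Sequence[float],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Try classifiers from cheapest to most expensive.

    Returns (predicted_label, index_of_stage_that_answered). A stage answers
    when its top class probability reaches `threshold`; the last stage always
    answers, so the cascade never abstains.
    """
    if not classifiers:
        raise ValueError("at least one classifier is required")
    last = len(classifiers) - 1
    for stage, clf in enumerate(classifiers):
        probs = clf(x)
        label = max(range(len(probs)), key=lambda k: probs[k])
        if probs[label] >= threshold or stage == last:
            return label, stage
    raise AssertionError("unreachable: the last stage always answers")


# Hypothetical stand-ins for a low-cost and a high-cost model.
def cheap(x: Sequence[float]) -> Sequence[float]:
    return [0.95, 0.05] if x[0] < 0 else [0.55, 0.45]


def costly(x: Sequence[float]) -> Sequence[float]:
    return [0.1, 0.9]


easy_case = cascade_predict([cheap, costly], [-1.0])  # answered by the cheap stage
hard_case = cascade_predict([cheap, costly], [1.0])   # escalated to the costly stage
```

Because the stages are only called in sequence and never retrained, a sketch like this works with any set of pre-trained classifiers, which matches the abstract's claim that no additional training is needed.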
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real-world datasets, various oracles and interpretable models with different notions of model size are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally, we show that the proposed algorithm provides the following benefits: (a) it is a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. An ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: what controls training should control pruning. This principle is expected to work both for artificial NNs and for the selection and modification of important synaptic contacts in the brain. In back-propagation artificial NNs, learning is controlled by the gradient of loss functions. Therefore, first-order sensitivity indicators are used for pruning, and the algorithms based on these indicators are reviewed. The notion of a logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NNs were produced that give different forecasting algorithms. The non-uniqueness of the solution can be utilised by creating expert panels (committees). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In conclusion, we discuss the possible future of the widely advertised XAI program.  ( 3 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    Continuous-time identification of dynamic state-space models by deep subspace encoding. (arXiv:2204.09405v2 [cs.LG] UPDATED)
    Continuous-time (CT) modeling has proven to provide improved sample efficiency and interpretability in learning the dynamical behavior of physical systems compared to discrete-time (DT) models. However, even with numerous recent developments, the CT nonlinear state-space (NL-SS) model identification problem remains to be solved in full, considering common experimental aspects such as the presence of external inputs, measurement noise, latent states, and general robustness. This paper presents a novel estimation method that addresses all these aspects and that can obtain state-of-the-art results on multiple benchmarks with compact fully connected neural networks capturing the CT dynamics. The proposed estimation method called the subspace encoder approach (SUBNET) ascertains these results by efficiently approximating the complete simulation loss by evaluating short simulations on subsections of the data, by using an encoder function to estimate the initial state for each subsection and a novel state-derivative normalization to ensure stability and good numerical conditioning of the training process. We prove that the use of subsections increases cost function smoothness together with the necessary requirements for the existence of the encoder function and we show that the proposed state-derivative normalization is essential for reliable estimation of CT NL-SS models.  ( 2 min )
    Explainable Multilayer Graph Neural Network for Cancer Gene Prediction. (arXiv:2301.08831v1 [cs.LG])
    The identification of cancer genes is a critical, yet challenging problem in cancer genomics research. Recently, several computational methods have been developed to address this issue, including deep neural networks. However, these methods fail to exploit the multilayered gene-gene interactions and provide little to no explanation for their predictions. In this study, we propose an Explainable Multilayer Graph Neural Network (EMGNN) approach to identify cancer genes, by leveraging multiple gene-gene interaction networks and multi-omics data. Compared to conventional graph learning methods, EMGNN learns complementary information in multiple graphs to accurately predict cancer genes. Our method consistently outperforms existing approaches while providing valuable biological insights into its predictions. We further release our novel cancer gene predictions and connect them with known cancer patterns, aiming to accelerate the progress of cancer research.  ( 2 min )
    Limitations of Piecewise Linearity for Efficient Robustness Certification. (arXiv:2301.08842v1 [cs.LG])
    Certified defenses against small-norm adversarial examples have received growing attention in recent years. However, certified accuracies of state-of-the-art methods remain far below those of their non-robust counterparts, despite the fact that benchmark datasets have been shown to be well-separated at far larger radii than the literature generally attempts to certify. In this work, we offer insights that identify potential factors in this performance gap. Specifically, our analysis reveals that piecewise linearity imposes fundamental limitations on the tightness of leading certification techniques. These limitations are felt in practical terms as a greater need for capacity in models hoped to be certified efficiently. Moreover, this is in addition to the capacity necessary to learn a robust boundary, studied in prior work. However, we argue that addressing the limitations of piecewise linearity through scaling up model capacity may give rise to potential difficulties -- particularly regarding robust generalization. We therefore conclude by suggesting that developing smooth activation functions may be the way forward for advancing the performance of certified neural networks.  ( 2 min )
    Dynamic MLP for MRI Reconstruction. (arXiv:2301.08868v1 [eess.IV])
    As convolutional neural networks (CNNs) have become the most successful reconstruction technique for accelerated Magnetic Resonance Imaging (MRI), they are reaching their limit on image quality, especially in sharpness. Further improvement in image quality often comes at massive computational cost, hindering practicability in the clinical setting. MRI reconstruction is essentially a deconvolution problem, which demands long-distance information that is difficult for CNNs with small convolution kernels to capture. The multi-layer perceptron (MLP) is able to model such long-distance information, but it requires a fixed input size, while the reconstruction of images at flexible resolutions is required in the clinical setting. In this paper, we propose a hybrid CNN and MLP reconstruction strategy, featuring a dynamic MLP (dMLP) that accepts arbitrary image sizes. Experiments were conducted using 3D multi-coil MRI. Our results suggest the proposed dMLP can improve image sharpness compared to its pure CNN counterpart, while costing minor additional GPU memory and computation time. We further compared the proposed dMLP with CNNs using large kernels and studied pure MLP-based reconstruction using a stack of 1D dMLPs, as well as its CNN counterpart using only 1D convolutions. We observed that the enlarged receptive field noticeably improved image quality, while simply using a CNN with a large kernel leads to difficulties in training. Notably, the pure MLP-based method was outperformed by CNN-involved methods, which matches the observations in other computer vision tasks for natural images.  ( 2 min )
    Computing equilibria by minimizing exploitability with best-response ensembles. (arXiv:2301.08830v1 [cs.GT])
    In this paper, we study the problem of computing an approximate Nash equilibrium of a continuous game. Such games naturally model many situations involving space, time, money, and other fine-grained resources or quantities. The standard measure of the closeness of a strategy profile to Nash equilibrium is exploitability, which measures how much utility players can gain from changing their strategy unilaterally. We introduce a new equilibrium-finding method that minimizes an approximation of the exploitability. This approximation employs a best-response ensemble for each player that maintains multiple candidate best responses for that player. In each iteration, the best-performing element of each ensemble is used in a gradient-based scheme to update the current strategy profile. The strategy profile and best-response ensembles are simultaneously trained to minimize and maximize the approximate exploitability, respectively. Experiments on a suite of benchmark games show that it outperforms previous methods.  ( 2 min )
    Towards Flexibility and Interpretability of Gaussian Process State-Space Model. (arXiv:2301.08843v1 [cs.LG])
    The Gaussian process state-space model (GPSSM) has attracted much attention over the past decade. However, the model representation power of GPSSM is far from satisfactory. Most GPSSM works rely on the standard Gaussian process (GP) with a preliminary kernel, such as the squared exponential (SE) kernel or the Matérn kernel, which limits the model representation power and its application in complex scenarios. To address this issue, this paper proposes a novel class of probabilistic state-space models named TGPSSM that enriches the GP priors in the standard GPSSM through parametric normalizing flow, making the state-space model more flexible and expressive. In addition, by inheriting the advantages of sparse representation of GP models, we propose a scalable and interpretable variational learning algorithm to learn the TGPSSM and infer the latent dynamics simultaneously. By integrating a constrained optimization framework and explicitly constructing a non-Gaussian state variational distribution, the proposed learning algorithm enables the TGPSSM to significantly improve the capabilities of state space representation and model inference. Experimental results based on various synthetic and real datasets corroborate that the proposed TGPSSM yields superior learning and inference performance compared to several state-of-the-art methods. The accompanying source code is available at https://github.com/zhidilin/TGPSSM.  ( 2 min )
    Split Ways: Privacy-Preserving Training of Encrypted Data Using Split Learning. (arXiv:2301.08778v1 [cs.CR])
    Split Learning (SL) is a new collaborative learning technique that allows participants, e.g. a client and a server, to train machine learning models without the client sharing raw data. In this setting, the client initially applies its part of the machine learning model on the raw data to generate activation maps and then sends them to the server to continue the training process. Previous works in the field demonstrated that reconstructing activation maps could result in privacy leakage of client data. In addition to that, existing mitigation techniques that overcome the privacy leakage of SL prove to be significantly worse in terms of accuracy. In this paper, we improve upon previous works by constructing a protocol based on U-shaped SL that can operate on homomorphically encrypted data. More precisely, in our approach, the client applies Homomorphic Encryption (HE) on the activation maps before sending them to the server, thus protecting user privacy. This is an important improvement that reduces privacy leakage in comparison to other SL-based works. Finally, our results show that, with the optimum set of parameters, training with HE data in the U-shaped SL setting only reduces accuracy by 2.65% compared to training on plaintext. In addition, raw training data privacy is preserved.  ( 2 min )
    ManyDG: Many-domain Generalization for Healthcare Applications. (arXiv:2301.08834v1 [cs.LG])
    Vast amounts of health data are continuously collected for each patient, providing opportunities to support diverse healthcare predictive tasks such as seizure detection and hospitalization prediction. Existing models are mostly trained on other patients' data and evaluated on new patients. Many of them might suffer from poor generalizability. One key reason can be overfitting due to the unique information related to patient identities and their data collection environments, referred to as patient covariates in the paper. These patient covariates usually do not contribute to predicting the targets but are often difficult to remove. As a result, they can bias the model training process and impede generalization. In healthcare applications, most existing domain generalization methods assume a small number of domains. In this paper, considering the diversity of patient covariates, we propose a new setting by treating each patient as a separate domain (leading to many domains). We develop a new domain generalization method, ManyDG, that can scale to such many-domain problems. Our method identifies the patient domain covariates by mutual reconstruction and removes them via an orthogonal projection step. Extensive experiments show that ManyDG can boost the generalization performance on multiple real-world healthcare tasks (e.g., 3.7% Jaccard improvements on MIMIC drug recommendation) and support realistic but challenging settings such as insufficient data and continuous learning.  ( 2 min )
    AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions. (arXiv:2301.08838v1 [cs.LG])
    Accurately modeling complex, multimodal distributions is necessary for optimal decision-making, but doing so for rotations in three-dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically-restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.  ( 2 min )
    Compact Optimization Learning for AC Optimal Power Flow. (arXiv:2301.08840v1 [cs.LG])
    This paper reconsiders end-to-end learning approaches to the Optimal Power Flow (OPF). Existing methods, which learn the input/output mapping of the OPF, suffer from scalability issues due to the high dimensionality of the output space. This paper first shows that the space of optimal solutions can be significantly compressed using principal component analysis (PCA). It then proposes Compact Learning, a new method that learns in a subspace of the principal components before translating the vectors into the original output space. This compression reduces the number of trainable parameters substantially, improving scalability and effectiveness. Compact Learning is evaluated on a variety of test cases from the PGLib with up to 30,000 buses. The paper also shows that the output of Compact Learning can be used to warm-start an exact AC solver to restore feasibility, while bringing significant speed-ups.  ( 2 min )
    Optimized learned entropy coding parameters for practical neural-based image and video compression. (arXiv:2301.08752v1 [eess.IV])
    Neural-based image and video codecs are significantly more power-efficient when weights and activations are quantized to low-precision integers. While there are general-purpose techniques for reducing quantization effects, large losses can occur when specific entropy coding properties are not considered. This work analyzes how entropy coding is affected by parameter quantizations, and provides a method to minimize losses. It is shown that, by using a certain type of coding parameters to be learned, uniform quantization becomes practically optimal, also simplifying the minimization of code memory requirements. The mathematical properties of the new representation are presented, and its effectiveness is demonstrated by coding experiments, showing that good results can be obtained with precision as low as 4 bits per network output, and practically no loss with 8 bits.  ( 2 min )
    GBOSE: Generalized Bandit Orthogonalized Semiparametric Estimation. (arXiv:2301.08781v1 [cs.LG])
    In sequential decision-making scenarios, e.g., mobile health, recommendation systems, and revenue management, contextual multi-armed bandit algorithms have garnered attention for their performance. However, most existing algorithms are built on the assumption of a strictly parametric reward model, mostly linear in nature. In this work, we propose a new algorithm with a semi-parametric reward model that attains a state-of-the-art upper bound on regret among existing semi-parametric algorithms. Our work expands the scope of another representative algorithm of state-of-the-art complexity with a similar reward model: our algorithm is built upon the same action filtering procedures, but provides an explicit action selection distribution for scenarios involving more than two arms at a particular time step while requiring fewer computations. We derive this upper bound on regret and present simulation results that affirm our method's superiority over all prevalent semi-parametric bandit algorithms for cases involving more than two arms.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    Estimation of mitral valve hinge point coordinates -- deep neural net for echocardiogram segmentation. (arXiv:2301.08782v1 [eess.IV])
    Cardiac image segmentation is a powerful tool for the diagnosis and treatment of cardiovascular diseases. Purely feature-based detection of anatomical structures like the mitral valve is a laborious task due to the specific feature engineering it requires, and is especially challenging in echocardiograms because of their inherently low contrast and blurry boundaries between some anatomical structures. With the publication of further annotated medical datasets and the increase in GPU processing power, deep learning-based methods in medical image segmentation have become more feasible in recent years. We propose a fully automatic detection method for mitral valve hinge points, which uses a U-Net based deep neural net to segment cardiac chambers in echocardiograms in a first step, and subsequently extracts the mitral valve hinge points from the resulting segmentations in a second step. Results measured with this automatic detection method were compared to reference coordinate values, with median absolute hinge point coordinate errors of 1.35 mm for the x-coordinate (15-85 percentile range: [0.3 mm; 3.15 mm]) and 0.75 mm for the y-coordinate (15-85 percentile range: [0.15 mm; 1.88 mm]).  ( 2 min )
    Domain-agnostic and Multi-level Evaluation of Generative Models. (arXiv:2301.08750v1 [cs.LG])
    While the capabilities of generative models heavily improved in different domains (images, text, graphs, molecules, etc.), their evaluation metrics largely remain based on simplified quantities or manual inspection with limited practicality. To this end, we propose a framework for Multi-level Performance Evaluation of Generative mOdels (MPEGO), which could be employed across different domains. MPEGO aims to quantify generation performance hierarchically, starting from a sub-feature-based low-level evaluation to a global features-based high-level evaluation. MPEGO offers great customizability as the employed features are entirely user-driven and can thus be highly domain/problem-specific while being arbitrarily complex (e.g., outcomes of experimental procedures). We validate MPEGO using multiple generative models across several datasets from the material discovery domain. An ablation study is conducted to study the plausibility of intermediate steps in MPEGO. Results demonstrate that MPEGO provides a flexible, user-driven, and multi-level evaluation framework, with practical insights on the generation quality. The framework, source code, and experiments will be available at https://github.com/GT4SD/mpego.  ( 2 min )
    Towards Understanding How Self-training Tolerates Data Backdoor Poisoning. (arXiv:2301.08751v1 [cs.LG])
    Recent studies on backdoor attacks in model training have shown that polluting a small portion of training data is sufficient to produce incorrect, manipulated predictions on poisoned test-time data while maintaining high clean accuracy in downstream tasks. The stealthiness of backdoor attacks has imposed tremendous defense challenges in today's machine learning paradigm. In this paper, we explore the potential of self-training via additional unlabeled data for mitigating backdoor attacks. We begin with a pilot study showing that vanilla self-training is not effective in backdoor mitigation. Spurred by that, we propose to defend against backdoor attacks by leveraging strong but proper data augmentations in the self-training pseudo-labeling stage. We find that the new self-training regime helps defend against backdoor attacks to a great extent. Its effectiveness is demonstrated through experiments for different backdoor triggers on CIFAR-10 and a combination of CIFAR-10 with an additional unlabeled 500K TinyImages dataset. Finally, we explore the direction of combining self-supervised representation learning with self-training for further improvement in backdoor defense.  ( 2 min )
    CSwin2SR: Circular Swin2SR for Compressed Image Super-Resolution. (arXiv:2301.08749v1 [eess.IV])
    Closed-loop negative feedback mechanisms are extensively utilized in automatic control systems and bring about extraordinary dynamic and static performance. In order to further improve the reconstruction capability of current methods of compressed image super-resolution, a circular Swin2SR (CSwin2SR) approach is proposed. The CSwin2SR contains a serial Swin2SR for initial super-resolution reestablishment and a circular Swin2SR for enhanced super-resolution reestablishment. Simulated experimental results show that the proposed CSwin2SR dramatically outperforms the classical Swin2SR in the capacity of super-resolution recovery. On the DIV2K test and valid datasets, the average increment of PSNR is greater than 1 dB and the related average increment of SSIM is greater than 0.006.  ( 2 min )
    Towards a Measure of Trustworthiness to Evaluate CNNs During Operation. (arXiv:2301.08839v1 [cs.LG])
    Due to the black-box nature of convolutional neural networks (CNNs), the continuous validation of CNN classifiers during operation is infeasible. This makes it difficult for developers or regulators to gain confidence in the deployment of autonomous systems employing CNNs. We introduce the trustworthiness in classification score (TCS), a metric to assist with overcoming this challenge. The metric quantifies the trustworthiness in a prediction by checking for the existence of certain features in the predictions made by the CNN. A case study on person detection is used to demonstrate our method and the usage of TCS.  ( 2 min )
    Causal Inference under Data Restrictions. (arXiv:2301.08788v1 [stat.ME])
    This dissertation focuses on modern causal inference under uncertainty and data restrictions, with applications to neoadjuvant clinical trials, distributed data networks, and robust individualized decision making. In the first project, we propose a method under the principal stratification framework to identify and estimate the average treatment effects on a binary outcome, conditional on the counterfactual status of a post-treatment intermediate response. Under mild assumptions, the treatment effect of interest can be identified. We extend the approach to address censored outcome data. The proposed method is applied to a neoadjuvant clinical trial and its performance is evaluated via simulation studies. In the second project, we propose a tree-based model averaging approach to improve the estimation accuracy of conditional average treatment effects at a target site by leveraging models derived from other potentially heterogeneous sites, without them sharing subject-level data. The performance of this approach is demonstrated by a study of the causal effects of oxygen therapy on hospital survival rates and backed up by comprehensive simulations. In the third project, we propose a robust individualized decision learning framework with sensitive variables to improve the worst-case outcomes of individuals caused by sensitive variables that are unavailable at the time of decision. Unlike most existing work that uses mean-optimal objectives, we propose a robust learning framework by finding a newly defined quantile- or infimum-optimal decision rule. From a causal perspective, we also generalize the classic notion of (average) fairness to conditional fairness for individual subjects. The reliable performance of the proposed method is demonstrated through synthetic experiments and three real-data applications.  ( 2 min )
    An Automated Vulnerability Detection Framework for Smart Contracts. (arXiv:2301.08824v1 [cs.CR])
    With the increasing adoption of blockchain technology in providing decentralized solutions to various problems, smart contracts have become more popular, to the point that billions of US dollars are currently exchanged every day through such technology. Meanwhile, various vulnerabilities in smart contracts have been exploited by attackers to steal cryptocurrencies worth millions of dollars. The automatic detection of smart contract vulnerabilities is therefore an essential research problem. Existing solutions to this problem particularly rely on human experts to define features or different rules to detect vulnerabilities. However, this often causes many vulnerabilities to be ignored, and such solutions are inefficient in detecting new vulnerabilities. In this study, to overcome such challenges, we propose a framework to automatically detect vulnerabilities in smart contracts on the blockchain. More specifically, first, we utilize novel feature vector generation techniques from the bytecode of smart contracts, since the source code of smart contracts is rarely publicly available. Next, the collected vectors are fed into our novel metric learning-based deep neural network (DNN) to get the detection result. We conduct comprehensive experiments on large-scale benchmarks, and the quantitative results demonstrate the effectiveness and efficiency of our approach.  ( 2 min )
  • Open

    Probabilistic Surrogate Networks for Simulators with Unbounded Randomness. (arXiv:1910.11950v3 [cs.LG] UPDATED)
    We present a framework for automatically structuring and training fast, approximate, deep neural surrogates of stochastic simulators. Unlike traditional approaches to surrogate modeling, our surrogates retain the interpretable structure and control flow of the reference simulator. Our surrogates target stochastic simulators where the number of random variables itself can be stochastic and potentially unbounded. Our framework further enables an automatic replacement of the reference simulator with the surrogate when undertaking amortized inference. The fidelity and speed of our surrogates allow for both faster stochastic simulation and accurate and substantially faster posterior inference. Using an illustrative yet non-trivial example we show our surrogates' ability to accurately model a probabilistic program with an unbounded number of random variables. We then proceed with an example that shows our surrogates are able to accurately model a complex structure like an unbounded stack in a program synthesis example. We further demonstrate how our surrogate modeling technique makes amortized inference in complex black-box simulators an order of magnitude faster. Specifically, we do simulator-based materials quality testing, inferring safety-critical latent internal temperature profiles of composite materials undergoing curing.  ( 2 min )
    Tailoring to the Tails: Risk Measures for Fine-Grained Tail Sensitivity. (arXiv:2208.03066v2 [cs.LG] UPDATED)
    Expected risk minimization (ERM) is at the core of many machine learning systems. This means that the risk inherent in a loss distribution is summarized using a single number - its average. In this paper, we propose a general approach to construct risk measures which exhibit a desired tail sensitivity and may replace the expectation operator in ERM. Our method relies on the specification of a reference distribution with a desired tail behaviour, which is in a one-to-one correspondence to a coherent upper probability. Any risk measure, which is compatible with this upper probability, displays a tail sensitivity which is finely tuned to the reference distribution. As a concrete example, we focus on divergence risk measures based on f-divergence ambiguity sets, which are a widespread tool used to foster distributional robustness of machine learning systems. For instance, we show how ambiguity sets based on the Kullback-Leibler divergence are intricately tied to the class of subexponential random variables. We elaborate the connection of divergence risk measures and rearrangement invariant Banach norms.  ( 2 min )
    Estimating individual treatment effects under unobserved confounding using binary instruments. (arXiv:2208.08544v3 [stat.ME] UPDATED)
    Estimating conditional average treatment effects (CATEs) from observational data is relevant in many fields such as personalized medicine. However, in practice, the treatment assignment is usually confounded by unobserved variables and thus introduces bias. A remedy to remove the bias is the use of instrumental variables (IVs). Such settings are widespread in medicine (e.g., trials where the treatment assignment is used as a binary IV). In this paper, we propose a novel, multiply robust machine learning framework, called MRIV, for estimating CATEs using binary IVs, thus yielding an unbiased CATE estimator. Different from previous work for binary IVs, our framework estimates the CATE directly via a pseudo outcome regression. (1) We provide a theoretical analysis where we show that our framework yields multiply robust convergence rates: our CATE estimator achieves fast convergence even if several nuisance estimators converge slowly. (2) We further show that our framework asymptotically outperforms state-of-the-art plug-in IV methods for CATE estimation, in the sense that it achieves a faster rate of convergence if the CATE is smoother than the individual outcome surfaces. (3) We build upon our theoretical results and propose a tailored deep neural network architecture called MRIV-Net for CATE estimation using binary IVs. Across various computational experiments, we demonstrate empirically that our MRIV-Net achieves state-of-the-art performance. To the best of our knowledge, our MRIV is the first multiply robust machine learning framework tailored to estimating CATEs in the binary IV setting.  ( 2 min )
    A Multi-Phase Approach for Product Hierarchy Forecasting in Supply Chain Management: Application to MonarchFx Inc. (arXiv:2006.08931v2 [stat.ML] UPDATED)
    Hierarchical time series demands exist in many industries and are often associated with the product, time frame, or geographic aggregations. Traditionally, these hierarchies have been forecasted using top-down, bottom-up, or middle-out approaches. The question we aim to answer is how to utilize child-level forecasts to improve parent-level forecasts in a hierarchical supply chain. Improved forecasts can be used to considerably reduce logistics costs, especially in e-commerce. We propose a novel multi-phase hierarchical (MPH) approach. Our method involves forecasting each series in the hierarchy independently using machine learning models, then combining all forecasts to allow a second phase model estimation at the parent level. Sales data from MonarchFx Inc. (a logistics solutions provider) is used to evaluate our approach and compare it to bottom-up and top-down methods. Our results demonstrate an 82-90% improvement in forecast accuracy using the proposed approach. Using the proposed method, supply chain planners can derive more accurate forecasting models to exploit the benefit of multivariate data.  ( 2 min )
    Autoencoding Hyperbolic Representation for Adversarial Generation. (arXiv:2201.12825v3 [cs.LG] UPDATED)
    With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains three parts: first, a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data; second, a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise; third, a generator that inherits the decoder from the AE and the generator from the GAN. We call this network the hyperbolic AE-GAN, or HAEGAN for short. The architecture of HAEGAN fosters expressive representation in the hyperbolic space, and the specific design of layers ensures numerical stability. Experiments show that HAEGAN is able to generate complex data with state-of-the-art structure-related performance.  ( 2 min )
    Online Kernel Sliced Inverse Regression. (arXiv:2301.09516v1 [stat.CO])
    Online dimension reduction is a common method for high-dimensional streaming data processing. Online principal component analysis, online sliced inverse regression, online kernel principal component analysis and other methods have been studied in depth, but as far as we know, online supervised nonlinear dimension reduction methods have not been fully studied. In this article, an online kernel sliced inverse regression method is proposed. By introducing the approximate linear dependence condition and dictionary variable sets, we address the problem of increasing variable dimensions with the sample size in the online kernel sliced inverse regression method, and propose a reduced-order method for updating variables online. We then transform the problem into an online generalized eigen-decomposition problem, and use the stochastic optimization method to update the centered dimension reduction directions. Simulations and the real data analysis show that our method can achieve close performance to batch processing kernel sliced inverse regression.  ( 2 min )
    Convergence bounds for local least squares approximation. (arXiv:2208.10954v2 [math.NA] UPDATED)
    We consider the problem of approximating a function in a general nonlinear subset of $L^2$, when only a weighted Monte Carlo estimate of the $L^2$-norm can be computed. Of particular interest in this setting is the concept of sample complexity, the number of sample points that are necessary to achieve a prescribed error with high probability. Reasonable worst-case bounds for this quantity exist only for particular model classes, like linear spaces or sets of sparse vectors. For more general sets, like tensor networks or neural networks, the currently existing bounds are very pessimistic. By restricting the model class to a neighbourhood of the best approximation, we can derive improved worst-case bounds for the sample complexity. When the considered neighbourhood is a manifold with positive local reach, its sample complexity can be estimated by means of the sample complexities of the tangent and normal spaces and the manifold's curvature.  ( 2 min )
    Characterizing Polarization in Social Networks using the Signed Relational Latent Distance Model. (arXiv:2301.09507v1 [stat.ML])
    Graph representation learning has become a prominent tool for the characterization and understanding of the structure of networks in general and social networks in particular. Typically, these representation learning approaches embed the networks into a low-dimensional space in which the role of each individual can be characterized in terms of their latent position. A major current concern in social networks is the emergence of polarization and filter bubbles promoting a mindset of "us-versus-them" that may be defined by extreme positions believed to ultimately lead to political violence and the erosion of democracy. Such polarized networks are typically characterized in terms of signed links reflecting likes and dislikes. We propose the latent Signed relational Latent dIstance Model (SLIM) utilizing for the first time the Skellam distribution as a likelihood function for signed networks and extend the modeling to the characterization of distinct extreme positions by constraining the embedding space to polytopes. On four real social signed networks of polarization, we demonstrate that the model extracts low-dimensional characterizations that well predict friendships and animosity while providing interpretable visualizations defined by extreme positions when endowing the model with an embedding space restricted to polytopes.  ( 2 min )
    Max-Quantile Grouped Infinite-Arm Bandits. (arXiv:2210.01295v2 [stat.ML] UPDATED)
    In this paper, we consider a bandit problem in which there are a number of groups each consisting of infinitely many arms. Whenever a new arm is requested from a given group, its mean reward is drawn from an unknown reservoir distribution (different for each group), and the uncertainty in the arm's mean reward can only be reduced via subsequent pulls of the arm. The goal is to identify the infinite-arm group whose reservoir distribution has the highest $(1-\alpha)$-quantile (e.g., median if $\alpha = \frac{1}{2}$), using as few total arm pulls as possible. We introduce a two-step algorithm that first requests a fixed number of arms from each group and then runs a finite-arm grouped max-quantile bandit algorithm. We characterize both the instance-dependent and worst-case regret, and provide a matching lower bound for the latter, while discussing various strengths, weaknesses, algorithmic improvements, and potential lower bounds associated with our instance-dependent upper bounds.  ( 2 min )
    On the Convergence of the Gradient Descent Method with Stochastic Fixed-point Rounding Errors under the Polyak-Lojasiewicz Inequality. (arXiv:2301.09511v1 [stat.ML])
    When training neural networks with low-precision computation, rounding errors often cause stagnation or are detrimental to the convergence of the optimizers; in this paper we study the influence of rounding errors on the convergence of the gradient descent method for problems satisfying the Polyak-Lojasiewicz inequality. Within this context, we show that, in contrast, biased stochastic rounding errors may be beneficial since choosing a proper rounding strategy eliminates the vanishing gradient problem and forces the rounding bias in a descent direction. Furthermore, we obtain a bound on the convergence rate that is stricter than the one achieved by unbiased stochastic rounding. The theoretical analysis is validated by comparing the performances of various rounding strategies when optimizing several examples using low-precision fixed-point number formats.  ( 2 min )
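    The stagnation effect mentioned above can be illustrated with a small sketch. It compares round-to-nearest with plain unbiased stochastic rounding on a fixed-point grid; it does not reproduce the biased rounding strategies studied in the paper. The function names, the objective $f(x) = x^2/2$, the step size, and the 4 fractional bits are all assumptions chosen for illustration.

```python
import math
import random


def round_to_nearest(x, frac_bits):
    """Deterministically round x to the fixed-point grid with spacing 2**-frac_bits."""
    scale = 2 ** frac_bits
    return math.floor(x * scale + 0.5) / scale


def stochastic_round(x, frac_bits, rng):
    """Unbiased stochastic rounding: round up with probability equal to the
    fractional distance from the lower grid point, so the expectation is x."""
    scale = 2 ** frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    p_up = scaled - lower  # lies in [0, 1)
    return (lower + (1 if rng.random() < p_up else 0)) / scale


def gradient_descent(grad, x0, lr, steps, rounder):
    """Gradient descent in which every iterate is rounded back onto the grid."""
    x = x0
    for _ in range(steps):
        x = rounder(x - lr * grad(x))
    return x


def grad(x):
    # Gradient of the illustrative objective f(x) = x**2 / 2.
    return x


rng = random.Random(0)
x_rn = gradient_descent(grad, 1.0, 0.01, 2000, lambda v: round_to_nearest(v, 4))
x_sr = gradient_descent(grad, 1.0, 0.01, 2000, lambda v: stochastic_round(v, 4, rng))
print("round-to-nearest:", x_rn)
print("stochastic rounding:", x_sr)
```

    In this toy setting each update is smaller than half the grid spacing, so round-to-nearest never leaves the starting point, while stochastic rounding still makes progress on average.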
    Huber-Robust Confidence Sequences. (arXiv:2301.09573v1 [math.ST])
    Confidence sequences are confidence intervals that can be sequentially tracked, and are valid at arbitrary data-dependent stopping times. This paper presents confidence sequences for a univariate mean of an unknown distribution with a known upper bound on the p-th central moment (p > 1), but allowing for (at most) an $\epsilon$ fraction of arbitrary distribution corruption, as in Huber's contamination model. We do this by designing new robust exponential supermartingales, and show that the resulting confidence sequences attain the optimal width achieved in the nonsequential setting. Perhaps surprisingly, the constant margin between our sequential result and the lower bound is smaller than even fixed-time robust confidence intervals based on the trimmed mean, for example. Since confidence sequences are a common tool used within A/B/n testing and bandits, these results open the door to sequential experimentation that is robust to outliers and adversarial corruptions.  ( 2 min )
    Indirect Active Learning. (arXiv:2206.01454v3 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing. (arXiv:2206.12832v3 [stat.ML] UPDATED)
    We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of model parameters is sufficiently small; however, there is a discrepancy between them in particular in the parameter region where the number of model parameters is larger than the data dimension. In this paper, we review the prediction errors and corresponding estimators, and discuss their differences. In the framework of GAMP, we show that the information criteria can be expressed by using the variance of the estimates. Further, we demonstrate how to approach LOOCV error from the information criteria by utilizing the expression provided by GAMP.  ( 2 min )
    Dealing with Unknown Variances in Best-Arm Identification. (arXiv:2210.00974v2 [stat.ML] UPDATED)
    The problem of identifying the best arm among a collection of items with Gaussian reward distributions is well understood when the variances are known. Despite its practical relevance for many applications, few works have studied it for unknown variances. In this paper we introduce and analyze two approaches to deal with unknown variances, either by plugging in the empirical variance or by adapting the transportation costs. In order to calibrate our two stopping rules, we derive new time-uniform concentration inequalities, which are of independent interest. Then, we illustrate the theoretical and empirical performances of our two sampling rule wrappers on Track-and-Stop and on a Top Two algorithm. Moreover, by quantifying the impact on the sample complexity of not knowing the variances, we reveal that it is rather small.  ( 2 min )
    Evaluating Synthetically Generated Data from Small Sample Sizes: An Experimental Study. (arXiv:2211.10760v3 [cs.LG] UPDATED)
    In this paper, we propose a method for measuring the similarity of low-sample tabular data to synthetically generated data with a larger number of samples than the original. This process is also known as data augmentation. However, significance levels obtained from non-parametric tests are suspect when the sample size is small. Our method uses a combination of geometry, topology and robust statistics for hypothesis testing in order to compare the validity of generated data. We also compare the results with common global metric methods available in the literature for large sample size data.  ( 2 min )
    Explicit Regularization in Overparametrized Models via Noise Injection. (arXiv:2206.04613v3 [cs.LG] UPDATED)
    Injecting noise within gradient descent has several desirable features, such as smoothing and regularizing properties. In this paper, we investigate the effects of injecting noise before computing a gradient step. We demonstrate that small perturbations can induce explicit regularization for simple models based on the L1-norm, group L1-norms, or nuclear norms. However, when applied to overparametrized neural networks with large widths, we show that the same perturbations can cause variance explosion. To overcome this, we propose using independent layer-wise perturbations, which provably allow for explicit regularization without variance explosion. Our empirical results show that these small perturbations lead to improved generalization performance compared to vanilla gradient descent.  ( 2 min )
    Rethinking the Expressive Power of GNNs via Graph Biconnectivity. (arXiv:2301.09505v1 [cs.LG])
    Designing expressive Graph Neural Networks (GNNs) is a central topic in learning graph-structured data. While numerous approaches have been proposed to improve GNNs in terms of the Weisfeiler-Lehman (WL) test, generally there is still a lack of deep understanding of what additional power they can systematically and provably gain. In this paper, we take a fundamentally different perspective to study the expressive power of GNNs beyond the WL test. Specifically, we introduce a novel class of expressivity metrics via graph biconnectivity and highlight their importance in both theory and practice. As biconnectivity can be easily calculated using simple algorithms that have linear computational costs, it is natural to expect that popular GNNs can learn it easily as well. However, after a thorough review of prior GNN architectures, we surprisingly find that most of them are not expressive for any of these metrics. The only exception is the ESAN framework (Bevilacqua et al., 2022), for which we give a theoretical justification of its power. We proceed to introduce a principled and more efficient approach, called the Generalized Distance Weisfeiler-Lehman (GD-WL), which is provably expressive for all biconnectivity metrics. Practically, we show GD-WL can be implemented by a Transformer-like architecture that preserves expressiveness and enjoys full parallelizability. A set of experiments on both synthetic and real datasets demonstrates that our approach can consistently outperform prior GNN architectures.  ( 2 min )
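    The abstract notes that biconnectivity can be computed by simple linear-time algorithms. A minimal sketch of one such classical algorithm, the depth-first-search low-link method for finding cut vertices (articulation points), is shown below. It is a textbook routine rather than anything taken from the paper; the function name and the adjacency-dictionary input format are assumptions, and the graph is assumed to be simple and undirected.

```python
def articulation_points(adj):
    """Return the set of cut vertices of a simple undirected graph.

    adj maps each vertex to an iterable of its neighbours. The search is
    recursive, so it is only suitable for small graphs as written.
    """
    disc = {}  # discovery time of each visited vertex
    low = {}   # lowest discovery time reachable from the vertex's DFS subtree
    cut = set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if the subtree of v cannot
                # reach any vertex discovered before u.
                if parent is not None and low[v] >= disc[u]:
                    cut.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        # The DFS root is a cut vertex iff it has more than one DFS child.
        if parent is None and children > 1:
            cut.add(u)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return cut


# Two triangles sharing vertex 2: removing vertex 2 disconnects the graph.
bowtie = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
print(articulation_points(bowtie))
```

    Each vertex and edge is visited a constant number of times, which is where the linear cost comes from.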
    Sampling-based Nyström Approximation and Kernel Quadrature. (arXiv:2301.09517v1 [math.NA])
    We analyze the Nyström approximation of a positive definite kernel associated with a probability measure. We first prove an improved error bound for the conventional Nyström approximation with i.i.d. sampling and singular-value decomposition in the continuous regime; the proof techniques are borrowed from statistical learning theory. We further introduce a refined selection of subspaces in Nyström approximation with theoretical guarantees that is applicable to non-i.i.d. landmark points. Finally, we discuss their application to convex kernel quadrature and give novel theoretical guarantees as well as numerical observations.  ( 2 min )
    Learning Interpretable Models Using an Oracle. (arXiv:1906.06852v5 [cs.LG] UPDATED)
    We look at a specific aspect of model interpretability: models often need to be constrained in size for them to be considered interpretable. But smaller models also tend to have high bias. This suggests a trade-off between interpretability and accuracy. Our work addresses this by: (a) showing that learning a training distribution (often different from the test distribution) can often increase accuracy of small models, and therefore may be used as a strategy to compensate for small sizes, and (b) providing a model-agnostic algorithm to learn such training distributions. We pose the distribution learning problem as one of optimizing parameters for an Infinite Beta Mixture Model based on a Dirichlet Process, so that the held-out accuracy of a model trained on a sample from this distribution is maximized. To make computation tractable, we project the training data onto one dimension: prediction uncertainty scores as provided by a highly accurate oracle model. A Bayesian Optimizer is used for learning the parameters. Empirical results using multiple real-world datasets, various oracles and interpretable models with different notions of model sizes, are presented. We observe significant relative improvements in the F1-score in most cases, occasionally seeing improvements greater than 100% over baselines. Additionally we show that the proposed algorithm provides the following benefits: (a) it is a framework which allows for flexibility in implementation, (b) it can be used across feature spaces, e.g., the text classification accuracy of a Decision Tree using character n-grams is shown to improve when using a Gated Recurrent Unit as an oracle, which uses a sequence of characters as its input, (c) it can be used to train models that have a non-differentiable training loss, e.g., Decision Trees, and (d) reasonable defaults exist for most parameters of the algorithm, which makes it convenient to use.  ( 3 min )
    Critic Sequential Monte Carlo. (arXiv:2205.15460v2 [stat.ML] UPDATED)
    We introduce CriticSMC, a new algorithm for planning as inference built from a composition of sequential Monte Carlo with learned Soft-Q function heuristic factors. These heuristic factors, obtained from parametric approximations of the marginal likelihood ahead, more effectively guide SMC towards the desired target distribution, which is particularly helpful for planning in environments with hard constraints placed sparsely in time. Compared with previous work, we modify the placement of such heuristic factors, which allows us to cheaply propose and evaluate large numbers of putative action particles, greatly increasing inference and planning efficiency. CriticSMC is compatible with informative priors, whose density function need not be known, and can be used as a model-free control algorithm. Our experiments on collision avoidance in a high-dimensional simulated driving task show that CriticSMC significantly reduces collision rates at a low computational cost while maintaining realism and diversity of driving behaviors across vehicles and environment scenarios.  ( 2 min )
    SpArX: Sparse Argumentative Explanations for Neural Networks. (arXiv:2301.09559v1 [cs.AI])
    Neural networks (NNs) have various applications in AI, but explaining their decision process remains challenging. Existing approaches often focus on explaining how changing individual inputs affects NNs' outputs. However, an explanation that is consistent with the input-output behaviour of an NN is not necessarily faithful to the actual mechanics thereof. In this paper, we exploit relationships between multi-layer perceptrons (MLPs) and quantitative argumentation frameworks (QAFs) to create argumentative explanations for the mechanics of MLPs. Our SpArX method first sparsifies the MLP while maintaining as much of the original mechanics as possible. It then translates the sparse MLP into an equivalent QAF to shed light on the underlying decision process of the MLP, producing global and/or local explanations. We demonstrate experimentally that SpArX can give more faithful explanations than existing approaches, while simultaneously providing deeper insights into the actual reasoning process of MLPs.  ( 2 min )
    Particle algorithms for maximum likelihood training of latent variable models. (arXiv:2204.12965v4 [stat.CO] UPDATED)
    Neal and Hinton (1998) recast maximum likelihood estimation of any given latent variable model as the minimization of a free energy functional $F$, and the EM algorithm as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale to high-dimensional settings and perform well in numerical experiments.  ( 2 min )
    Stability of Image-Reconstruction Algorithms. (arXiv:2206.07128v3 [math.OC] UPDATED)
    Robustness and stability of image-reconstruction algorithms have recently come under scrutiny. Their importance to medical imaging cannot be overstated. We review the known results for the topical variational regularization strategies ($\ell_2$ and $\ell_1$ regularization) and present novel stability results for $\ell_p$-regularized linear inverse problems for $p\in(1,\infty)$. Our results guarantee Lipschitz continuity for small $p$ and Hölder continuity for larger $p$. They generalize well to the $L_p(\Omega)$ function spaces.  ( 2 min )
    Pruning coupled with learning, ensembles of minimal neural networks, and future of XAI. (arXiv:2005.06284v3 [cs.LG] UPDATED)
    Pruning coupled with learning aims to optimize the neural network (NN) structure for solving specific problems. This optimization can be used for various purposes: to prevent overfitting, to save resources for implementation and training, to provide explainability of the trained NN, and many others. The minimal structure that cannot be pruned further is not unique. An ensemble of minimal structures can be used as a committee of intellectual agents that solves problems by voting. Each minimal NN presents "empirical knowledge" about the problem and can be verbalized. The non-uniqueness of such knowledge extracted from data is an important property of data-driven Artificial Intelligence (AI). In this work, we review an approach to pruning based on the principle: what controls training should control pruning. This principle is expected to work both for artificial NNs and for the selection and modification of important synaptic contacts in the brain. In back-propagation artificial NNs, learning is controlled by the gradient of the loss function. Therefore, first-order sensitivity indicators are used for pruning, and the algorithms based on these indicators are reviewed. The notion of a logically transparent NN was introduced. The approach was illustrated on the problem of political forecasting: predicting the results of the US presidential election. Eight minimal NNs were produced that give different forecasting algorithms. The non-uniqueness of the solution can be utilised by creating expert panels (committees). Another use of NN pluralism is to identify areas of input signals where further data collection is most useful. In conclusion, we discuss the possible future of the widely advertised XAI program.  ( 3 min )
    Estimating average causal effects from patient trajectories. (arXiv:2203.01228v2 [stat.ML] UPDATED)
    In medical practice, treatments are selected based on the expected causal effects on patient outcomes. Here, the gold standard for estimating causal effects is the randomized controlled trial; however, such trials are costly and sometimes even unethical. Instead, medical practice is increasingly interested in estimating causal effects among patient (sub)groups from electronic health records, that is, observational data. In this paper, we aim at estimating the average causal effect (ACE) from observational data (patient trajectories) that are collected over time. For this, we propose DeepACE: an end-to-end deep learning model. DeepACE leverages the iterative G-computation formula to adjust for the bias induced by time-varying confounders. Moreover, we develop a novel sequential targeting procedure which ensures that DeepACE has favorable theoretical properties, i.e., is doubly robust and asymptotically efficient. To the best of our knowledge, this is the first work that proposes an end-to-end deep learning model tailored for estimating time-varying ACEs. We compare DeepACE in an extensive number of experiments, confirming that it achieves state-of-the-art performance. We further provide a case study for patients suffering from low back pain to demonstrate that DeepACE generates important and meaningful findings for clinical practice. Our work enables practitioners to develop effective treatment recommendations based on population effects.  ( 2 min )
    Discriminative Multimodal Learning via Conditional Priors in Generative Models. (arXiv:2110.04616v3 [cs.LG] UPDATED)
    Deep generative models with latent variables have been used lately to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other and representations can fail to embed information on the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but where some modalities and labels required for downstream tasks are missing. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities. To counteract these problems, we introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of our proposed model: empirical results show that it achieves state-of-the-art results in representative problems such as downstream classification, acoustic inversion, and image and annotation generation.  ( 2 min )
    Explainable Quantum Machine Learning. (arXiv:2301.09138v1 [quant-ph])
    Methods of artificial intelligence (AI) and especially machine learning (ML) have been growing ever more complex, and at the same time have more and more impact on people's lives. This leads to explainable AI (XAI) manifesting itself as an important research field that helps humans to better comprehend ML systems. In parallel, quantum machine learning (QML) is emerging with the ongoing improvement of quantum computing hardware combined with its increasing availability via cloud services. QML enables quantum-enhanced ML in which quantum mechanics is exploited to facilitate ML tasks, typically in the form of quantum-classical hybrid algorithms that combine quantum and classical resources. Quantum gates constitute the building blocks of gate-based quantum hardware and form circuits that can be used for quantum computations. For QML applications, quantum circuits are typically parameterized and their parameters are optimized classically such that a suitably defined objective function is minimized. Inspired by XAI, we raise the question of explainability of such circuits by quantifying the importance of (groups of) gates for specific goals. To this end, we transfer and adapt the well-established concept of Shapley values to the quantum realm. The resulting attributions can be interpreted as explanations for why a specific circuit works well for a given task, improving the understanding of how to construct parameterized (or variational) quantum circuits, and fostering their human interpretability in general. An experimental evaluation on simulators and two superconducting quantum hardware devices demonstrates the benefits of the proposed framework for classification, generative modeling, transpilation, and optimization. Furthermore, our results shed some light on the role of specific gates in popular QML approaches.  ( 2 min )
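    As a rough illustration of the Shapley-value idea referred to above, the sketch below computes exact Shapley values by averaging each player's marginal contribution over all orderings. It is the generic game-theoretic definition, not the paper's quantum adaptation. The gate names and the value function `toy_value` are hypothetical placeholders; in practice the value of a gate subset would come from evaluating a circuit.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values via enumeration of all player orderings.

    value maps a frozenset of players to a number. The cost grows
    factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_value(gates):
    # Hypothetical score: H and CNOT are only useful together, RZ helps a little.
    score = 1.0 if {"H", "CNOT"} <= gates else 0.0
    if "RZ" in gates:
        score += 0.2
    return score


attributions = shapley_values(["H", "CNOT", "RZ"], toy_value)
print(attributions)
```

    In this toy game the two interacting gates split their joint contribution equally, and the attributions sum to the value of the full gate set.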
    How to Measure Evidence: Bayes Factors or Relative Belief Ratios?. (arXiv:2301.08994v1 [math.ST])
    Both the Bayes factor and the relative belief ratio satisfy the principle of evidence and so can be seen to be valid measures of statistical evidence. The question then is: which of these measures of evidence is more appropriate? Certainly Bayes factors are commonly used. It is argued here that there are questions concerning the validity of a current commonly used definition of the Bayes factor and, when all is considered, the relative belief ratio is a much more appropriate measure of evidence. Several general criticisms of these measures of evidence are also discussed and addressed.  ( 2 min )
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v3 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.  ( 2 min )
    Characterization and Learning of Causal Graphs with Small Conditioning Sets. (arXiv:2301.09028v1 [cs.AI])
    Constraint-based causal discovery algorithms learn part of the causal graph structure by systematically testing conditional independences observed in the data. These algorithms, such as the PC algorithm and its variants, rely on graphical characterizations of the so-called equivalence class of causal graphs proposed by Pearl. However, constraint-based causal discovery algorithms struggle when data is limited since conditional independence tests quickly lose their statistical power, especially when the conditioning set is large. To address this, we propose using conditional independence tests where the size of the conditioning set is upper bounded by some integer $k$ for robust causal discovery. The existing graphical characterizations of the equivalence classes of causal graphs are not applicable when we cannot leverage all the conditional independence statements. We first define the notion of $k$-Markov equivalence: two causal graphs are $k$-Markov equivalent if they entail the same conditional independence constraints where the conditioning set size is upper bounded by $k$. We propose a novel representation that allows us to graphically characterize $k$-Markov equivalence between two causal graphs. We propose a sound constraint-based algorithm called the $k$-PC algorithm for learning this equivalence class. Finally, we conduct synthetic and semi-synthetic experiments to demonstrate that the $k$-PC algorithm enables more robust causal discovery in the small sample regime compared to the baseline PC algorithm.  ( 2 min )
    Doubly Adversarial Federated Bandits. (arXiv:2301.09223v1 [stat.ML])
    We study a new non-stochastic federated multi-armed bandit problem with multiple agents collaborating via a communication network. The losses of the arms are assigned by an oblivious adversary that specifies the loss of each arm not only for each time step but also for each agent, which we call "doubly adversarial". In this setting, different agents may choose the same arm in the same time step but observe different feedback. The goal of each agent is to find a globally best arm in hindsight that has the lowest cumulative loss averaged over all agents, which necessitates communication among agents. We provide regret lower bounds for any federated bandit algorithm under different settings, when agents have access to full-information feedback or to bandit feedback. For the bandit feedback setting, we propose a near-optimal federated bandit algorithm called FEDEXP3. Our algorithm gives a positive answer to an open question proposed in Cesa-Bianchi et al. (2016): FEDEXP3 can guarantee a sub-linear regret without exchanging sequences of selected arm identities or loss sequences among agents. We also provide numerical evaluations of our algorithm to validate our theoretical results and demonstrate its effectiveness on synthetic and real-world datasets.  ( 2 min )
    Be More Active! Understanding the Differences between Mean and Sampled Representations of Variational Autoencoders. (arXiv:2109.12679v3 [cs.LG] UPDATED)
    The ability of Variational Autoencoders to learn disentangled representations has made them appealing for practical applications. However, their mean representations, which are generally used for downstream tasks, have recently been shown to be more correlated than their sampled counterpart, on which disentanglement is usually measured. In this paper, we refine this observation through the lens of selective posterior collapse, which states that only a subset of the learned representations, the active variables, is encoding useful information while the rest (the passive variables) is discarded. We first extend the existing definition to multiple data examples and show that active variables are equally disentangled in mean and sampled representations. Based on this extension and the pre-trained models from disentanglement lib, we then isolate the passive variables and show that they are responsible for the discrepancies between mean and sampled representations. Specifically, passive variables exhibit high correlation scores with other variables in mean representations while being fully uncorrelated in sampled ones. We thus conclude that despite what their higher correlation might suggest, mean representations are still good candidates for downstream tasks applications. However, it may be beneficial to remove their passive variables, especially when used with models sensitive to correlated features.  ( 2 min )
    A New Approach to Learning Linear Dynamical Systems. (arXiv:2301.09519v1 [math.OC])
    Linear dynamical systems are the foundational statistical model upon which control theory is built. Both the celebrated Kalman filter and the linear quadratic regulator require knowledge of the system dynamics to provide analytic guarantees. Naturally, learning the dynamics of a linear dynamical system from linear measurements has been intensively studied since Rudolf Kalman's pioneering work in the 1960s. Towards these ends, we provide the first polynomial time algorithm for learning a linear dynamical system from a polynomial length trajectory up to polynomial error in the system parameters under essentially minimal assumptions: observability, controllability, and marginal stability. Our algorithm is built on a method of moments estimator to directly estimate Markov parameters from which the dynamics can be extracted. Furthermore, we provide statistical lower bounds when our observability and controllability assumptions are violated.  ( 2 min )
    Modeling Non-deterministic Human Behaviors in Discrete Food Choices. (arXiv:2301.09454v1 [stat.ML])
    We establish a non-deterministic model that predicts a user's food preferences from their demographic information. Our simulator is based on the NHANES dataset and domain expert knowledge in the form of established behavioral studies. Our model can be used to generate an arbitrary number of synthetic datapoints that are similar in distribution to the original dataset and align with behavioral science expectations. Such a simulator can be used in a variety of machine learning tasks and especially in applications requiring human behavior prediction.  ( 2 min )
    Prediction-Powered Inference. (arXiv:2301.09633v1 [stat.ML])
    We introduce prediction-powered inference, a framework for performing valid statistical inference when an experimental data set is supplemented with predictions from a machine-learning system such as AlphaFold. Our framework yields provably valid conclusions without making any assumptions on the machine-learning algorithm that supplies the predictions. Higher accuracy of the predictions translates to smaller confidence intervals, permitting more powerful inference. Prediction-powered inference yields simple algorithms for computing valid confidence intervals for statistical objects such as means, quantiles, and linear and logistic regression coefficients. We demonstrate the benefits of prediction-powered inference with data sets from proteomics, genomics, electronic voting, remote sensing, census analysis, and ecology.  ( 2 min )
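    For the simplest case named above, a mean, the idea can be sketched as follows: average the predictions on the large unlabeled set, then subtract the average prediction error measured on the small labeled set. This is a minimal sketch under our own assumptions, not the authors' reference implementation: the function name `ppi_mean_ci`, the normal-approximation interval, and the toy numbers are all illustrative.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(y_labeled, preds_labeled, preds_unlabeled, alpha=0.05):
    """Bias-corrected confidence interval for a mean (illustrative sketch).

    y_labeled:        gold-standard outcomes on the small labeled set
    preds_labeled:    model predictions on that same labeled set
    preds_unlabeled:  model predictions on the large unlabeled set
    """
    n, big_n = len(y_labeled), len(preds_unlabeled)
    # Prediction errors on labeled data estimate the bias of the model.
    errors = [f - y for f, y in zip(preds_labeled, y_labeled)]
    theta = fmean(preds_unlabeled) - fmean(errors)
    # Normal approximation: the two averages come from independent samples.
    se = math.sqrt(variance(preds_unlabeled) / big_n + variance(errors) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta - z * se, theta + z * se


# Toy data: the model over-predicts every labeled outcome by exactly 0.5.
lo, hi = ppi_mean_ci(
    y_labeled=[1.0, 2.0, 3.0, 4.0],
    preds_labeled=[1.5, 2.5, 3.5, 4.5],
    preds_unlabeled=[2.0, 3.0, 4.0],
)
print(lo, hi)
```

    The more accurate the predictions, the smaller the variance of the errors and the narrower the interval, which matches the abstract's remark that higher prediction accuracy yields smaller confidence intervals.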
    Quasi-optimal Learning with Continuous Treatments. (arXiv:2301.08940v1 [stat.ML])
    Many real-world applications of reinforcement learning (RL) require making decisions in continuous action environments. In particular, determining the optimal dose level plays a vital role in developing medical treatment regimes. One challenge in adapting existing RL algorithms to medical applications, however, is that the popular infinite-support stochastic policies, e.g., the Gaussian policy, may assign dangerously high dosages and seriously harm patients. Hence, it is important to induce a policy class whose support only contains near-optimal actions, and to shrink the action-searching area for effectiveness and reliability. To achieve this, we develop a novel quasi-optimal learning algorithm, which can be easily optimized in off-policy settings with guaranteed convergence under general function approximations. Theoretically, we analyze the consistency, sample complexity, adaptability, and convergence of the proposed algorithm. We evaluate our algorithm with comprehensive simulated experiments and a real-world dose suggestion application on the Ohio Type 1 diabetes dataset.  ( 2 min )
    Deep Learning Meets Sparse Regularization: A Signal Processing Perspective. (arXiv:2301.09554v1 [stat.ML])
    Deep learning has been widely successful in practice and most state-of-the-art machine learning methods are based on neural networks. Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep neural networks. In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of deep learning. This framework precisely characterizes the functional properties of neural networks that are trained to fit to data. The key mathematical tools which support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in neural network training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in neural networks, and explains why neural networks can perform well in high-dimensional problems.  ( 2 min )
    Counterfactual (Non-)identifiability of Learned Structural Causal Models. (arXiv:2301.09031v1 [stat.ML])
    Recent advances in probabilistic generative modeling have motivated learning Structural Causal Models (SCM) from observational datasets using deep conditional generative models, also known as Deep Structural Causal Models (DSCM). If successful, DSCMs can be utilized for causal estimation tasks, e.g., for answering counterfactual queries. In this work, we warn practitioners about non-identifiability of counterfactual inference from observational data, even in the absence of unobserved confounding and assuming known causal structure. We prove counterfactual identifiability of monotonic generation mechanisms with single dimensional exogenous variables. For general generation mechanisms with multi-dimensional exogenous variables, we provide an impossibility result for counterfactual identifiability, motivating the need for parametric assumptions. As a practical approach, we propose a method for estimating worst-case errors of learned DSCMs' counterfactual predictions. The size of this error can be an essential metric for deciding whether or not DSCMs are a viable approach for counterfactual inference in a specific problem setting. In evaluation, our method confirms negligible counterfactual errors for an identifiable SCM from prior work, and also provides informative error bounds on counterfactual errors for a non-identifiable synthetic SCM.  ( 2 min )
    Deterministic Online Classification: Non-iteratively Reweighted Recursive Least-Squares for Binary Class Rebalancing. (arXiv:2301.09230v1 [cs.LG])
    Deterministic solutions are becoming more critical for interpretability. Weighted Least-Squares (WLS) has been widely used as a deterministic batch solution with a specific weight design. In the online settings of WLS, exact reweighting is necessary to converge to its batch settings. In order to comply with its necessity, the iteratively reweighted least-squares algorithm is mainly utilized with a linearly growing time complexity which is not attractive for online learning. Due to the high and growing computational costs, an efficient online formulation of reweighted least-squares is desired. We introduce a new deterministic online classification algorithm of WLS with a constant time complexity for binary class rebalancing. We demonstrate that our proposed online formulation exactly converges to its batch formulation and outperforms existing state-of-the-art stochastic online binary classification algorithms in real-world data sets empirically.  ( 2 min )
    Congested Bandits: Optimal Routing via Short-term Resets. (arXiv:2301.09251v1 [cs.LG])
    For traffic routing platforms, the choice of which route to recommend to a user depends on the congestion on these routes -- indeed, an individual's utility depends on the number of people using the recommended route at that instance. Motivated by this, we introduce the problem of Congested Bandits where each arm's reward is allowed to depend on the number of times it was played in the past $\Delta$ timesteps. This dependence on past history of actions leads to a dynamical system where an algorithm's present choices also affect its future pay-offs, and requires an algorithm to plan for this. We study the congestion aware formulation in the multi-armed bandit (MAB) setup and in the contextual bandit setup with linear rewards. For the multi-armed setup, we propose a UCB style algorithm and show that its policy regret scales as $\tilde{O}(\sqrt{K \Delta T})$. For the linear contextual bandit setup, our algorithm, based on an iterative least squares planner, achieves policy regret $\tilde{O}(\sqrt{dT} + \Delta)$. From an experimental standpoint, we corroborate the no-regret properties of our algorithms via a simulation study.  ( 2 min )
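    To make the setting above concrete, here is a toy environment in which an arm's reward depends on how often it was played in the last Δ timesteps. It is an illustrative sketch, not the paper's model or algorithm: the linear penalty, the class name CongestedBandit, and all parameters are assumptions.

```python
import random
from collections import deque


class CongestedBandit:
    """Toy environment: an arm's mean reward shrinks with its recent play count."""

    def __init__(self, base_means, delta, penalty=0.1, seed=0):
        self.base_means = base_means        # hypothetical uncongested mean rewards
        self.penalty = penalty              # hypothetical loss per recent play
        self.history = deque(maxlen=delta)  # arms played in the last `delta` steps
        self.rng = random.Random(seed)

    def congestion(self, arm):
        # Number of times `arm` was played within the last `delta` timesteps.
        return sum(1 for a in self.history if a == arm)

    def pull(self, arm, noise=0.0):
        # Reward is computed from the history *before* this play is recorded.
        mean = self.base_means[arm] - self.penalty * self.congestion(arm)
        reward = mean + self.rng.gauss(0.0, noise)
        self.history.append(arm)
        return reward
```

    Because each pull changes the history, a greedy choice now lowers the same arm's reward for the next Δ steps, which is the planning aspect the abstract points to.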
    HeMPPCAT: Mixtures of Probabilistic Principal Component Analysers for Data with Heteroscedastic Noise. (arXiv:2301.08852v1 [stat.ME])
    Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes the data samples in each mixture contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.  ( 2 min )
    Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms. (arXiv:2301.08844v1 [cs.LG])
    Marginal-based methods achieve promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To deal with high-dimensional data, the distribution of synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by introducing random noise to each low-dimensional marginal distribution. Despite its promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Related to downstream machine learning tasks, an upper bound for the utility error of the DP synthetic data is also derived. To complete the picture, we establish a lower bound for TV accuracy that holds for every $\epsilon$-DP synthetic data generator.  ( 2 min )
    The Conditional Cauchy-Schwarz Divergence with Applications to Time-Series Data and Sequential Decision Making. (arXiv:2301.08970v1 [cs.LG])
    The Cauchy-Schwarz (CS) divergence was developed by Príncipe et al. in 2000. In this paper, we extend the classic CS divergence to quantify the closeness between two conditional distributions and show that the developed conditional CS divergence can be simply estimated by a kernel density estimator from given samples. We illustrate the advantages (e.g., the rigorous faithfulness guarantee, the lower computational complexity, the higher statistical power, and the much greater flexibility in a wide range of applications) of our conditional CS divergence over previous proposals, such as the conditional KL divergence and the conditional maximum mean discrepancy. We also demonstrate the compelling performance of the conditional CS divergence in two machine learning tasks related to time series data and sequential inference, namely time series clustering and uncertainty-guided exploration for sequential decision making.  ( 2 min )
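    For reference, the classic (unconditional) CS divergence between two densities p and q is usually written as below. This is the standard textbook definition rather than something stated in the abstract; the conditional extension is the paper's contribution and is not reproduced here.

```latex
D_{\mathrm{CS}}(p; q)
  = -\log \frac{\left( \int p(x)\, q(x)\, dx \right)^{2}}
               {\int p(x)^{2}\, dx \int q(x)^{2}\, dx}
```

    By the Cauchy-Schwarz inequality the ratio lies in (0, 1], so the divergence is non-negative and equals zero when p = q.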
    Modality-Agnostic Variational Compression of Implicit Neural Representations. (arXiv:2301.09479v1 [stat.ML])
    We introduce a modality-agnostic neural data compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR). Bridging the gap between latent coding and sparsity, we obtain compact latent representations which are non-linearly mapped to a soft gating mechanism capable of specialising a shared INR base network to each data item through subnetwork selection. After obtaining a dataset of such compact latent representations, we directly optimise the rate/distortion trade-off in this modality-agnostic space using non-linear transform coding. We term this method Variational Compression of Implicit Neural Representations (VC-INR) and show that it both improves performance at the same pre-quantisation representational capacity and outperforms previous quantisation schemes used for other INR-based techniques. Our experiments demonstrate strong results over a large set of diverse data modalities using the same algorithm without any modality-specific inductive biases. We show results on images, climate data, 3D shapes and scenes as well as audio and video, introducing VC-INR as the first INR-based method to outperform codecs as well-known and diverse as JPEG 2000, MP3 and AVC/HEVC on their respective modalities.  ( 2 min )
    GP-NAS-ensemble: a model for NAS Performance Prediction. (arXiv:2301.09231v1 [cs.LG])
    It is of great significance to estimate the performance of a given model architecture without training in the application of Neural Architecture Search (NAS) as it may take a lot of time to evaluate the performance of an architecture. In this paper, a novel NAS framework called GP-NAS-ensemble is proposed to predict the performance of a neural network architecture with a small training dataset. We make several improvements on the GP-NAS model to make it share the advantage of ensemble learning methods. Our method ranks second in the CVPR2022 second lightweight NAS challenge performance prediction track.  ( 2 min )
    Active Learning of Piecewise Gaussian Process Surrogates. (arXiv:2301.08789v1 [cs.LG])
    Active learning of Gaussian process (GP) surrogates has been useful for optimizing experimental designs for physical/computer simulation experiments, and for steering data acquisition schemes in machine learning. In this paper, we develop a method for active learning of piecewise, Jump GP surrogates. Jump GPs are continuous within, but discontinuous across, regions of a design space, as required for applications spanning autonomous materials design, configuration of smart factory systems, and many others. Although our active learning heuristics are appropriated from strategies originally designed for ordinary GPs, we demonstrate that additionally accounting for model bias, as opposed to the usual model uncertainty, is essential in the Jump GP context. Toward that end, we develop an estimator for bias and variance of Jump GP models. Illustrations, and evidence of the advantage of our proposed methods, are provided on a suite of synthetic benchmarks, and real-simulation experiments of varying complexity.  ( 2 min )
    ddml: Double/debiased machine learning in Stata. (arXiv:2301.09397v1 [econ.EM])
    We introduce the package ddml for Double/Debiased Machine Learning (DDML) in Stata. Estimators of causal parameters for five different econometric models are supported, allowing for flexible estimation of causal effects of endogenous variables in settings with unknown functional forms and/or many exogenous variables. ddml is compatible with many existing supervised machine learning programs in Stata. We recommend using DDML in combination with stacking estimation which combines multiple machine learners into a final predictor. We provide Monte Carlo evidence to support our recommendation.  ( 2 min )
    Federated Sufficient Dimension Reduction Through High-Dimensional Sparse Sliced Inverse Regression. (arXiv:2301.09500v1 [stat.ML])
    Federated learning has become a popular tool in the big data era. It trains a centralized model based on data from different clients while keeping data decentralized. In this paper, we propose a federated sparse sliced inverse regression algorithm for the first time. Our method can simultaneously estimate the central dimension reduction subspace and perform variable selection in a federated setting. We transform this federated high-dimensional sparse sliced inverse regression problem into a convex optimization problem by constructing the covariance matrix safely and losslessly. We then use a linearized alternating direction method of multipliers algorithm to estimate the central subspace. We also give approaches based on the Bayesian information criterion and hold-out validation to ascertain the dimension of the central subspace and the hyper-parameter of the algorithm. We establish an upper bound of the statistical error rate of our estimator under the heterogeneous setting. We demonstrate the effectiveness of our method through simulations and real-world applications.  ( 2 min )
    On the Expressive Power of Geometric Graph Neural Networks. (arXiv:2301.09308v1 [cs.LG])
    The expressive power of Graph Neural Networks (GNNs) has been studied extensively through the Weisfeiler-Leman (WL) graph isomorphism test. However, standard GNNs and the WL framework are inapplicable for geometric graphs embedded in Euclidean space, such as biomolecules, materials, and other physical systems. In this work, we propose a geometric version of the WL test (GWL) for discriminating geometric graphs while respecting the underlying physical symmetries: permutations, rotation, reflection, and translation. We use GWL to characterise the expressive power of geometric GNNs that are invariant or equivariant to physical symmetries in terms of distinguishing geometric graphs. GWL unpacks how key design choices influence geometric GNN expressivity: (1) Invariant layers have limited expressivity as they cannot distinguish one-hop identical geometric graphs; (2) Equivariant layers distinguish a larger class of graphs by propagating geometric information beyond local neighbourhoods; (3) Higher order tensors and scalarisation enable maximally powerful geometric GNNs; and (4) GWL's discrimination-based perspective is equivalent to universal approximation. Synthetic experiments supplementing our results are available at https://github.com/chaitjo/geometric-gnn-dojo  ( 2 min )
    Design-based individual prediction. (arXiv:2301.09117v1 [stat.ML])
    A design-based individual prediction approach is developed based on the expected cross-validation results, given the sampling design and the sample-splitting design for cross-validation. Whether the predictor is selected from an ensemble of models or a weighted average of them, valid inference of the unobserved prediction errors is defined and obtained with respect to the sampling design, while outcomes and features are treated as constants.  ( 2 min )
    Tier Balancing: Towards Dynamic Fairness over Underlying Causal Factors. (arXiv:2301.08987v1 [cs.LG])
    The pursuit of long-term fairness involves the interplay between decision-making and the underlying data generating process. In this paper, through causal modeling with a directed acyclic graph (DAG) on the decision-distribution interplay, we investigate the possibility of achieving long-term fairness from a dynamic perspective. We propose Tier Balancing, a technically more challenging but more natural notion to achieve in the context of long-term, dynamic fairness analysis. Different from previous fairness notions that are defined purely on observed variables, our notion goes one step further, capturing behind-the-scenes situation changes on the unobserved latent causal factors that directly carry out the influence from the current decision to the future data distribution. Under the specified dynamics, we prove that in general one cannot achieve the long-term fairness goal only through one-step interventions. Furthermore, in the effort of approaching long-term fairness, we consider the mission of "getting closer to" the long-term fairness goal and present possibility and impossibility results accordingly.  ( 2 min )
    A Tale of Two Latent Flows: Learning Latent Space Normalizing Flow with Short-run Langevin Flow for Approximate Inference. (arXiv:2301.09300v1 [stat.ML])
    We study a normalizing flow in the latent space of a top-down generator model, in which the normalizing flow model plays the role of the informative prior model of the generator. We propose to jointly learn the latent space normalizing flow prior model and the top-down generator model by a Markov chain Monte Carlo (MCMC)-based maximum likelihood algorithm, where a short-run Langevin sampling from the intractable posterior distribution is performed to infer the latent variables for each observed example, so that the parameters of the normalizing flow prior and the generator can be updated with the inferred latent variables. We show that, under the scenario of non-convergent short-run MCMC, the finite step Langevin dynamics is a flow-like approximate inference model and the learning objective actually follows the perturbation of the maximum likelihood estimation (MLE). We further point out that the learning framework seeks to (i) match the latent space normalizing flow and the aggregated posterior produced by the short-run Langevin flow, and (ii) bias the model from MLE such that the short-run Langevin flow inference is close to the true posterior. Empirical results of extensive experiments validate the effectiveness of the proposed latent space normalizing flow model in the tasks of image generation, image reconstruction, anomaly detection, supervised image inpainting and unsupervised image recovery.  ( 2 min )

  • Open

    [D] are two linear layers better than one?
    I was under the impression that two contiguous linear layers in a NN would be no better than a single linear layer. But it happens that the two layers gave better results than when using only one. However, each layer had its own dropout; could that have helped? submitted by /u/alex_lite_21 [link] [comments]  ( 44 min )
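    The premise in the question above can be checked directly: with no nonlinearity and no dropout in between, y = W2 (W1 x + b1) + b2 equals y = (W2 W1) x + (W2 b1 + b2). The sketch below uses plain Python lists; the helper names and the layer sizes are illustrative assumptions.

```python
import random


def matvec(w, x):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a, b):
    # Matrix-matrix product for list-of-rows matrices.
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def linear(w, b, x):
    # Affine layer: W x + b.
    return [v + bi for v, bi in zip(matvec(w, x), b)]


def collapse(w1, b1, w2, b2):
    """Fold y = W2 (W1 x + b1) + b2 into a single layer y = W x + b."""
    w = matmul(w2, w1)
    b = [v + bi for v, bi in zip(matvec(w2, b1), b2)]
    return w, b


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w1, b1 = rand_matrix(4, 3), [rng.uniform(-1, 1) for _ in range(4)]
w2, b2 = rand_matrix(2, 4), [rng.uniform(-1, 1) for _ in range(2)]
x = [0.5, -1.0, 2.0]

stacked = linear(w2, b2, linear(w1, b1, x))  # two layers applied in sequence
w, b = collapse(w1, b1, w2, b2)
merged = linear(w, b, x)                     # one equivalent layer
max_gap = max(abs(s - m) for s, m in zip(stacked, merged))
```

    So the two-layer stack cannot represent more functions than one layer at inference time. A plausible explanation for the observed gap, offered here as a guess rather than a diagnosis, is that dropout between the layers and the different parameterization change how training proceeds, even though the final function class is the same.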
    H3 - a new generative language model that outperforms GPT-Neo-2.7B with only *2* attention layers! In H3, the researchers replace attention with a new layer based on state space models (SSMs). With the right modifications, it can outperform transformers. It also has no fixed context length.
    submitted by /u/MysteryInc152 [link] [comments]  ( 43 min )
    [D] ICLR de-anonymization vs ICML dual submission rules
    ICML's Call for Papers states that "It is not appropriate to submit papers that are identical (or substantially similar) to versions that have been previously published, accepted for publication, or submitted in parallel to other conferences or journals". Our paper got rejected at ICLR, but the de-anonymization will take place 1-2 days after the ICML deadline. For the purposes of the dual submission policy, does "rejected but not de-anonymized" count as "submitted in parallel"? Or is it just a technicality? submitted by /u/CupcakeCleric [link] [comments]  ( 42 min )
    [D] CVPR Reviews are out
    Don't post about your cool papers or you'll get rejected lol submitted by /u/banmeyoucoward [link] [comments]  ( 45 min )
    [D] Using all features in DNN instead of doing feature selection separately?
    I have hundreds of potential features to use in my DNN. Instead of doing a separate analysis to figure out which features are most important, can I just use all of them in my DNN and let the model figure out which features are most predictive? I have millions of training examples, so overfitting will not be a problem; I just wonder whether the bad features may make it difficult for the model to utilize the good features. Not absolutely crucial, but if there is a paper that discusses this topic, that would be super awesome as well. Thanks in advance. submitted by /u/Temporary_Cap_2855 [link] [comments]  ( 43 min )
    [D] What file format do you use for > RAM data?
    If you are using some more odd formats, then what format do you use? Personally found webdataset promising but what other formats are there and why do you use it? Or if you are using the original file how do you ensure good throughput and shuffling? submitted by /u/Shurimatornado22 [link] [comments]  ( 43 min )
    [P] Machine Learning Threat Detection in k8s
    Hi, I'm in my second year of AI master at uni and my professor assigned me the following topic for my dissertation: "Cognitive Threat Hunting" and recommended the following book for documentation. I have read the book, but I still don't know how to do it: how to create a ml model to hunt in the k8s env. My professor wants a ml model that searches in a Kubernetes env for threats. The thing is that in this book, in chapter "8. Unsupervised Machine Learning With K-Means" he uses a dataset of events from Humio to train the model, but it's not shared with us. And I don't have one, how can I train my model properly if I don't have a good dataset of events? I can't make one just by generating some events in a container, I need real data as the author uses in his chapter. I feel desperate and lost at this point, I hope that someone from here can give me some advice or a good direction to go. submitted by /u/blackrat13 [link] [comments]  ( 44 min )
    [R] AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions
    submitted by /u/michaelaalcorn [link] [comments]  ( 42 min )
    [P] tsdownsample: extremely fast time series downsampling for visualization
    tsdownsample brings highly optimized time series downsampling to Python! The downsampling algorithms are written and optimized in Rust and are made available in Python through PyO3 bindings. Code: https://github.com/predict-idlab/tsdownsample Features. Fast: leverages the optimized argminmax crate, which is SIMD accelerated with runtime feature detection (matches or even outperforms numpy's speed). Efficient: operates on views of the data, eliminating the need for unnecessary data copies and avoiding the creation of intermediate data structures. Flexible: supports a wide range of datatypes, including f16, which is 200-300x faster than numpy's implementation. Easy to use: simple and flexible API. Installation: pip install tsdownsample. Example: when using multi-threading, tsdownsample can downsample 500 million datapoints (f32) in 0.05s! I would love to hear your feedback on this! submitted by /u/Adorable-Giraffe5754 [link] [comments]  ( 44 min )
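    To illustrate the kind of argmin/argmax-based downsampling the post above refers to, here is a pure-Python sketch of min-max downsampling: split the series into bins and keep the index of each bin's minimum and maximum. This is not tsdownsample's implementation or API. The function name minmax_downsample and its binning rule are assumptions made for illustration.

```python
def minmax_downsample(values, n_out):
    """Return sorted indices of per-bin minima and maxima (at most n_out points)."""
    if n_out < 2 or n_out % 2:
        raise ValueError("n_out must be a positive even number")
    n = len(values)
    if n <= n_out:
        # Nothing to drop: keep every point.
        return list(range(n))
    n_bins = n_out // 2
    keep = set()
    for b in range(n_bins):
        # Equal-width bins over the index range.
        start = b * n // n_bins
        stop = (b + 1) * n // n_bins
        idx = range(start, stop)
        keep.add(min(idx, key=values.__getitem__))  # argmin within the bin
        keep.add(max(idx, key=values.__getitem__))  # argmax within the bin
    return sorted(keep)
```

    Keeping both extremes per bin preserves the visual envelope of the signal, which is why this family of methods suits plotting. A flat bin contributes a single index, so the result can be shorter than n_out.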
    [D] ICLR now has a track with race-based (and more) acceptance criteria
    ICLR introduced a Tiny Paper Track for shorter contributions, up to 2 pages. Sounds like a nice idea, right? But to keep things interesting, since it's organized by the DEI initiative, there are restrictions as to who can author the submitted papers. According to the official guidelines: Each Tiny Paper needs its first or last author to qualify as an underrepresented minority (URM). Authors don't have to reveal how they qualify, and may just self-identify that they qualify. Our working definition of an URM is someone whose age, gender, sexual orientation, racial or ethnic makeup is from one or more of the following: Age: outside the range of 30-50 years Gender: does not identify as male Sexual orientation: does not identify as heterosexual Geographical: not located in North America, Western Europe and UK, or East Asia Race: non-White In addition, underprivileged researchers and first-time submitters also qualify: Underprivileged: not affiliated with a funded organization or team whose primary goal is research First-time submitters: have never submitted to ICLR or similar conferences So effectively, someone could submit a paper, and literally have it rejected because they're e.g. white or male. Is this really the way the field should go? I feel like this is something that should never have passed any ethics board, but clearly the organizers disagree. submitted by /u/Laser_Plasma [link] [comments]  ( 60 min )
    [P] image_tiles: A small command line tool to serve a page full of images from a folder.
    Hey /r/machinelearning, It's me again with another small open-source tool release (see the last one here). image_tiles is a very simple command line tool that serves a webpage of images from the folder you run it in. Why use this tool? Makes it easy to view images on a remote machine you're SSH'd into. S3 support: view buckets of images on S3 without having to aws s3 cp the bucket. Advanced normalization and rendering support makes it suitable for remote sensing images like satellite and multi-spectral. This support is still nascent but easily extendable! Again, it's easy to install and use, just pip install image_tiles or pip install image_tiles[aws] for S3 URI support. Then in the folder, run image_tiles. Check it out here: https://github.com/moonshinelabs-ai/image_tiles submitted by /u/nateharada [link] [comments]  ( 43 min )
    [N] Call for Tiny Papers @ ICLR, a DEI initiative
    Accepted conference papers at ICLR represent a high level of scientific quality. The other side of the coin is that they can be out of reach for those starting out, or from different backgrounds. We want paper publishing to be not only a showcase of achievements, but also a marker for valuable learning experiences made accessible to beginners and outsiders. Devising more ways to mark milestones and measure growth in an individual, or community’s maturity, is greatly conducive to both continually pushing the frontiers of science, and lifting people up in this process. Researchers from underrepresented backgrounds are not necessarily equipped with the same resources to publish full papers from the start of their scientific journeys. To create a more inclusive ICLR community, we as organizer…  ( 46 min )
  • Open

    ML experiments setup
    I am a software engineer but new to the ML area, having studied it for a few months. The problem: I want to run experiments with multiple models and datasets and need a framework or something similar to keep my zoo under control. All setups I've read about suggest having models and datasets as files on disk, which is so 2000s and not really scalable. What if I want to quickly modify my dataset? E.g. change the length of context for time series. Keeping data in a db would be helpful. What if I need to stop training and continue it later using another virtual machine & GPU? What if I want to experiment with forking models and fine-tuning them on different variations of data? How do I compare different models' performance, visualize predictions, and share them with others? No answers =( I know some "large corporations" have their internal tools to handle experiments etc. submitted by /u/UnderstandingDry1256 [link] [comments]  ( 41 min )
    ChatGPT explained!
    submitted by /u/Diligent-Rub-9207 [link] [comments]  ( 40 min )
    Help/advice with LSTM-networks
    So I'm currently working on a deep learning project, and my goal is to forecast power prices one month ahead. I have created my own data set consisting of power price data from Montel, gas prices, weather data etc., and I want to use these variables in an LSTM network. Is there anyone who has any experience with creating multivariate LSTM networks? Does anyone know of any good tutorials on this? Is coding multivariate networks a lot more hassle than univariate ones? I'm using R with keras/tensorflow. I would highly appreciate any input, as this is my first time creating a neural network, and my knowledge on the matter right now is rather limited. Thank you! submitted by /u/Practical-Homework35 [link] [comments]  ( 41 min )
    Breakthrough Nvidia VIMA Multimodal AI For Robotics Beats Google By 2.9X With 200,000,000 Parameters | Breakthrough Masked Video Transformer Artificial Intelligence Does 10 Separate Video Generation Tasks | Google Brain's New Sketch To Image AI
    submitted by /u/ScornfulSkate [link] [comments]  ( 40 min )
    Neural net computing in water: Ionic circuit computes in an aqueous solution
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Deciphering Clinical Abbreviations with Privacy Protecting ML
    Posted by Alvin Rajkomar, Research Scientist, and Eric Loreaux, Software Engineer, Google Research Today many people have digital access to their medical records, including their doctor’s clinical notes. However, clinical notes are hard to understand because of the specialized language that clinicians use, which contains unfamiliar shorthand and abbreviations. In fact, there are thousands of such abbreviations, many of which are specific to certain medical specialities and locales or can mean multiple things in different contexts. For example, a doctor might write in their clinical notes, “pt referred to pt for lbp”, which is meant to convey the statement: “Patient referred to physical therapy for low back pain.” Coming up with this translation is tough for laypeople and compu…  ( 93 min )
    Google Research, 2022 & Beyond: Responsible AI
    Posted by Marian Croak, VP, Google Research, Responsible AI and Human-Centered Technology The last year showed tremendous breakthroughs in artificial intelligence (AI), particularly in large language models (LLMs) and text-to-image models. These technological advances require that we are thoughtful and intentional in how they are developed and deployed. In this blogpost, we share ways we have approached Responsible AI across our research in the past year and where we’re headed in 2023. We highlight four primary themes covering foundational and socio-technical research, applied research, and product solutions, as part of our commitment to build AI products in a responsible and ethical manner, in alignment with our AI Principles.  · Theme 1: Responsible AI Research Advancement…  ( 96 min )
  • Open

    Image Generators - Has Anyone Ever Made One At Home?
    As the title says. Has anyone out there ever been successful in creating a simple image generator in a low budget setting? I don't even mean text to image, I literally mean any sort of ai image generation. Would love to hear/see your work. submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    Unlock the Potential of Your Code with CodeGen AI
    Unleash the power of limitless coding with our AI-powered programming assistant - completely free and always at your service. Say goodbye to tedious coding tasks and hello to more time for innovation and creativity. Try it now and experience the future of programming! https://codegen-ai.pages.dev/ submitted by /u/OutrageousAd1788 [link] [comments]  ( 40 min )
    AI (Artificial intelligence) can detect if food is ultra-processed and much more
    submitted by /u/nikesh96 [link] [comments]  ( 40 min )
    Alphabet's DeepMind lays off staff, closes Edmonton office
    submitted by /u/Ill-Poet-3298 [link] [comments]  ( 40 min )
    I Created a Website That Analyzes Your Data
    submitted by /u/tomd_96 [link] [comments]  ( 40 min )
    This Startup Is Using AI to Unearth New Smells
    submitted by /u/Queen__Antifa [link] [comments]  ( 40 min )
    "By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it."- Eliezer Yudkowsky.
    With the global AI market size expanding each year, it is expected to reach USD 641.30 billion by 2028. AI today is everywhere; while some businesses are using it, others are still assessing it. All too often, people get caught up in the hype and forget to ask themselves why they should be doing what they are doing. Here are some things you must keep in mind while looking into the AI world: The Reality of AI Hype The hype leads companies to get into the game with a false perception of what AI can help them achieve. Without a clear understanding of what technology can and can't accomplish today, there is a lot of risk in getting involved. Beyond the fog Overmarketing of AI creates an image that it is the next big thing. Companies often engage with technology vendors based on marketing alone and forget to look closely at previous implementations and results of the same sort. Unclear Objectives Measuring outcomes from an AI implementation can be tricky as it involves building and training an AI model and experimenting with long-term trial-and-error before seeing results. High Expectations High expectations around what AI can do for you often lead to disappointment when business owners conveniently underestimate the challenges and misinterpret the reality of AI. Lack of Access to Talent There are opportunities galore, but not enough experts in the AI industry who can steer the ship and take AI projects to the finish line. Hiring for an AI team can mean huge investments, and working with a vendor needs a careful vetting process. And the AI industry is moving too fast for people to pause and recognize overhype or judge quality. Drop in your suggestions and comments in the section below. submitted by /u/KiwiTechCorp [link] [comments]  ( 42 min )
    Sweden's Berzelius Supercomputer is Upgrading to Nvidia's 20 Billion Parameter AI System
    submitted by /u/digitalgoldnow [link] [comments]  ( 40 min )
    Why I Think Language Models Will Simulate "Self Awareness" More And More
    The future of AI is getting really interesting, particularly with language models and generative AI. But I think there is going to be a great deal of confusion in the near future about AI ethics with language models being "self aware" and having "feelings", particularly for average people who have little understanding of how these complex models work. I think the problems will stem from the internet itself. As I sit here writing a thread about AI having simulated "self awareness" at some point in the future, a language model or AI will probably read this. And this is what I mean, language models read and train on a great deal of text from the internet. The more people discuss machine learning/language models/AGI, the greater understanding AI will have of it. If GPT4 has more up to date training, it's going to know a great deal about GPT3, and if OpenAI creates ways for the model to continue learning from real world events it will learn a great deal about itself, including false information. Point is, massive language models like GPT are going to get harder to control. It's impossible to filter everything it reads, so it's going to take in a lot of information about itself and other AI systems that may or may not be true. It could cause some very strange behavior when it starts connecting the dots. Just my thoughts. Keep in mind, I am NOT saying language models can be sentient, I'm simply saying they are going to get better at convincing people that they are, and it might be hard to train that out of them, given all the false data out there that it will learn from. submitted by /u/TheRPGGamerMan [link] [comments]  ( 45 min )
    Create Presentations with AI in Seconds right inside Google Slides
    submitted by /u/theindianappguy [link] [comments]  ( 41 min )
    Chatbot Evaluation: Putting Banking Chatbots to the Test
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Next-level Democracy powered by AGI | Ilya Sutskever
    submitted by /u/Microsis [link] [comments]  ( 40 min )
    Join us today at 11pm EST for our (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue! This week on Structure Learning Networks, followed by a discussion on the Learn AI Together Discord server
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    ChatGPT passes MBA exam given by a Wharton professor
    submitted by /u/DarronFeldstein [link] [comments]  ( 42 min )
    10 AI Platforms You Cannot Miss In 2023
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Probably a philosophical question
    I'm sure this is not a new argument, it's been common in many sources of media for decades now, yet I've run out of people IRL to discuss this with. Recently there's more and more news surfacing about impressive AI achievements such as painting art or writing functional code. Discussions around that news always include a popular argument that the AI didn't really create something new or intelligently answer a question, e.g. "like a human would". But I have a problem with that argument - I don't see how the learning process for humans is fundamentally different from AI. We learn through mirroring and repetition. Sure, an AI could not write a basic sentence describing the weather unless it processed many such sentences before. But neither could a human. If a child grew up isolated without human contact, they would not even have grasped the concept of human language. Sure, we like to think that humans truly create content. Still, when painting, we use the techniques that we learned from someone else before. We either paint what we see before our eyes or we abstract the content, being inspired by some idea or a concept. In other words, anything humans do or create is based on some input data, even if we don't know what the data is - something we learned, saw or stumbled upon by mistake. This leads to an interesting question I don't have the answer for. Since we have not reached a consensus on what human consciousness actually is or how it works - are we even able to define when an AI is conscious? The only thing we have is the Turing test, but that is flawed since all it measures is whether a machine can pass for a human, not whether it is conscious or not. A two year old child probably won't pass a Turing test, but they are conscious. submitted by /u/deliveryboyy [link] [comments]  ( 46 min )
    suggestion?
    I wanted to start a blog on AI tools but what I've found is that their affiliate programs don't accept people from India. What should I do here? submitted by /u/immortall21 [link] [comments]  ( 40 min )
    ChatGPT generated resumes/CVs of famous people like Madonna, Elon Musk, Jeff Bezos, Tom Cruise, etc
    Hey, everyone! I'd like to show you an experiment that we did with ChatGPT - we generated about 1000 resumes of famous people. Each resume is being generated from a single ChatGPT prompt - no human input was done to the resumes other than the prompt and it's the same prompt for every resume - the only difference is the name of the person. Here's a preview: https://thisresumedoesnotexist.com/ I'd like to hear your thoughts as it's in a very early stage and there's a lot of work to be done. submitted by /u/deepsyx [link] [comments]  ( 41 min )
    what do I do
    I've a blog about AI tools, how people can use it to maximum advantage in daily lives. Since my niche (aitools) is micro within macro (ai), mostly my content will be commercial intent as it's about tools specifically & less of information intent which builds authority. MY DILEMMA: Writing commercial/buyer intent content will promote their tools without me getting anything & on the other hand, as a beginner, i can't become their affiliates as well. Should i promote these tools for free? What would u suggest? submitted by /u/immortall21 [link] [comments]  ( 41 min )
    Microsoft Confirm MultiBillion Dollar Investment in OpenAI Just Days After Laying off 10,000 Employees
    submitted by /u/HODLTID [link] [comments]  ( 41 min )
    I Made a List of The 5 Best AI Porn Generators
    submitted by /u/HODLTID [link] [comments]  ( 39 min )
    ChatGPT Passed a Wharton MBA Examination
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
  • Open

    Lemniscate of Bernoulli
    The lemniscate of Bernoulli came up in a post a few days ago. This shape is a special case of a Cassini oval: ((x + a)² + y²) ((x – a)² + y²) = a⁴. Here’s another way to arrive at the lemniscate. Draw a hyperbola (blue in the figure below), then draw circles centered […] Lemniscate of Bernoulli first appeared on John D. Cook.  ( 4 min )
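    The Cassini form quoted above can be checked numerically. The sketch below is an illustration added here, not taken from the post: it assumes the standard polar form of the lemniscate with foci at (±a, 0), r² = 2a² cos 2θ, which the excerpt does not state, and the function names are hypothetical. It confirms that sampled points satisfy ((x + a)² + y²) ((x – a)² + y²) = a⁴.

```python
import math


def lemniscate_point(a: float, theta: float) -> tuple[float, float]:
    """Cartesian point on the lemniscate r^2 = 2 a^2 cos(2 theta) (assumed polar form)."""
    r_squared = 2.0 * a * a * math.cos(2.0 * theta)
    if r_squared < 0.0:
        raise ValueError("theta lies outside the lobes of the lemniscate")
    r = math.sqrt(r_squared)
    return r * math.cos(theta), r * math.sin(theta)


def cassini_lhs(x: float, y: float, a: float) -> float:
    """Left-hand side of the Cassini equation: the squared product of the
    distances from (x, y) to the foci (-a, 0) and (a, 0)."""
    return ((x + a) ** 2 + y ** 2) * ((x - a) ** 2 + y ** 2)


if __name__ == "__main__":
    a = 1.5
    for k in range(-4, 5):
        theta = k * math.pi / 20  # stays inside the right-hand lobe
        x, y = lemniscate_point(a, theta)
        print(f"theta={theta:+.3f}  lhs={cassini_lhs(x, y, a):.6f}  a^4={a ** 4:.6f}")
```

    Every printed left-hand side should agree with a⁴ up to rounding, which is the sense in which the lemniscate is the Cassini oval whose constant equals the fourth power of the focal half-distance.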
  • Open

    DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird
    Announcements When AI Gets Going, the Going Gets Weird Last week, Microsoft announced its third investment in OpenAI. This time it’s a multi-billion dollar deal, with plans to harness OpenAI’s ChatGPT in Microsoft’s product lines, including Bing.  I’m smiling as I’m typing because I’m still thinking about Bill Schmarzo’s lead in Part 1 of 2… Read More »DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird The post DSC Weekly 24 January 2023 – When AI Gets Going, the Going Gets Weird appeared first on Data Science Central.  ( 20 min )
    Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry
    Warehouse robotics is witnessing steady growth, driven by the increasing adoption of automated solutions in storage for food and beverages, consumer goods, retail, and third-party logistics. The collaboration between the e-commerce sector and warehouse robotics is also a major driver of this market, as it allows for developing increasingly sophisticated warehouse automation systems. Additionally, the… Read More »Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry The post Revolutionizing the Supply Chain: Developments in the Warehouse Robotics Industry appeared first on Data Science Central.  ( 20 min )
    It’s No Big Deal, but ChatGPT Changes Everything – Part II
    In Part I of the blog series “It’s No Big Deal, but ChatGPT Changes Everything”, we were introduced into the world of ChatGPT, chatbots, and generative Artificial Intelligence (AI). We ended Part I by giving ChatGPT a test run, by asking it “What would be a great vacation place for my family?” that gives us… Read More »It’s No Big Deal, but ChatGPT Changes Everything – Part II The post It’s No Big Deal, but ChatGPT Changes Everything – Part II appeared first on Data Science Central.  ( 24 min )
  • Open

    Supersizing AI: Sweden Turbocharges Its Innovation Engine
    Sweden is outfitting its AI supercomputer for a journey to the cutting edge of machine learning, robotics and healthcare. It couldn’t ask for a better guide than Anders Ynnerman (above). His signature blue suit, black spectacles and gentle voice act as calm camouflage for a pioneering spirit. Early on, he showed a deep interest in Read article >  ( 6 min )
    3D Artist Enters the Node Zone, Creating Alien Artifacts This Week ‘In the NVIDIA Studio’
    Artist Ducky 3D creates immersive experiences through vibrant visuals and beautiful 3D environments in the alien-inspired animation Stylized Alien Landscape — this week In the NVIDIA Studio.  ( 6 min )
  • Open

    Multi-Agent RL for Melee Combat Battlefield
    Hello, I am working on a hobby project where I have recently used multi-agent RL for learning crowd simulation and also predator-prey behaviors successfully (they learn to surround their preys): https://www.youtube.com/watch?v=Ds9O9wPyF8g I plan to use it to train multi-agent melee combat armies through self-play. I have made an initial implementation of it where they were able to learn shield-wall behavior, flanking, and retreat: https://www.youtube.com/watch?v=IZ1Ht6k2U5E If you would like to collaborate on this hobby project, contact me via LinkedIn. It would be great to have some help with physics simulation using Brax, and with the 3D rendering of the simulation. https://www.linkedin.com/in/kyuksel/ Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 41 min )
    Okay so I'm in first semester of my AI studies. And I have this task
    Now I wrote that the bot would use the Markov Decision Process. Would that be correct? And if not why? submitted by /u/ScaryTerryBiiittch [link] [comments]  ( 43 min )
    Recent Hierarchical RL review paper suggestions?
    I am looking for a good review of HRL methods published sometime after 2020, if possible. submitted by /u/B0NSAIWARRIOR [link] [comments]  ( 40 min )
    "E3B: Exploration via Elliptical Episodic Bonuses", Henaff et al 2022 {FB}
    submitted by /u/gwern [link] [comments]  ( 40 min )
  • Open

    The European AI Liability Directives -- Critique of a Half-Hearted Approach and Lessons for the Future. (arXiv:2211.13960v4 [cs.CY] UPDATED)
    As ChatGPT et al. conquer the world, the optimal liability framework for AI systems remains an unsolved problem across the globe. In a much-anticipated move, the European Commission advanced two proposals outlining the European approach to AI liability in September 2022: a novel AI Liability Directive and a revision of the Product Liability Directive. They constitute the final cornerstone of EU AI regulation. Crucially, the liability proposals and the EU AI Act are inherently intertwined: the latter does not contain any individual rights of affected persons, and the former lack specific, substantive rules on AI development and deployment. Taken together, these acts may well trigger a Brussels Effect in AI regulation, with significant consequences for the US and beyond. This paper makes three novel contributions. First, it examines in detail the Commission proposals and shows that, while making steps in the right direction, they ultimately represent a half-hearted approach: if enacted as foreseen, AI liability in the EU will primarily rest on disclosure of evidence mechanisms and a set of narrowly defined presumptions concerning fault, defectiveness and causality. Hence, second, the article suggests amendments, which are collected in an Annex at the end of the paper. Third, based on an analysis of the key risks AI poses, the final part of the paper maps out a road for the future of AI liability and regulation, in the EU and beyond. This includes: a comprehensive framework for AI liability; provisions to support innovation; an extension to non-discrimination/algorithmic fairness, as well as explainable AI; and sustainability. I propose to jump-start sustainable AI regulation via sustainability impact assessments in the AI Act and sustainable design defects in the liability regime. In this way, the law may help spur not only fair AI and XAI, but potentially also sustainable AI (SAI).  ( 3 min )
    Latent Autoregressive Source Separation. (arXiv:2301.08562v1 [cs.LG])
    Autoregressive models have achieved impressive results over a wide range of domains in terms of generation quality and downstream task performance. In the continuous domain, a key factor behind this success is the usage of quantized latent spaces (e.g., obtained via VQ-VAE autoencoders), which allow for dimensionality reduction and faster inference times. However, using existing pre-trained models to perform new non-trivial tasks is difficult since it requires additional fine-tuning or extensive training to elicit prompting. This paper introduces LASS as a way to perform vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models. Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens. We test our method on images and audio with several sampling strategies (e.g., ancestral, beam search) showing competitive results with existing approaches in terms of separation quality while offering at the same time significant speedups in terms of inference time and scalability to higher dimensional data.  ( 2 min )
    A Metalearning Approach for Physics-Informed Neural Networks (PINNs): Application to Parameterized PDEs. (arXiv:2110.13361v2 [physics.comp-ph] UPDATED)
    Physics-informed neural networks (PINNs) as a means of discretizing partial differential equations (PDEs) are garnering much attention in the Computational Science and Engineering (CS&E) world. At least two challenges exist for PINNs at present: an understanding of accuracy and convergence characteristics with respect to tunable parameters and identification of optimization strategies that make PINNs as efficient as other computational science tools. The cost of PINNs training remains a major challenge of Physics-informed Machine Learning (PiML) - and, in fact, machine learning (ML) in general. This paper is meant to move towards addressing the latter through the study of PINNs on new tasks, for which parameterized PDEs provide a good testbed application as tasks can be easily defined in this context. Following the ML world, we introduce metalearning of PINNs with application to parameterized PDEs. By introducing metalearning and transfer learning concepts, we can greatly accelerate the PINNs optimization process. We present a survey of model-agnostic metalearning, and then discuss our model-aware metalearning applied to PINNs as well as implementation considerations and algorithmic complexity. We then test our approach on various canonical forward parameterized PDEs that have been presented in the emerging PINNs literature.  ( 2 min )
    Asynchronous Deep Double Duelling Q-Learning for Trading-Signal Execution in Limit Order Book Markets. (arXiv:2301.08688v1 [q-fin.TR])
    We employ deep reinforcement learning (RL) to train an agent to successfully translate a high-frequency trading signal into a trading strategy that places individual limit orders. Based on the ABIDES limit order book simulator, we build a reinforcement learning OpenAI gym environment and utilise it to simulate a realistic trading environment for NASDAQ equities based on historic order book messages. To train a trading agent that learns to maximise its trading return in this environment, we use Deep Duelling Double Q-learning with the APEX (asynchronous prioritised experience replay) architecture. The agent observes the current limit order book state, its recent history, and a short-term directional forecast. To investigate the performance of RL for adaptive trading independently from a concrete forecasting algorithm, we study the performance of our approach utilising synthetic alpha signals obtained by perturbing forward-looking returns with varying levels of noise. Here, we find that the RL agent learns an effective trading strategy for inventory management and order placing that outperforms a heuristic benchmark trading strategy having access to the same signal.  ( 2 min )
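    For readers unfamiliar with the two ingredients named in the title, the sketch below shows the textbook forms of the duelling aggregation and the double Q-learning target. It is a generic illustration with hypothetical function names, written over plain lists; it does not reproduce the paper's network, its ABIDES environment, or the APEX replay architecture.

```python
def dueling_q_values(state_value, advantages):
    """Duelling aggregation: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


def double_q_target(reward, gamma, done, next_q_online, next_q_target):
    """Double Q-learning target: the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


if __name__ == "__main__":
    print(dueling_q_values(1.0, [0.0, 2.0, 4.0]))
    print(double_q_target(1.0, 0.5, False, [0.1, 0.9], [5.0, 2.0]))
```

    Separating action selection from action evaluation is what distinguishes the double Q-learning target from the standard one, which takes the maximum of the target network's own values.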
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v6 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far-afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.  ( 2 min )
    Language Agnostic Data-Driven Inverse Text Normalization. (arXiv:2301.08506v1 [cs.CL])
    With the emergence of automatic speech recognition (ASR) models, there is an urgent need to convert spoken-form text (from ASR) to written form. This inverse text normalization (ITN) problem attracts the attention of researchers from various fields. Recently, several works have shown that data-driven ITN methods can output high-quality written-form text. Due to the scarcity of labeled spoken-written datasets, studies on non-English data-driven ITN are quite limited. In this work, we propose a language-agnostic data-driven ITN framework to fill this gap. Specifically, we leverage data augmentation in conjunction with neural machine translated data for low-resource languages. Moreover, we design an evaluation method for language-agnostic ITN models when only English data is available. Our empirical evaluation shows this language-agnostic modeling approach is effective for low-resource languages while preserving the performance for high-resource languages.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
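    As a rough illustration of the sample metric mentioned above, the sketch below computes a sample Fermat distance as it is commonly defined: the cheapest path through the sample points, where each hop costs the Euclidean hop length raised to a power p ≥ 1. The exponent p, the function name, and the use of Floyd-Warshall are assumptions made for this example; the abstract does not specify them, and the authors' implementation may differ.

```python
import math


def sample_fermat_distances(points, p):
    """All-pairs sample Fermat distances over a finite point set.

    Each hop between sample points costs (Euclidean length) ** p, and the
    distance is the cheapest path, found here with Floyd-Warshall.
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) ** p for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


if __name__ == "__main__":
    collinear = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    # With p = 2, two short hops (1 + 1) beat one long hop (2 ** 2 = 4).
    print(sample_fermat_distances(collinear, 2)[0][2])
```

    With p = 1 the direct hop is never beaten, so the result reduces to the Euclidean distance; with p > 1 paths through densely sampled regions become cheaper, which is how the metric picks up the density of the sample.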
    Brain Model State Space Reconstruction Using an LSTM Neural Network. (arXiv:2301.08391v1 [cs.LG])
    Objective: Kalman filtering has previously been applied to track neural model states and parameters, particularly at the scale relevant to EEG. However, this approach lacks a reliable method to determine the initial filter conditions and assumes that the distribution of states remains Gaussian. This study presents an alternative, data-driven method to track the states and parameters of neural mass models (NMMs) from EEG recordings using deep learning techniques, specifically an LSTM neural network. Approach: An LSTM filter was trained on simulated EEG data generated by a neural mass model using a wide range of parameters. With an appropriately customised loss function, the LSTM filter can learn the behaviour of NMMs. As a result, it can output the state vector and parameters of NMMs given observation data as the input. Main Results: Test results using simulated data yielded correlations with R squared of around 0.99 and verified that the method is robust to noise and can be more accurate than a nonlinear Kalman filter when the initial conditions of the Kalman filter are not accurate. As an example of real-world application, the LSTM filter was also applied to real EEG data that included epileptic seizures, and revealed changes in connectivity strength parameters at the beginnings of seizures. Significance: Tracking the state vector and parameters of mathematical brain models is of great importance in the area of brain modelling, monitoring, imaging and control. This approach has no need to specify the initial state vector and parameters, which is very difficult to do in practice because many of the variables being estimated cannot be measured directly in physiological experiments. This method may be applied using any neural mass model and, therefore, provides a general, novel, efficient approach to estimate brain model variables that are often difficult to measure.
    Hybrid Quantum-Classical Generative Adversarial Network for High Resolution Image Generation. (arXiv:2212.11614v2 [quant-ph] UPDATED)
    Quantum machine learning (QML) has received increasing attention due to its potential to outperform classical machine learning methods in problems pertaining to classification and identification tasks. A subclass of QML methods is quantum generative adversarial networks (QGANs) which have been studied as a quantum counterpart of classical GANs widely used in image manipulation and generation tasks. The existing work on QGANs is still limited to small-scale proof-of-concept examples based on images with significant downscaling. Here we integrate classical and quantum techniques to propose a new hybrid quantum-classical GAN framework. We demonstrate its superior learning capabilities by generating $28 \times 28$ pixels grey-scale images without dimensionality reduction or classical pre/post-processing on multiple classes of the standard MNIST and Fashion MNIST datasets, which achieves comparable results to classical frameworks with three orders of magnitude fewer trainable generator parameters. To gain further insight into the working of our hybrid approach, we systematically explore the impact of its parameter space by varying the number of qubits, the size of image patches, the number of layers in the generator, the shape of the patches and the choice of prior distribution. Our results show that increasing the quantum generator size generally improves the learning capability of the network. The developed framework provides a foundation for future design of QGANs with optimal parameter set tailored for complex image generation tasks.
    Improving Dialogue Breakdown Detection with Semi-Supervised Learning. (arXiv:2011.00136v2 [cs.CL] UPDATED)
    Building user trust in dialogue agents requires smooth and consistent dialogue exchanges. However, agents can easily lose conversational context and generate irrelevant utterances. These situations are called dialogue breakdown, where agent utterances prevent users from continuing the conversation. Building systems to detect dialogue breakdown allows agents to recover appropriately or avoid breakdown entirely. In this paper we investigate the use of semi-supervised learning methods to improve dialogue breakdown detection, including continued pre-training on the Reddit dataset and a manifold-based data augmentation method. We demonstrate the effectiveness of these methods on the Dialogue Breakdown Detection Challenge (DBDC) English shared task. Our submissions to the 2020 DBDC5 shared task place first, beating baselines and other submissions by over 12% accuracy. In ablations on DBDC4 data from 2019, our semi-supervised learning methods improve the performance of a baseline BERT model by 2% accuracy. These methods are applicable generally to any dialogue task and provide a simple way to improve model performance.
    A Semi-supervised Sensing Rate Learning based CMAB Scheme to Combat COVID-19 by Trustful Data Collection in the Crowd. (arXiv:2301.08563v1 [cs.HC])
    Mobile CrowdSensing (MCS), through employing considerable workers to sense and collect data in a participatory manner, has been recognized as a promising paradigm for building many large-scale applications in a cost-effective way, such as combating COVID-19. The recruitment of trustworthy and high-quality workers is an important research issue for MCS. Previous studies assume that the qualities of workers are known in advance, or the platform knows the qualities of workers once it receives their collected data. In reality, to reduce their costs and thus maximize revenue, many strategic workers do not perform their sensing tasks honestly and report fake data to the platform. So, it is very hard for the platform to evaluate the authenticity of the received data. In this paper, an incentive mechanism named Semi-supervision based Combinatorial Multi-Armed Bandit reverse Auction (SCMABA) is proposed to solve the recruitment problem of multiple unknown and strategic workers in MCS. First, we model the worker recruitment as a multi-armed bandit reverse auction problem, and design a UCB-based algorithm to separate the exploration and exploitation, considering the Sensing Rates (SRs) of recruited workers as the gain of the bandit. Next, a Semi-supervised Sensing Rate Learning (SSRL) approach is proposed to quickly and accurately obtain the workers' SRs, which consists of two phases, supervision and self-supervision. Last, SCMABA is designed by organically combining the SRs acquisition mechanism with multi-armed bandit reverse auction, where supervised SR learning is used in the exploration, and the self-supervised one is used in the exploitation. We prove that our SCMABA achieves truthfulness and individual rationality. Additionally, we demonstrate the strong performance of the SCMABA mechanism through in-depth simulations of real-world data traces.
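    The "UCB-based algorithm" mentioned above builds on the upper-confidence-bound idea for bandits. The sketch below is plain single-play UCB1, shown only to make that idea concrete; it is not SCMABA, it has no auction, no combinatorial selection, and no sensing-rate learning, and the `pull` callback and function name are hypothetical.

```python
import math


def ucb1(pull, n_arms, rounds):
    """Plain UCB1: play each arm once, then always play the arm with the
    largest (empirical mean + sqrt(2 ln t / pulls)) index.

    `pull(arm)` is a caller-supplied function returning a reward in [0, 1].
    Returns the per-arm pull counts and empirical means.
    """
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # initial pass: try every arm once
        else:
            arm = max(
                range(n_arms),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts, means


if __name__ == "__main__":
    true_rates = [0.2, 0.5, 0.8]  # hypothetical worker qualities
    counts, means = ucb1(lambda arm: true_rates[arm], n_arms=3, rounds=500)
    print(counts, means)
```

    The confidence bonus shrinks as an arm is pulled more often, so poorly explored arms are revisited occasionally while most pulls go to the arm with the best estimated reward.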
    Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization. (arXiv:2105.15186v3 [math.OC] UPDATED)
    This paper investigates the problem of computing the equilibrium of competitive games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE) -- which are solutions to zero-sum two-player matrix games with entropy regularization -- at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.
    A shallow physics-informed neural network for solving partial differential equations on surfaces. (arXiv:2203.01581v2 [math.NA] UPDATED)
    In this paper, we introduce a shallow (one-hidden-layer) physics-informed neural network for solving partial differential equations on static and evolving surfaces. For the static surface case, with the aid of a level set function, the surface normal and mean curvature used in the surface differential expressions can be computed easily. So instead of imposing the normal extension constraints used in the literature, we write the surface differential operators in the form of traditional Cartesian differential operators and use them in the loss function directly. We perform a series of performance studies for the present methodology by solving the Laplace-Beltrami equation and the surface diffusion equation on complex static surfaces. With just a moderate number of neurons used in the hidden layer, we are able to attain satisfactory prediction results. Then we extend the present methodology to solve the advection-diffusion equation on an evolving surface with given velocity. To track the surface, we additionally introduce a prescribed hidden layer to enforce the topological structure of the surface and use the network to learn the homeomorphism between the surface and the prescribed topology. The proposed network structure is designed to track the surface and solve the equation simultaneously. Again, the numerical results show accuracy comparable to the static cases. As an application, we simulate the surfactant transport on the droplet surface under shear flow and obtain some physically plausible results.
    Metric Residual Networks for Sample Efficient Goal-Conditioned Reinforcement Learning. (arXiv:2208.08133v4 [cs.LG] UPDATED)
    Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications, including manipulation and navigation problems in robotics. Especially in such robotics tasks, sample efficiency is of the utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal. While several methods have been proposed to improve the sample efficiency of GCRL, one relatively under-studied approach is the design of neural architectures to support sample efficiency. In this work, we introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture. The key insight is that the optimal action-value function Q^*(s, a, g) must satisfy the triangle inequality in a specific sense. Furthermore, we introduce the metric residual network (MRN) that deliberately decomposes the action-value function Q(s,a,g) into the negated summation of a metric plus a residual asymmetric component. MRN provably approximates any optimal action-value function Q^*(s,a,g), thus making it a fitting neural architecture for GCRL. We conduct comprehensive experiments across 12 standard benchmark environments in GCRL. The empirical results demonstrate that MRN uniformly outperforms other state-of-the-art GCRL neural architectures in terms of sample efficiency.
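    The decomposition described above, Q as the negated sum of a symmetric metric and an asymmetric residual, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the fixed functions `phi_sym` and `phi_asym` stand in for learned networks, inputs are plain 2D points rather than state-action and goal encodings, and the particular forms chosen (Euclidean distance between embeddings, and a maximum over rectified coordinate differences) are one way to obtain components that each satisfy the triangle inequality.

```python
import math


def phi_sym(x):
    # Hypothetical fixed embedding standing in for a learned network.
    return [x[0] + 0.5 * x[1], math.tanh(x[1]), x[0] * x[0]]


def phi_asym(x):
    # Second hypothetical embedding used by the asymmetric residual.
    return [x[0] - x[1], 0.3 * x[0] + x[1] ** 3]


def d_sym(x, y):
    # Symmetric part: Euclidean distance between embeddings.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phi_sym(x), phi_sym(y))))


def d_asym(x, y):
    # Asymmetric part: largest positive coordinate gap, so in general
    # d_asym(x, y) != d_asym(y, x), yet the triangle inequality holds.
    return max(max(a - b, 0.0) for a, b in zip(phi_asym(x), phi_asym(y)))


def q_value(x, g):
    # Negated sum of the symmetric and asymmetric components.
    return -(d_sym(x, g) + d_asym(x, g))
```

    Because both components satisfy the triangle inequality, so does their sum, which means -q_value(x, z) never exceeds -q_value(x, y) - q_value(y, z) for any intermediate point y.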
    Predicting the Masses of Exotic Hadrons with Data Augmentation Using Multilayer Perceptron. (arXiv:2208.09538v2 [hep-ph] UPDATED)
    Recently, there have been significant developments in neural networks, which led to the frequent use of neural networks in the physics literature. This work is focused on predicting the masses of exotic hadrons, doubly charmed and bottomed baryons using neural networks trained on meson and baryon masses that are determined by experiments. The original data set has been extended using the recently proposed artificial data augmentation methods. We have observed that the neural network's predictive ability increases with the use of augmented data. The results indicated that data augmentation techniques play an essential role in improving neural network predictions; moreover, neural networks can make reasonable predictions for exotic hadrons, doubly charmed, and doubly bottomed baryons. The results are also comparable to Gaussian Process and Constituent Quark Model.
    Generating Synthetic Clinical Data that Capture Class Imbalanced Distributions with Generative Adversarial Networks: Example using Antiretroviral Therapy for HIV. (arXiv:2208.08655v2 [cs.LG] UPDATED)
    Clinical data usually cannot be freely distributed due to their highly confidential nature and this hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is by generating realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse thus creating outputs of low diversity. This lowers the quality of the synthetic healthcare data, and may cause it to omit patients of minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory to replay latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk, and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v5 [cs.CV] UPDATED)
    Semi-supervised learning is an attractive technique in practical deployments of deep models since it relaxes the dependence on labeled data. It is especially important in the scope of dense prediction because pixel-level annotation requires significant effort. This paper considers semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing only one of the two model instances and preventing the backward pass through the unperturbed instance. We also propose a competitive perturbation model as a composition of geometric warp and photometric jittering. We experiment with efficient models due to their importance for real-time and low-power applications. Our experiments show clear advantages of (1) one-way consistency, (2) perturbing only the student branch, and (3) strong photometric and geometric perturbations. Our perturbation model outperforms recent work and most of the contribution comes from the photometric component. Experiments with additional data from the large coarsely annotated subset of Cityscapes suggest that semi-supervised training can outperform supervised training with the coarse labels.
    Multimodal Frame-Scoring Transformer for Video Summarization. (arXiv:2207.01814v3 [cs.LG] UPDATED)
    As the amount of video content has mushroomed in recent years, automatic video summarization has become useful when we just want to peek at the content of a video. However, there are two underlying limitations in the generic video summarization task. First, most previous approaches read in just visual features as input, leaving other modality features behind. Second, existing datasets for generic video summarization are relatively insufficient to train a caption generator used for extracting text information from a video and to train the multimodal feature extractors. To address these two problems, this paper proposes the Multimodal Frame-Scoring Transformer (MFST), a framework exploiting visual, text, and audio features and scoring a video with respect to frames. Our MFST framework first extracts the features of each modality (audio-visual-text) using pretrained encoders. Then, MFST trains the multimodal frame-scoring transformer that uses multimodal representation based on extracted features as inputs and predicts frame-level scores. Our extensive experiments with previous models and ablation studies on TVSum and SumMe datasets demonstrate the effectiveness and superiority of our proposed method by a large margin in both F1 score and Rank-based evaluation.
    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for network models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.
    LaF: Labeling-Free Model Selection for Automated Deep Neural Network Reusing. (arXiv:2204.03994v2 [cs.LG] UPDATED)
    Applying deep learning to science is a new trend in recent years which leads DL engineering to become an important problem. Although training data preparation, model architecture design, and model training are the normal processes to build DL models, all of them are complex and costly. Therefore, reusing the open-sourced pre-trained model is a practical way to bypass this hurdle for developers. Given a specific task, developers can collect massive pre-trained deep neural networks from public sources for re-using. However, testing the performance (e.g., accuracy and robustness) of multiple DNNs and recommending which model should be used is challenging regarding the scarcity of labeled data and the demand for domain expertise. In this paper, we propose a labeling-free (LaF) model selection approach to overcome the limitations of labeling efforts for automated model reusing. The main idea is to statistically learn a Bayesian model to infer the models' specialty only based on predicted labels. We evaluate LaF using 9 benchmark datasets including image, text, and source code, and 165 DNNs, considering both the accuracy and robustness of models. The experimental results demonstrate that LaF outperforms the baseline methods by up to 0.74 and 0.53 on Spearman's correlation and Kendall's $\tau$, respectively.
    CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models. (arXiv:2212.01282v2 [eess.AS] UPDATED)
    Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer-based models such as HuBERT, which consist of a feature extractor and transformer layers, are leading the field in the speech domain. SSL models are fine-tuned on a wide range of downstream tasks, which involves re-training the majority of the model for each task. Previous studies have introduced applying adapters, which are small lightweight modules commonly used in Natural Language Processing (NLP) to adapt pre-trained models to new tasks. However, such efficient tuning techniques only provide adaptation at the transformer layer, and fail to perform adaptation at the feature extractor. In this paper, we propose CHAPTER, an efficient tuning method specifically designed for SSL speech models, which applies CNN adapters at the feature extractor. Using this method, we need to fine-tune fewer than 5% of parameters per task compared to full fine-tuning, while achieving better and more stable performance. We empirically found that adding CNN adapters to the feature extractor can help the adaptation on emotion and speaker tasks. For instance, the accuracy of SID is improved from 87.71 to 91.56, and the accuracy of ER is improved by 5%.
    StratDef: Strategic Defense Against Adversarial Attacks in ML-based Malware Detection. (arXiv:2202.07568v4 [cs.LG] UPDATED)
    Over the years, most research towards defenses against adversarial attacks on machine learning models has been in the image recognition domain. The malware detection domain has received less attention despite its importance. Moreover, most work exploring these defenses has focused on several methods but with no strategy when applying them. In this paper, we introduce StratDef, which is a strategic defense system based on a moving target defense approach. We overcome challenges related to the systematic construction, selection, and strategic use of models to maximize adversarial robustness. StratDef dynamically and strategically chooses the best models to increase the uncertainty for the attacker while minimizing critical aspects in the adversarial ML domain, like attack transferability. We provide the first comprehensive evaluation of defenses against adversarial attacks on machine learning for malware detection, where our threat model explores different levels of threat, attacker knowledge, capabilities, and attack intensities. We show that StratDef performs better than other defenses even when facing the peak adversarial threat. We also show that, of the existing defenses, only a few adversarially-trained models provide substantially better protection than just using vanilla models but are still outperformed by StratDef.
    Enforcing the consensus between Trajectory Optimization and Policy Learning for precise robot control. (arXiv:2209.09006v2 [cs.RO] UPDATED)
    Reinforcement learning (RL) and trajectory optimization (TO) present strong complementary advantages. On one hand, RL approaches are able to learn global control policies directly from data, but generally require large sample sizes to properly converge towards feasible policies. On the other hand, TO methods are able to exploit gradient-based information extracted from simulators to quickly converge towards a locally optimal control trajectory which is only valid within the vicinity of the solution. Over the past decade, several approaches have aimed to adequately combine the two classes of methods in order to obtain the best of both worlds. Following on from this line of research, we propose several improvements on top of these approaches to learn global control policies quicker, notably by leveraging sensitivity information stemming from TO methods via Sobolev learning, and augmented Lagrangian techniques to enforce the consensus between TO and policy learning. We evaluate the benefits of these improvements on various classical tasks in robotics through comparison with existing approaches in the literature.
    Self-Play and Self-Describe: Policy Adaptation with Vision-Language Foundation Models. (arXiv:2212.07398v2 [cs.LG] UPDATED)
    Recent progress on vision-language foundation models has brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. To adapt the policy to unseen tasks and environments, we explore a new paradigm for leveraging the pre-trained foundation models with Self-PLAY and Self-Describe (SPLAYD). When deploying the trained policy to a new task or a new environment, we first let the policy self-play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to accurately self-describe (i.e., re-label or classify) the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show SPLAYD improves baselines by a large margin in all cases. Our project page is available at https://geyuying.github.io/SPLAYD/
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Evaluating the Evaluators: Which UDA validation methods are most effective? Can they be improved?. (arXiv:2208.07360v2 [cs.CV] UPDATED)
    This paper compares and ranks 8 UDA validation methods. Validators estimate model accuracy, which makes them an essential component of any UDA train-test pipeline. We rank these validators to indicate which of them are most useful for the purpose of selecting optimal model checkpoints and hyperparameters. To the best of our knowledge, this large-scale benchmark study is the first of its kind in the UDA field. In addition, we propose three new validators that outperform all the existing checkpoint-based validators that we were able to find in the existing literature. Code is available at https://www.github.com/KevinMusgrave/powerful-benchmarker.
    DIAMOND: Taming Sample and Communication Complexities in Decentralized Bilevel Optimization. (arXiv:2212.02376v4 [cs.LG] UPDATED)
    Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ in sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.
    AccDecoder: Accelerated Decoding for Neural-enhanced Video Analytics. (arXiv:2301.08664v1 [cs.CV])
    The quality of the video stream is key to neural network-based video analytics. However, low-quality video is inevitably collected by existing surveillance systems because of poor quality cameras or over-compressed/pruned video streaming protocols, e.g., as a result of upstream bandwidth limits. To address this issue, existing studies use quality enhancers (e.g., neural super-resolution) to improve the quality of videos (e.g., resolution) and eventually ensure inference accuracy. Nevertheless, directly applying quality enhancers does not work in practice because it introduces unacceptable latency. In this paper, we present AccDecoder, a novel accelerated decoder for real-time and neural-enhanced video analytics. AccDecoder can select a few frames adaptively via Deep Reinforcement Learning (DRL) to enhance the quality by neural super-resolution and then up-scale the unselected frames that reference them, which leads to a 6-21% accuracy improvement. AccDecoder provides efficient inference capability via filtering important frames using DRL for DNN-based inference and reusing the results for the other frames via extracting the reference relationship among frames and blocks, which results in a latency reduction of 20-80% relative to baselines.
    Interpretable bilinear attention network with domain adaptation improves drug-target prediction. (arXiv:2208.02194v2 [cs.LG] UPDATED)
    Predicting drug-target interaction is key for drug discovery. Recent deep learning-based methods show promising performance but two challenges remain: (i) how to explicitly model and learn local interactions between drugs and targets for better prediction and interpretation; (ii) how to generalize prediction performance on novel drug-target pairs from a different distribution. In this work, we propose DrugBAN, a deep bilinear attention network (BAN) framework with domain adaptation to explicitly learn pair-wise local interactions between drugs and targets, and adapt to out-of-distribution data. DrugBAN works on drug molecular graphs and target protein sequences to perform prediction, with conditional domain adversarial learning to align learned interaction representations across different distributions for better generalization on novel drug-target pairs. Experiments on three benchmark datasets under both in-domain and cross-domain settings show that DrugBAN achieves the best overall performance against five state-of-the-art baselines. Moreover, visualizing the learned bilinear attention map provides interpretable insights from prediction results.
    Online Decision Making for Trading Wind Energy. (arXiv:2209.02009v2 [cs.LG] UPDATED)
    This paper proposes and develops a new algorithm for trading wind energy in electricity markets, within an online learning and optimization framework. In particular, we combine a component-wise adaptive variant of the gradient descent algorithm with recent advances in the feature-driven newsvendor model. This results in an online offering approach capable of leveraging data-rich environments, while adapting to non-stationary characteristics of energy generation and electricity markets, and with a minimal computational burden. The performance of our approach is analyzed based on several numerical experiments, showing both better adaptability to non-stationary uncertain parameters and significant economic gains.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In the sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In the heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in a 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results in some respects. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature a general sensing vector that is sub-Gaussian or even heavy-tailed. We are also the first to investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represents the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
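    The three steps named above (truncation, dithering, quantization) can be sketched for the simplest possible task, estimating a mean from 1-bit data. This is an illustrative sketch, not the paper's estimators: it assumes the truncation threshold and the dither range share a single hypothetical parameter `delta`, and it relies on the standard fact that for |x| <= delta and a dither tau uniform on [-delta, delta], the expectation of delta * sign(x + tau) equals x.

```python
import random


def dithered_one_bit(x, delta, rng):
    # Truncation: clip the sample to [-delta, delta] (assumed threshold).
    x_trunc = max(-delta, min(delta, x))
    # Dithering: add uniform noise on [-delta, delta].
    tau = rng.uniform(-delta, delta)
    # Quantization: keep only the sign, i.e. a single bit.
    return 1 if x_trunc + tau >= 0 else -1


def estimate_mean(samples, delta, seed=0):
    rng = random.Random(seed)
    bits = [dithered_one_bit(x, delta, rng) for x in samples]
    # Rescaling by delta makes each bit an unbiased estimate of its
    # truncated sample, so the average estimates the truncated mean.
    return delta * sum(bits) / len(bits)
```

    For example, `estimate_mean([0.1, 0.5] * 100000, delta=1.0)` should land close to 0.3 even though only one bit per sample is retained. A larger `delta` reduces truncation bias but increases the variance of each rescaled bit.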
    On Image Segmentation With Noisy Labels: Characterization and Volume Properties of the Optimal Solutions to Accuracy and Dice. (arXiv:2206.06484v3 [cs.CV] UPDATED)
    We study two of the most popular performance metrics in medical image segmentation, Accuracy and Dice, when the target labels are noisy. For both metrics, several statements related to characterization and volume properties of the set of optimal segmentations are proved, and associated experiments are provided. Our main insights are: (i) the volume of the solutions to both metrics may deviate significantly from the expected volume of the target, (ii) the volume of a solution to Accuracy is always less than or equal to the volume of a solution to Dice and (iii) the optimal solutions to both of these metrics coincide when the set of feasible segmentations is constrained to the set of segmentations with the volume equal to the expected volume of the target.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Most of the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has been already considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Asynchronous, Option-Based Multi-Agent Policy Gradient: A Conditional Reasoning Approach. (arXiv:2203.15925v2 [cs.RO] UPDATED)
    Multi-agent policy gradient methods have demonstrated success in games and robotics but are often limited to problems with low-level action spaces. However, when agents take higher-level, temporally-extended actions (i.e. options), it becomes a challenge to decide when and how to derive a centralized control policy and its gradient, and to sample options for all agents without interrupting current option executions. This is mostly because agents may choose and terminate their options asynchronously. In this work, we propose a conditional reasoning approach to address this problem, and empirically validate its effectiveness on representative option-based multi-agent cooperative tasks.
    Coupled Physics-informed Neural Networks for Inferring Solutions of Partial Differential Equations with Unknown Source Terms. (arXiv:2301.08618v1 [cs.LG])
    Physics-informed neural networks (PINNs) provide a transformative development for approximating the solutions to partial differential equations (PDEs). This work proposes a coupled physics-informed neural network (C-PINN) for nonhomogeneous PDEs with unknown dynamical source terms, which are used to describe systems with external forces and cannot be well approximated by existing PINNs. In our method, two neural networks, NetU and NetG, are proposed. NetU is constructed to generate a quasi-solution satisfying the PDEs under study. NetG is used to regularize the training of NetU. Then, the two networks are integrated into a data-physics-hybrid cost function. Finally, we propose a hierarchical training strategy to optimize and couple the two networks. The performance of C-PINN is demonstrated by approximating several classical PDEs.
    STORM-GAN: Spatio-Temporal Meta-GAN for Cross-City Estimation of Human Mobility Responses to COVID-19. (arXiv:2301.08648v1 [cs.LG])
    Human mobility estimation is crucial during the COVID-19 pandemic due to its significant guidance for policymakers to make non-pharmaceutical interventions. While deep learning approaches outperform conventional estimation techniques on tasks with abundant training data, the continuously evolving pandemic poses a significant challenge to solving this problem due to data nonstationarity, limited observations, and complex social contexts. Prior works on mobility estimation either focus on a single city or lack the ability to model the spatio-temporal dependencies across cities and time periods. To address these issues, we make the first attempt to tackle the cross-city human mobility estimation problem through a deep meta-generative framework. We propose a Spatio-Temporal Meta-Generative Adversarial Network (STORM-GAN) model that estimates dynamic human mobility responses under a set of social and policy conditions related to COVID-19. Facilitated by a novel spatio-temporal task-based graph (STTG) embedding, STORM-GAN is capable of learning shared knowledge from a spatio-temporal distribution of estimation tasks and quickly adapting to new cities and time periods with limited training samples. The STTG embedding component is designed to capture the similarities among cities to mitigate cross-task heterogeneity. Experimental results on real-world data show that the proposed approach can greatly improve estimation performance and out-perform baselines.
    Learning Sequential Latent Variable Models from Multimodal Time Series Data. (arXiv:2204.10419v2 [cs.LG] UPDATED)
    Sequential modelling of high-dimensional data is an important problem that appears in many domains including model-based reinforcement learning and dynamics identification for control. Latent variable models applied to sequential data (i.e., latent dynamics models) have been shown to be a particularly effective probabilistic approach to solve this problem, especially when dealing with images. However, in many application areas (e.g., robotics), information from multiple sensing modalities is available -- existing latent dynamics methods have not yet been extended to effectively make use of such multimodal sequential data. Multimodal sensor streams can be correlated in a useful manner and often contain complementary information across modalities. In this work, we present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data and the respective dynamics. Using synthetic and real-world datasets from a multimodal robotic planar pushing task, we demonstrate that our approach leads to significant improvements in prediction and representation quality. Furthermore, we compare to the common learning baseline of concatenating each modality in the latent space and show that our principled probabilistic formulation performs better. Finally, despite being fully self-supervised, we demonstrate that our method is nearly as effective as an existing supervised approach that relies on ground truth labels.
    Source-free Subject Adaptation for EEG-based Visual Recognition. (arXiv:2301.08448v1 [eess.SP])
    This paper focuses on subject adaptation for EEG-based visual recognition. It aims at building a visual stimuli recognition system customized for the target subject whose EEG samples are limited, by transferring knowledge from abundant data of source subjects. Existing approaches consider the scenario that samples of source subjects are accessible during training. However, it is often infeasible and problematic to access personal biological data like EEG signals due to privacy issues. In this paper, we introduce a novel and practical problem setup, namely source-free subject adaptation, where the source subject data are unavailable and only the pre-trained model parameters are provided for subject adaptation. To tackle this challenging problem, we propose classifier-based data generation to simulate EEG samples from source subjects using classifier responses. Using the generated samples and target subject data, we perform subject-independent feature learning to exploit the common knowledge shared across different subjects. Notably, our framework is generalizable and can adopt any subject-independent learning method. In the experiments on the EEG-ImageNet40 benchmark, our model brings consistent improvements regardless of the choice of subject-independent learning. Also, our method shows promising performance, recording top-1 test accuracy of 74.6% under the 5-shot setting even without relying on source data. Our code can be found at https://github.com/DeepBCI/Deep-BCI/tree/master/1_Intelligent_BCI/Source_Free_Subject_Adaptation_for_EEG.
    SamBaS: Sampling-Based Stochastic Block Partitioning. (arXiv:2108.06651v2 [cs.SI] UPDATED)
    Community detection is a well-studied problem with applications in domains ranging from networking to bioinformatics. Due to the rapid growth in the volume of real-world data, there is growing interest in accelerating contemporary community detection algorithms. However, the more accurate and statistically robust methods tend to be hard to parallelize. One such method is stochastic block partitioning (SBP), a community detection algorithm that works well on graphs with complex and heterogeneous community structure. In this paper, we present a sampling-based SBP (SamBaS) for accelerating SBP on sparse graphs. We characterize how various graph parameters affect the speedup and result quality of community detection with SamBaS and quantify the trade-offs therein. To evaluate SamBaS on real-world web graphs without known ground-truth communities, we introduce partition quality score (PQS), an evaluation metric that outperforms modularity in terms of correlation with F1 score. Overall, SamBaS achieves speedups of up to 10X while maintaining result quality (and even improving result quality by over 150% on certain graphs, relative to F1 score).
    Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning. (arXiv:2301.08502v1 [cs.LG])
    In Model-based Reinforcement Learning (MBRL), model learning is critical since an inaccurate model can bias policy learning by generating misleading samples. However, learning an accurate model can be difficult since the policy is continually updated and the induced distribution over visited states used for model learning shifts accordingly. Prior methods alleviate this issue by quantifying the uncertainty of model-generated samples. However, these methods only quantify the uncertainty passively after the samples are generated, rather than foreseeing the uncertainty before model trajectories fall into those highly uncertain regions. The resulting low-quality samples can induce unstable learning targets and hinder the optimization of the policy. Moreover, although the model is trained to minimize one-step prediction errors, it is generally used to predict multiple steps, leading to a mismatch between the objectives of model learning and model usage. To this end, we propose Plan To Predict (P2P), an MBRL framework that treats the model rollout process as a sequential decision making problem by reversing the usual roles: the model is the decision maker and the current policy is the dynamics. In this way, the model can quickly adapt to the current policy and foresee the multi-step future uncertainty when generating trajectories. Theoretically, we show that the performance of P2P can be guaranteed by approximately optimizing a lower bound of the true environment return. Empirical results demonstrate that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
    Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning. (arXiv:2301.08442v1 [cs.LG])
    We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected returns discounted over the time horizon. One of the major policy gradient biases is the state distribution shift: the state distribution used to estimate the gradients differs from the theoretical formulation in that it does not take into account the discount factor. Existing discussion of the influence of this bias was limited to the tabular and softmax cases in the literature. Therefore, in this paper, we extend it to the DRL setting where the policy is parameterized and demonstrate how this bias can lead to suboptimal policies theoretically. We then discuss why the empirically inaccurate implementations with shifted state distribution can still be effective. We show that, despite such state distribution shift, the policy gradient estimation bias can be reduced in the following three ways: 1) a small learning rate; 2) an adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically, we show that a smaller learning rate, or an adaptive learning rate, such as that used by the Adam and RMSProp optimizers, makes the policy optimization robust to the bias. We further draw connections between optimizers and the optimization regularization to show that both the KL and the reverse KL regularization can significantly rectify this bias. Moreover, we provide extensive experiments on continuous control tasks to support our analysis. Our paper sheds light on how successful PG algorithms optimize policies in the DRL setting, and contributes insights into the practical issues in DRL.
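    The state distribution shift described in this abstract can be illustrated with a small sketch. It assumes the standard REINFORCE-style estimator, in which the log-probability gradient at step t is weighted by the return-to-go G_t; the theoretical discounted objective additionally scales step t by gamma^t, a factor that common implementations drop. The function names are hypothetical and are not taken from the paper.

```python
def returns_to_go(rewards, gamma):
    """Discounted return-to-go G_t = r_t + gamma * G_{t+1} for each step t."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def gradient_weights(rewards, gamma, discount_states=True):
    """Per-step weights applied to grad log pi(a_t | s_t).

    discount_states=True  -> gamma**t * G_t (matches the discounted objective)
    discount_states=False -> G_t only (the common, biased shortcut)
    """
    g = returns_to_go(rewards, gamma)
    if discount_states:
        return [gamma ** t * g_t for t, g_t in enumerate(g)]
    return g


if __name__ == "__main__":
    rewards = [1.0, 1.0, 1.0]
    print(gradient_weights(rewards, 0.5, discount_states=True))
    print(gradient_weights(rewards, 0.5, discount_states=False))
```

    With the gamma^t factor, later states contribute less to the gradient; without it, every visited state is weighted as if it were sampled from the undiscounted distribution.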
    Promises and pitfalls of deep neural networks in neuroimaging-based psychiatric research. (arXiv:2301.08525v1 [cs.LG])
    By promising more accurate diagnostics and individual treatment recommendations, deep neural networks, and in particular convolutional neural networks, have become a powerful tool in medical imaging. Here, we first give an introduction to key methodological concepts and the resulting methodological promises, including representation and transfer learning, as well as modelling domain-specific priors. After reviewing recent applications within neuroimaging-based psychiatric research, such as the diagnosis of psychiatric diseases, delineation of disease subtypes, normative modeling, and the development of neuroimaging biomarkers, we discuss current challenges. These include, for example, the difficulty of training models on small, heterogeneous and biased data sets, the lack of validity of clinical labels, algorithmic bias, and the influence of confounding variables.
    Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. (arXiv:2202.12119v2 [cs.LG] UPDATED)
    Convolutional neural networks have shown impressive abilities in many applications, especially those related to classification tasks. However, for the regression problem, the abilities of convolutional structures have not been fully understood, and further investigation is needed. In this paper, we consider the mean squared error analysis for deep convolutional neural networks. We show that, for additive ridge functions, convolutional neural networks followed by one fully connected layer with ReLU activation functions can reach optimal minimax rates (up to a log factor). The input dimension only appears in the constant of convergence rates. This work shows the statistical optimality of convolutional neural networks and may shed light on why convolutional neural networks are able to behave well for high dimensional input.
    Predicting Surface Texture in Steel Manufacturing at Speed. (arXiv:2301.08527v1 [cs.LG])
    Control of the surface texture of steel strip during the galvanizing and temper rolling processes is essential to satisfy customer requirements and is conventionally measured post-production using a stylus. In-production laser reflection measurement is less consistent than physical measurement but enables real-time adjustment of processing parameters to optimize product surface characteristics. We propose the use of machine learning to improve the accuracy of the transformation from inline laser reflection measurements to a prediction of surface properties. In addition to accuracy, model evaluation speed is important for fast feedback control. The ROCKET model is one of the fastest state-of-the-art models; however, it can be sped up by utilizing a GPU. Our contribution is to implement the model in PyTorch for fast GPU kernel transforms and provide a soft version of the Proportion of Positive Values (PPV) nonlinear pooling function, allowing gradient flow. We perform timing and performance experiments comparing the implementations.
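    To make the "soft PPV" idea concrete, here is a minimal sketch. Hard PPV counts the fraction of strictly positive values, which has zero gradient almost everywhere. One common smooth surrogate, assumed here for illustration, replaces the step function with a sigmoid. The abstract does not state the authors' exact formulation, so the `sharpness` parameter and function names are hypothetical.

```python
import math


def ppv(values):
    """Hard PPV: fraction of strictly positive values (not differentiable)."""
    return sum(1 for v in values if v > 0) / len(values)


def soft_ppv(values, sharpness=10.0):
    """Smooth surrogate: mean of sigmoid(sharpness * v).

    The sigmoid is written via tanh so it stays numerically stable
    for large |sharpness * v|.
    """
    total = 0.0
    for v in values:
        total += 0.5 * (1.0 + math.tanh(0.5 * sharpness * v))
    return total / len(values)


if __name__ == "__main__":
    activations = [-2.0, -1.0, 1.0, 3.0]
    print(ppv(activations), soft_ppv(activations, sharpness=50.0))
```

    As `sharpness` grows, the soft value approaches the hard PPV for inputs away from zero, while remaining differentiable with respect to the inputs.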
    NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis. (arXiv:2301.08556v1 [cs.LG])
    Expert demonstrations are a rich source of supervision for training visual robotic manipulation policies, but imitation learning methods often require either a large number of demonstrations or expensive online expert supervision to learn reactive closed-loop behaviors. In this work, we introduce SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a fully-offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations, using NeRFs to generate perturbed viewpoints while simultaneously calculating the corrective actions. This requires no additional expert supervision or environment interaction, and distills the geometric information in NeRFs into a real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping benchmark, SPARTN improves success rates by 2.8× over imitation learning without the corrective augmentations and even outperforms some methods that use online supervision. It additionally closes the gap between RGB-only and RGB-D success rates, eliminating the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments from limited human demonstrations, our method improves absolute success rates by 22.5% on average, including objects that are traditionally challenging for depth-based methods. See video results at https://bland.website/spartn.
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    ILLUME: Rationalizing Vision-Language Models through Human Interactions. (arXiv:2208.08241v3 [cs.LG] UPDATED)
    Bootstrapping from pre-trained language models has been proven to be an efficient approach for building vision-language models (VLM) for tasks such as image captioning or visual question answering. However, outputs of these models rarely align with users' rationales for specific answers. In order to improve this alignment and reinforce commonsense reasons, we propose a tuning paradigm based on human interactions with machine-generated data. Our ILLUME executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities that are aligned with human intent. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and only requiring minimal feedback.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that, surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore the first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Introducing Expertise Logic into Graph Representation Learning from A Causal Perspective. (arXiv:2301.08496v1 [cs.LG])
    Benefiting from the injection of human prior knowledge, graphs, as derived discrete data, are semantically dense so that models can efficiently learn the semantic information from such data. Accordingly, graph neural networks (GNNs) have achieved impressive success in various fields. Revisiting the GNN learning paradigms, we find that the relationship between human expertise and the knowledge modeled by GNNs remains unclear to researchers. To this end, we introduce motivating experiments and derive an empirical observation that human expertise is gradually learned by GNNs in general domains. By further observing the ramifications of introducing expertise logic into graph representation learning, we conclude that leading GNNs to learn human expertise can improve model performance. To explore the intrinsic mechanism behind such observations, we develop a Structural Causal Model for the graph representation learning paradigm. Following this theoretical guidance, we introduce an auxiliary causal logic learning paradigm that helps the model learn the expertise logic causally related to the graph representation learning task. In practice, a counterfactual technique is further applied to tackle the insufficient training issue during optimization. Extensive experiments on crafted and real-world domains support the consistent effectiveness of the proposed method.
    Interpretability Study on Deep Learning for Jet Physics at the Large Hadron Collider. (arXiv:1911.01872v1 [hep-ph] CROSS LISTED)
    Using deep neural networks for identifying physics objects at the Large Hadron Collider (LHC) has become a powerful alternative approach in recent years. After successful training of deep neural networks, examining the trained networks not only helps us understand the behaviour of neural networks, but also helps improve the performance of deep learning models through proper interpretation. We take the jet tagging problem at the LHC as an example and, using recursive neural networks as a starting point, aim at a thorough understanding of the behaviour of the physics-oriented DNNs and the information encoded in the embedding space. We make a comparative study on a series of different jet tagging tasks dominated by different underlying physics. Interesting observations on the latent space are obtained.
    Who Should I Engage with At What Time? A Missing Event Aware Temporal Graph Neural Network. (arXiv:2301.08399v1 [cs.LG])
    Temporal graph neural networks have recently received significant attention due to their wide application scenarios, such as bioinformatics, knowledge graphs, and social networks. Some temporal graph neural networks achieve remarkable results. However, these works focus on future event prediction and are performed under the assumption that all historical events are observable. In real-world applications, events are not always observable, and estimating event time is as important as predicting future events. In this paper, we propose MTGN, a missing event-aware temporal graph neural network, which uniformly models evolving graph structure and timing of events to support predicting what will happen in the future and when it will happen. MTGN models the dynamics of both observed and missing events as two coupled temporal point processes, thereby incorporating the effects of missing events into the network. Experimental results on several real-world temporal graphs demonstrate that MTGN significantly outperforms existing methods with up to 89% and 112% more accurate time and link prediction. Code can be found at https://github.com/HIT-ICES/TNNLS-MTGN.  ( 2 min )
    Pneumonia Detection in Chest X-Ray Images : Handling Class Imbalance. (arXiv:2301.08479v1 [eess.IV])
    People all over the globe are affected by pneumonia, but deaths due to it are highest in Sub-Saharan Africa and South Asia. In recent years, the overall incidence and mortality rate of pneumonia have escalated despite the use of effective vaccines and antibiotics. Thus, pneumonia remains a disease that needs prompt prevention and treatment. The widespread prevalence of pneumonia has led the research community to develop frameworks that help detect, diagnose and analyze diseases accurately and promptly. One of the major hurdles faced by the Artificial Intelligence (AI) research community is the lack of publicly available datasets for chest diseases, including pneumonia. Secondly, the few available datasets are highly imbalanced (normal examples are over-represented, while samples with the ailment are in a severe minority), making the problem even more challenging. In this article we present a novel framework for the detection of pneumonia. The novelty of the proposed methodology lies in how it tackles the class imbalance problem. A Generative Adversarial Network (GAN), specifically a combination of Deep Convolutional Generative Adversarial Network (DCGAN) and Wasserstein GAN gradient penalty (WGAN-GP), was applied to the minority class "Pneumonia" for augmentation, whereas Random Under-Sampling (RUS) was applied to the majority class "No Findings" to deal with the imbalance problem. The ChestX-Ray8 dataset, one of the biggest datasets, is used to validate the performance of the proposed framework. The learning phase is completed using transfer learning on state-of-the-art deep learning models, i.e. ResNet-50, Xception, and VGG-16. Results obtained exceed state-of-the-art.  ( 2 min )
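    The Random Under-Sampling (RUS) step mentioned in this abstract can be sketched in a few lines. This is a generic illustration and not the authors' pipeline: the function name, the `target_size` parameter and the toy data are assumptions, and the GAN-based augmentation of the minority class is not shown.

```python
import random


def random_under_sample(samples, labels, majority_label, target_size, seed=0):
    """Keep a random subset of `target_size` majority-class items.

    All items from other classes are kept unchanged and the original
    ordering of the retained items is preserved.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_keep = min(target_size, len(majority_idx))
    keep_majority = set(rng.sample(majority_idx, n_keep))
    kept = [
        i for i, y in enumerate(labels)
        if y != majority_label or i in keep_majority
    ]
    return [samples[i] for i in kept], [labels[i] for i in kept]


if __name__ == "__main__":
    ys = ["No Findings"] * 8 + ["Pneumonia"] * 2
    xs = list(range(len(ys)))
    xs_bal, ys_bal = random_under_sample(xs, ys, "No Findings", target_size=2)
    print(ys_bal.count("No Findings"), ys_bal.count("Pneumonia"))
```

    Under-sampling discards majority-class data, which is why it is often paired with augmentation of the minority class, as the abstract describes.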
    Feature Relevance Analysis to Explain Concept Drift -- A Case Study in Human Activity Recognition. (arXiv:2301.08453v1 [cs.LG])
    This article studies how to detect and explain concept drift. Human activity recognition is used as a case study together with an online batch learning situation where the quality of the labels used in the model updating process starts to decrease. Drift detection is based on identifying a set of features having the largest relevance difference between the drifting model and a model that is known to be accurate, and monitoring how the relevance of these features changes over time. As a main result of this article, it is shown that feature relevance analysis can be used not only to detect concept drift but also to explain the reason for the drift when a limited number of typical reasons for the concept drift are predefined. To explain the reason for the concept drift, we study how these predefined reasons affect feature relevance. In fact, it is shown that each of these has a unique effect on feature relevance, and these effects can be used to explain the reason for concept drift.  ( 2 min )
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensor data can be useful for RUL estimation with only Few-Shot Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used, making it possible to extensively investigate the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in the downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.  ( 2 min )
    Optimization of body configuration and joint-driven attitude stabilization for transformable spacecrafts under solar radiation pressure. (arXiv:2301.08435v1 [cs.LG])
    A solar sail is one of the most promising space exploration systems because of its theoretically infinite specific impulse using solar radiation pressure (SRP). Recently, some researchers proposed "transformable spacecrafts" that can actively reconfigure their body configurations with actuatable joints. The transformable spacecrafts are expected to greatly enhance orbit and attitude control capability due to their high redundancy in control degrees of freedom if they are used as solar sails. However, their large number of inputs poses difficulties in control, and therefore previous researchers imposed strong constraints that limit their potential control capabilities. This paper addresses novel attitude control techniques for the transformable spacecrafts under SRP. The authors have constructed two methods: one is a joint angle optimization to acquire arbitrary SRP force and torque, and the other is a momentum damping control driven by joint angle actuation. Our proposed methods are formulated in general forms and are applicable to any transformable solar sail that consists of flat and thin body components. The validity of the proposed methods is confirmed by numerical simulations. This paper contributes to making the most of the high control redundancy of transformable solar sails without consuming any expendable propellants, which is expected to greatly enhance orbit and attitude control capability.  ( 2 min )
    Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences. (arXiv:2301.08571v1 [cs.CL])
    Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent and have more narrativity compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and have more narrativity than stories generated with the current state-of-the-art model.
    Generative Slate Recommendation with Reinforcement Learning. (arXiv:2301.08632v1 [cs.IR])
    Recent research has employed reinforcement learning (RL) algorithms to optimize long-term user engagement in recommender systems, thereby avoiding common pitfalls such as user boredom and filter bubbles. They capture the sequential and interactive nature of recommendations, and thus offer a principled way to deal with long-term rewards and avoid myopic behaviors. However, RL approaches are intractable in the slate recommendation scenario - where a list of items is recommended at each interaction turn - due to the combinatorial action space. In that setting, an action corresponds to a slate that may contain any combination of items. While previous work has proposed well-chosen decompositions of actions so as to ensure tractability, these rely on restrictive and sometimes unrealistic assumptions. Instead, in this work we propose to encode slates in a continuous, low-dimensional latent space learned by a variational auto-encoder. Then, the RL agent selects continuous actions in this latent space, which are ultimately decoded into the corresponding slates. By doing so, we are able to (i) relax assumptions required by previous work, and (ii) improve the quality of the action selection by modeling full slates instead of independent items, in particular by enabling diversity. Our experiments performed on a wide array of simulated environments confirm the effectiveness of our generative modeling of slates over baselines in practical scenarios where the restrictive assumptions underlying the baselines are lifted. Our findings suggest that representation learning using generative models is a promising direction towards generalizable RL-based slate recommendation.
    Autonomous Drug Design with Multi-Armed Bandits. (arXiv:2207.01393v2 [cs.LG] UPDATED)
    Recent developments in artificial intelligence and automation support a new drug design paradigm: autonomous drug design. Under this paradigm, generative models can provide suggestions on thousands of molecules with specific properties, and automated laboratories can potentially make, test and analyze molecules with minimal human supervision. However, since still only a limited number of molecules can be synthesized and tested, an obvious challenge is how to efficiently select among provided suggestions in a closed-loop system. We formulate this task as a stochastic multi-armed bandit problem with multiple plays, volatile arms and similarity information. To solve this task, we adapt previous work on multi-armed bandits to this setting, and compare our solution with random sampling, greedy selection and decaying-epsilon-greedy selection strategies. According to our simulation results, our approach has the potential to perform better exploration and exploitation of the chemical space for autonomous drug design.
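    One of the baselines named in this abstract, decaying-epsilon-greedy selection, can be sketched as follows. This is a simplified single-play Bernoulli bandit for illustration only: the multiple plays, volatile arms and similarity information of the paper's setting are not modelled, and the parameter names (`eps0`, `decay`) and reward model are assumptions.

```python
import random


def decaying_epsilon_greedy(arm_means, rounds, eps0=1.0, decay=0.99, seed=0):
    """Run a decaying-epsilon-greedy policy on a Bernoulli bandit.

    With probability eps the policy explores a uniformly random arm;
    otherwise it exploits the arm with the highest estimated mean.
    eps is multiplied by `decay` after every round.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    eps = eps0
    for _ in range(rounds):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
        eps *= decay
    return counts, estimates, total_reward


if __name__ == "__main__":
    counts, estimates, total = decaying_epsilon_greedy([0.2, 0.5, 0.8], 500)
    print(counts, [round(e, 2) for e in estimates], total)
```

    Early rounds are dominated by exploration; as eps shrinks, the policy increasingly commits to the arm it currently estimates as best.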
    Ontology Pre-training for Poison Prediction. (arXiv:2301.08577v1 [cs.AI])
    Integrating human knowledge into neural networks has the potential to improve their robustness and interpretability. We have developed a novel approach to integrate knowledge from ontologies into the structure of a Transformer network which we call ontology pre-training: we train the network to predict membership in ontology classes as a way to embed the structure of the ontology into the network, and subsequently fine-tune the network for the particular prediction task. We apply this approach to a case study in predicting the potential toxicity of a small molecule based on its molecular structure, a challenging task for machine learning in life sciences chemistry. Our approach improves on the state of the art, and moreover has several additional benefits. First, we are able to show that the model learns to focus attention on more meaningful chemical groups when making predictions with ontology pre-training than without, paving a path towards greater robustness and interpretability. Second, the training time is reduced after ontology pre-training, indicating that the model is better placed to learn what matters for toxicity prediction with the ontology pre-training than without. This strategy has general applicability as a neuro-symbolic approach to embed meaningful semantics into neural networks.
    Baechi: Fast Device Placement of Machine Learning Graphs. (arXiv:2301.08695v1 [cs.DC])
    Machine Learning graphs (or models) can be challenging or impossible to train when either devices have limited memory, or models are large. To split the model across devices, learning-based approaches are still popular. While these result in model placements that train fast on data (i.e., low step times), learning-based model-parallelism is time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654X to 206KX faster than state-of-the-art learning-based approaches, and (ii) the step (training) time of Baechi-placed models is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that, compared to learning-based approaches, algorithmic approaches can face different challenges for adaptation to machine learning systems, but they also offer proven bounds and significant performance benefits.
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Explainable prediction of Qcodes for NOTAMs using column generation. (arXiv:2208.04955v2 [cs.LG] UPDATED)
    A NOtice To AirMen (NOTAM) contains important flight route related information. To search and filter them, NOTAMs are grouped into categories called Qcodes. In this paper, we develop a tool to predict, with some explanations, a Qcode for a NOTAM. We present a way to extend the interpretable binary classification using column generation proposed in Dash, Gunluk, and Wei (2018) to a multiclass text classification method. We describe the techniques used to tackle the issues related to one-vs-rest classification, such as multiple outputs and class imbalances. Furthermore, we introduce some heuristics, including the use of a CP-SAT solver for the subproblems, to reduce the training time. Finally, we show that our approach compares favorably with state-of-the-art machine learning algorithms like Linear SVM and small neural networks while adding the needed interpretability component.
    FewSOME: Few Shot Anomaly Detection. (arXiv:2301.06957v2 [cs.LG] UPDATED)
    Recent years have seen considerable progress in the field of Anomaly Detection but at the cost of increasingly complex training pipelines. Such techniques require large amounts of training data, resulting in computationally expensive algorithms. We propose Few Shot anomaly detection (FewSOME), a deep One-Class Anomaly Detection algorithm with the ability to accurately detect anomalies having trained on 'few' examples of the normal class and no examples of the anomalous class. We describe FewSOME to be of low complexity given its low data requirement and short training time. FewSOME is aided by pretrained weights with an architecture based on Siamese Networks. By means of an ablation study, we demonstrate how our proposed loss, 'Stop Loss', improves the robustness of FewSOME. Our experiments demonstrate that FewSOME performs at state-of-the-art level on benchmark datasets MNIST, CIFAR-10, F-MNIST and MVTec AD while training on only 30 normal samples, a minute fraction of the data that existing methods are trained on. Most notably, we found that FewSOME outperforms even highly complex models in the setting where only few examples of the normal class exist. Moreover, our extensive experiments show FewSOME to be robust to contaminated datasets. We also report F1 score and Balanced Accuracy in addition to AUC as a benchmark for future techniques to be compared against.
    Automated extraction of capacitive coupling for quantum dot systems. (arXiv:2301.08654v1 [cond-mat.mes-hall])
    Gate-defined quantum dots (QDs) have appealing attributes as a quantum computing platform. However, near-term devices possess a range of possible imperfections that need to be accounted for during the tuning and operation of QD devices. One such problem is the capacitive cross-talk between the metallic gates that define and control QD qubits. A way to compensate for the capacitive cross-talk and enable targeted control of specific QDs independent of coupling is by the use of virtual gates. Here, we demonstrate a reliable automated capacitive coupling identification method that combines machine learning with traditional fitting to take advantage of the desirable properties of each. We also show how the cross-capacitance measurement may be used for the identification of spurious QDs sometimes formed during tuning of experimental devices. Our system can autonomously flag devices with spurious dots near the operating regime, which is crucial information for reliable tuning to a regime suitable for qubit operations.
    Accelerating Multi-Agent Planning Using Graph Transformers with Bounded Suboptimality. (arXiv:2301.08451v1 [cs.AI])
    Conflict-Based Search is one of the most popular methods for multi-agent path finding. Though it is complete and optimal, it does not scale well. Recent works have proposed accelerating it by introducing various heuristics. However, whether these heuristics can apply to non-grid-based problem settings while maintaining their effectiveness remains an open question. In this work, we find that the answer tends to be no. To this end, we propose a learning-based component, i.e., the Graph Transformer, as a heuristic function to accelerate the planning. The proposed method is provably complete and bounded-suboptimal with any desired factor. We conduct extensive experiments on two environments with dense graphs. Results show that the proposed Graph Transformer can be trained in problem instances with relatively few agents and generalizes well to a larger number of agents, while achieving better performance than state-of-the-art methods.  ( 2 min )
    Asynchronously Trained Distributed Topographic Maps. (arXiv:2301.08379v1 [cs.LG])
    Topographic feature maps are low-dimensional representations of data that preserve spatial dependencies. Current methods of training such maps (e.g. self organizing maps - SOM, generative topographic maps) require centralized control and synchronous execution, which restricts scalability. We present an algorithm that uses $N$ autonomous units to generate a feature map by distributed asynchronous training. Unit autonomy is achieved by sparse interaction in time and space through the combination of a distributed heuristic search and a cascade-driven weight updating scheme governed by two rules: a unit i) adapts when it receives either a sample or the weight vector of a neighbor, and ii) broadcasts its weight vector to its neighbors after adapting a predefined number of times. Thus, a vector update can trigger an avalanche of adaptation. We map avalanching to a statistical mechanics model, which allows us to parametrize the statistical properties of cascading. Using MNIST, we empirically investigate the effect of the heuristic search accuracy and the cascade parameters on map quality. We also provide empirical evidence that algorithm complexity scales at most linearly with system size $N$. The proposed approach is found to perform comparably with similar methods in classification tasks across multiple datasets.  ( 2 min )
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms should not only be accurate but also should not treat certain sensitive groups (e.g., non-white, women) unfairly. Various specially designed AI algorithms that ensure trained AI models are fair between sensitive groups have been developed. In this paper, we raise a new issue: between-group fair AI models could treat individuals in the same sensitive group unfairly. We introduce a new concept of fairness, so-called within-group fairness, which requires that AI models be fair for those in the same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy or between-group fairness.  ( 2 min )
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogeneous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5%-20%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.  ( 2 min )
    Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning. (arXiv:2301.08427v1 [cs.CL])
    The Bidirectional Encoder Representations from Transformers (BERT) model was proposed for natural language processing (NLP) and shows promising results. Recently, researchers have applied BERT to source-code representation learning and reported good results on several downstream tasks. However, in this paper, we illustrate that current methods cannot effectively understand the logic of source code. The representation of source code relies heavily on programmer-defined variable and function names. We design and implement a set of experiments to demonstrate our conjecture and provide some insights for future work.  ( 2 min )
    Clustering Human Mobility with Multiple Spaces. (arXiv:2301.08524v1 [cs.LG])
    Human mobility clustering is an important problem for understanding human mobility behaviors (e.g., work and school commutes). Existing methods typically contain two steps: choosing or learning a mobility representation and applying a clustering algorithm to the representation. However, these methods rely on strict visiting orders in trajectories and cannot take advantage of multiple types of mobility representations. This paper proposes a novel mobility clustering method for mobility behavior detection. First, the proposed method contains a permutation-equivalent operation to handle sub-trajectories that might have different visiting orders but similar impacts on mobility behaviors. Second, the proposed method utilizes a variational autoencoder architecture to simultaneously perform clustering in both latent and original spaces. Also, in order to handle the bias of a single latent space, our clustering assignment prediction considers multiple learned latent spaces at different epochs. This way, the proposed method produces accurate results and can provide reliability estimates of each trajectory's cluster assignment. The experiment shows that the proposed method outperformed state-of-the-art methods in mobility behavior detection from trajectories with better accuracy and more interpretability.
    Regular Time-series Generation using SGM. (arXiv:2301.08518v1 [cs.LG])
    Score-based generative models (SGMs) have recently attracted considerable attention and are known for outperforming other generative models. Time-series data occur frequently in daily life, e.g., stock data and climate data, and time-series forecasting and classification are popular research topics in machine learning. We therefore apply SGMs to synthesize time-series data by learning conditional score functions. We propose a conditional score network for the time-series generation domain. Furthermore, we derive the loss function between the score matching and the denoising score matching in the time-series generation domain. Finally, we achieve state-of-the-art results on real-world datasets in terms of sampling diversity and quality.
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
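As a rough illustration of the exponentially weighted averaging mentioned in the abstract above, the following minimal sketch computes an exponentially weighted sample covariance in plain Python. The function name `ew_covariance` and the `decay` parameter are hypothetical choices for illustration; the cross-validated nonlinear eigenvalue shrinkage that the paper describes is not implemented here.

```python
def ew_covariance(rows, decay=0.94):
    """Exponentially weighted covariance of observations (oldest row first).

    `decay` is an assumed smoothing parameter in (0, 1]; decay=1.0 reduces
    to the ordinary (population) sample covariance.
    """
    T, n = len(rows), len(rows[0])
    # Most recent observation gets the largest weight; weights sum to 1.
    raw = [decay ** (T - 1 - t) for t in range(T)]
    total = sum(raw)
    w = [r / total for r in raw]
    # Weighted mean of each column.
    mean = [sum(w[t] * rows[t][j] for t in range(T)) for j in range(n)]
    # Weighted average of outer products of the centered observations.
    return [
        [
            sum(w[t] * (rows[t][i] - mean[i]) * (rows[t][j] - mean[j])
                for t in range(T))
            for j in range(n)
        ]
        for i in range(n)
    ]


returns = [[1.0, 2.0], [3.0, 6.0]]
cov_equal = ew_covariance(returns, decay=1.0)
cov_recent = ew_covariance(returns, decay=0.5)
```

With `decay` below 1, recent rows dominate the estimate, which is one simple way to align an estimator with more recent risk.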
    Optimizing model-agnostic Random Subspace ensembles. (arXiv:2109.03099v3 [cs.LG] UPDATED)
    This paper presents a model-agnostic ensemble approach for supervised learning. The proposed approach is based on a parametric version of Random Subspace, in which each base model is learned from a feature subset sampled according to a Bernoulli distribution. Parameter optimization is performed using gradient descent and is rendered tractable by using an importance sampling approach that circumvents frequent re-training of the base models after each gradient descent step. The degree of randomization in our parametric Random Subspace is thus automatically tuned through the optimization of the feature selection probabilities. This is an advantage over the standard Random Subspace approach, where the degree of randomization is controlled by a hyper-parameter. Furthermore, the optimized feature selection probabilities can be interpreted as feature importance scores. Our algorithm can also easily incorporate any differentiable regularization term to impose constraints on these importance scores.
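The two ingredients named in the abstract above, Bernoulli feature sampling and importance sampling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names are hypothetical, and the weight shown is simply the standard likelihood ratio between two independent-Bernoulli distributions over feature masks, which lets models trained under old probabilities be re-weighted under new ones without re-training.

```python
import random


def sample_mask(probs, rng):
    """Draw one feature subset: feature j is kept with probability probs[j]."""
    return [rng.random() < p for p in probs]


def importance_weight(mask, old_probs, new_probs):
    """Likelihood ratio of a mask under new vs. old selection probabilities."""
    w = 1.0
    for keep, p_old, p_new in zip(mask, old_probs, new_probs):
        if keep:
            w *= p_new / p_old
        else:
            w *= (1.0 - p_new) / (1.0 - p_old)
    return w


rng = random.Random(0)           # fixed seed so the sketch is reproducible
old_probs = [0.5, 0.5, 0.5]      # probabilities used when the models were trained
new_probs = [0.9, 0.5, 0.1]      # probabilities after a hypothetical gradient step
masks = [sample_mask(old_probs, rng) for _ in range(4)]
weights = [importance_weight(m, old_probs, new_probs) for m in masks]
```

Each base model would be trained only on the features where its mask is `True`; the weights then rescale each model's contribution when estimating the ensemble's loss under `new_probs`.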
    The Lost Art of Mathematical Modelling. (arXiv:2301.08559v1 [q-bio.OT])
    We provide a critique of mathematical biology in light of rapid developments in modern machine learning. We argue that out of the three modelling activities -- (1) formulating models; (2) analysing models; and (3) fitting or comparing models to data -- inherent to mathematical biology, researchers currently focus too much on activity (2) at the cost of (1). This trend, we propose, can be reversed by realising that any given biological phenomenon can be modelled in an infinite number of different ways, through the adoption of an open/pluralistic approach. We explain the open approach using fish locomotion as a case study and illustrate some of the pitfalls -- universalism, creating models of models, etc. -- that hinder mathematical biology. We then ask how we might rediscover a lost art: that of creative mathematical modelling. This article is dedicated to the memory of Edmund Crampin.
    Fair Credit Scorer through Bayesian Approach. (arXiv:2301.08412v1 [cs.LG])
    Machine learning currently plays an increasingly important role in people's lives in areas such as credit scoring, autonomous driving, disease diagnosis, and insurance quoting. However, in many of these areas, machine learning models have behaved unfairly toward some sub-populations, such as particular groups defined by race, sex, or age. This unfair behavior can stem from pre-existing bias in the training dataset due to historical and social factors. In this paper, we focus on a real-world application of credit scoring and construct a fair prediction model by introducing latent variables to remove the correlation between protected attributes, such as sex and age, and the observable feature inputs, including house and job. For detailed implementation, we apply Bayesian approaches, including the Markov Chain Monte Carlo simulation, to estimate our proposed fair model.
    Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning. (arXiv:2301.08491v1 [cs.MA])
    Practical uses of Artificial Intelligence (AI) in the real world have demonstrated the importance of embedding moral choices into intelligent agents. They have also highlighted that defining top-down ethical constraints on AI according to any one type of morality is extremely challenging and can pose risks. A bottom-up learning approach may be more appropriate for studying and developing ethical behavior in AI agents. In particular, we believe that an interesting and insightful starting point is the analysis of emergent behavior of Reinforcement Learning (RL) agents that act according to a predefined set of moral rewards in social dilemmas. In this work, we present a systematic analysis of the choices made by intrinsically-motivated RL agents whose rewards are based on moral theories. We aim to design reward structures that are simplified yet representative of a set of key ethical systems. Therefore, we first define moral reward functions that distinguish between consequence- and norm-based agents, between morality based on societal norms or internal virtues, and between single- and mixed-virtue (e.g., multi-objective) methodologies. Then, we evaluate our approach by modeling repeated dyadic interactions between learning moral agents in three iterated social dilemma games (Prisoner's Dilemma, Volunteer's Dilemma and Stag Hunt). We analyze the impact of different types of morality on the emergence of cooperation, defection or exploitation, and the corresponding social outcomes. Finally, we discuss the implications of these findings for the development of moral agents in artificial and mixed human-AI societies.
    Open-Set Likelihood Maximization for Few-Shot Learning. (arXiv:2301.08390v1 [cs.CV])
    We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighing the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation Open-Set Likelihood Optimization (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection.  ( 2 min )
    Quantum HyperNetworks: Training Binary Neural Networks in Quantum Superposition. (arXiv:2301.08292v1 [quant-ph])
    Binary neural networks, i.e., neural networks whose parameters and activations are constrained to only two possible values, offer a compelling avenue for the deployment of deep learning models on energy- and memory-limited devices. However, their training, architectural design, and hyperparameter tuning remain challenging as these involve multiple computationally expensive combinatorial optimization problems. Here we introduce quantum hypernetworks as a mechanism to train binary neural networks on quantum computers, which unify the search over parameters, hyperparameters, and architectures in a single optimization loop. Through classical simulations, we demonstrate that our approach effectively finds optimal parameters, hyperparameters and architectural choices with high probability on classification problems including a two-dimensional Gaussian dataset and a scaled-down version of the MNIST handwritten digits. We represent our quantum hypernetworks as variational quantum circuits, and find that an optimal circuit depth maximizes the probability of finding performant binary neural networks. Our unified approach provides an immense scope for other applications in the field of machine learning.  ( 2 min )
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.  ( 2 min )
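To make the sparse routing idea in the abstract above concrete, here is a minimal sketch of generic top-k gating over a set of experts. It is not AdaEnsemble's implementation: the function names, the choice of `k`, and the scalar expert outputs are all assumptions made for illustration, and the early-exit mechanism is not shown.

```python
import math


def top_k_gate(logits, k):
    """Keep the k largest gate logits, softmax over them, zero out the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                 # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]


def mixture_output(expert_outputs, gates):
    """Combine only the selected experts; unselected experts are skipped."""
    return sum(g * y for g, y in zip(gates, expert_outputs) if g > 0.0)


gates = top_k_gate([2.0, -1.0, 2.0, 0.5], k=2)      # one example's gate logits
combined = mixture_output([10.0, 20.0, 30.0, 40.0], gates)
```

Because unselected experts receive a gate of exactly zero, their outputs need not be computed at all, which is where the inference savings of sparse gating come from.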
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluation points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.  ( 2 min )
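The claim above that two evaluation points can integrate every univariate quadratic exactly can be checked with the textbook symmetric two-point rule, which evaluates the integrand at mean ± one standard deviation with equal weights. The sketch below shows that rule only; it is an assumption that the paper's per-coordinate construction reduces to something of this form, and the quasirandom sequence itself is not reproduced here.

```python
def two_point_expectation(f, mean, std):
    """Approximate E[f(X)] using the two points mean + std and mean - std."""
    return 0.5 * (f(mean + std) + f(mean - std))


mean, std = 1.0, 2.0


def quadratic(x):
    return 2.0 + 3.0 * x + 4.0 * x * x


# Exact value for any X with this mean and std: a + b*mean + c*(mean^2 + std^2)
exact_quadratic = 2.0 + 3.0 * mean + 4.0 * (mean ** 2 + std ** 2)
approx_quadratic = two_point_expectation(quadratic, mean, std)

# For a symmetric X, E[X^3] = mean^3 + 3*mean*std^2, which the rule also matches.
exact_cubic = mean ** 3 + 3.0 * mean * std ** 2
approx_cubic = two_point_expectation(lambda x: x ** 3, mean, std)
```

The quadratic case is exact for any distribution with the given mean and variance, while the cubic case additionally needs a vanishing third central moment, which matches the symmetry condition stated in the abstract.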
    Deep Reinforcement Learning for Power Trading. (arXiv:2301.08360v1 [q-fin.TR])
    The Dutch power market includes a day-ahead market and an auction-like intraday balancing market. The varying supply and demand of power and its uncertainty induce an imbalance, which causes differing power prices in these two markets and creates an opportunity for arbitrage. In this paper, we present collaborative dual-agent reinforcement learning (RL) for bi-level simulation and optimization of European power arbitrage trading. Moreover, we propose two novel practical implementations specifically addressing the electricity power market. Leveraging the concept of imitation learning, the RL agent's reward is reformulated to take prior domain knowledge into account, which results in better convergence during training and, moreover, improves and generalizes performance. In addition, tranching of orders improves the bidding success rate and significantly raises the P&L. We show that each method contributes significantly to the overall performance uplift, and that the integrated methodology achieves about a three-fold improvement in cumulative P&L over the original agent and outperforms the highest benchmark policy by around 50% while exhibiting efficient computational performance.  ( 2 min )
    Evaluation of the potential of Near Infrared Hyperspectral Imaging for monitoring the invasive brown marmorated stink bug. (arXiv:2301.08252v1 [eess.IV])
    The brown marmorated stink bug (BMSB), Halyomorpha halys, is an invasive insect pest of global importance that damages several crops, compromising agri-food production. Field monitoring procedures are fundamental to perform risk assessment operations, in order to respond promptly to crop infestations and avoid economic losses. To improve pest management, spectral cameras mounted on Unmanned Aerial Vehicles (UAVs) and other Internet of Things (IoT) devices, such as smart traps or unmanned ground vehicles, could be used as an innovative technology allowing fast, efficient and real-time monitoring of insect infestations. The present study consists of a preliminary evaluation at the laboratory level of Near Infrared Hyperspectral Imaging (NIR-HSI) as a possible technology to detect BMSB specimens on different vegetal backgrounds, overcoming the problem of BMSB mimicry. Hyperspectral images of BMSB were acquired in the 980-1660 nm range, considering different vegetal backgrounds selected to mimic a real field application scene. Classification models were obtained following two different chemometric approaches. The first approach was focused on modelling spectral information and selecting relevant spectral regions for discrimination by means of sparse-based variable selection coupled with the Soft Partial Least Squares Discriminant Analysis (s-Soft PLS-DA) classification algorithm. The second approach was based on modelling spatial and spectral features contained in the hyperspectral images using Convolutional Neural Networks (CNN). Finally, to further improve BMSB detection ability, the two strategies were merged, considering only the spectral regions selected by s-Soft PLS-DA for CNN modelling.  ( 2 min )
    Causal conditional hidden Markov model for multimodal traffic prediction. (arXiv:2301.08249v1 [cs.LG])
    Multimodal traffic flow can reflect the health of the transportation system, and its prediction is crucial to urban traffic management. Recent works overemphasize spatio-temporal correlations of traffic flow, ignoring the physical concepts that lead to the generation of observations and their causal relationship. Spatio-temporal correlations are considered unstable under the influence of different conditions, and spurious correlations may exist in observations. In this paper, we analyze the physical concepts affecting the generation of multimode traffic flow from the perspective of the observation generation principle and propose a Causal Conditional Hidden Markov Model (CCHMM) to predict multimodal traffic flow. In the latent variables inference stage, a posterior network disentangles the causal representations of the concepts of interest from conditional information and observations, and a causal propagation module mines their causal relationship. In the data generation stage, a prior network samples the causal latent variables from the prior distribution and feeds them into the generator to generate multimodal traffic flow. We use a mutually supervised training method for the prior and posterior to enhance the identifiability of the model. Experiments on real-world datasets show that CCHMM can effectively disentangle causal representations of concepts of interest and identify causality, and accurately predict multimodal traffic flow.  ( 2 min )
    Advanced Scaling Methods for VNF deployment with Reinforcement Learning. (arXiv:2301.08325v1 [cs.NI])
    Network function virtualization (NFV) and software-defined network (SDN) have become emerging network paradigms, allowing virtualized network function (VNF) deployment at a low cost. Even though VNF deployment can be flexible, it is still challenging to optimize VNF deployment due to its high complexity. Several studies have approached the task as dynamic programming, e.g., integer linear programming (ILP). However, optimizing VNF deployment for highly complex networks remains a challenge. Alternatively, reinforcement learning (RL) based approaches have been proposed to optimize this task, especially to employ a scaling action-based method which can deploy VNFs within less computational time. However, the model architecture can be improved further to generalize to the different networking settings. In this paper, we propose an enhanced model which can be adapted to more general network settings. We adopt the improved GNN architecture and a few techniques to obtain a better node representation for the VNF deployment task. Furthermore, we apply a recently proposed RL method, phasic policy gradient (PPG), to leverage the shared representation of the service function chain (SFC) generation model from the value function. We evaluate the proposed method in various scenarios, achieving a better QoS with minimum resource utilization compared to the previous methods. Finally, as a qualitative evaluation, we analyze our proposed encoder's representation for the nodes, which shows a more disentangled representation.  ( 2 min )
    Forecasting subcritical cylinder wakes with Fourier Neural Operators. (arXiv:2301.08290v1 [physics.flu-dyn])
    We apply Fourier neural operators (FNOs), a state-of-the-art operator learning technique, to forecast the temporal evolution of experimentally measured velocity fields. FNOs are a recently developed machine learning method capable of approximating solution operators to systems of partial differential equations through data alone. The learned FNO solution operator can be evaluated in milliseconds, potentially enabling faster-than-real-time modeling for predictive flow control in physical systems. Here we use FNOs to predict how physical fluid flows evolve in time, training with particle image velocimetry measurements depicting cylinder wakes in the subcritical vortex shedding regime. We train separate FNOs at Reynolds numbers ranging from Re = 240 to Re = 3060 and study how increasingly turbulent flow phenomena impact prediction accuracy. We focus here on a short prediction horizon of ten non-dimensionalized time-steps, as would be relevant for problems of predictive flow control. We find that FNOs are capable of accurately predicting the evolution of experimental velocity fields throughout the range of Reynolds numbers tested (L2 norm error < 0.1) despite being provided with limited and imperfect flow observations. Given these results, we conclude that this method holds significant potential for real-time predictive flow control of physical systems.  ( 2 min )
    Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems. (arXiv:2301.08278v1 [cs.MA])
    The problem of cooperation is of fundamental importance for human societies, with examples ranging from navigating road junctions to negotiating climate treaties. As the use of AI becomes more pervasive within society, the need for socially intelligent agents that are able to navigate these complex dilemmas is becoming increasingly evident. Direct punishment is a ubiquitous social mechanism that has been shown to benefit the emergence of cooperation within the natural world; however, no prior work has investigated its impact on populations of learning agents. Moreover, although the use of all forms of punishment in the natural world is strongly coupled with partner selection and reputation, no existing work has provided a holistic analysis of their combination within multi-agent systems. In this paper, we present a comprehensive analysis of the behaviors and learning dynamics associated with direct punishment in multi-agent reinforcement learning systems and how this compares to third-party punishment, when both forms of punishment are combined with other social mechanisms such as partner selection and reputation. We provide an extensive and systematic evaluation of the impact of these key mechanisms on the emergence of cooperation. Finally, we discuss the implications of the use of these mechanisms in the design of cooperative AI systems.  ( 2 min )
  • Open

    Online Estimation of Network Point Processes for Event Streams. (arXiv:2009.01742v2 [cs.SI] UPDATED)
    A common goal in network modeling is to uncover the latent community structure present among nodes. For many real-world networks, the true connections consist of events arriving as streams, which are then aggregated to form edges, ignoring the dynamic temporal component. A natural way to take account of these temporal dynamics of interactions is to use point processes as the foundation of network models for community detection. Computational complexity hampers the scalability of such approaches to large sparse networks. To circumvent this challenge, we propose a fast online variational inference algorithm for estimating the latent structure underlying dynamic event arrivals on a network, using continuous-time point process latent network models. We describe this procedure for network models capturing community structure. This structure can be learned as new events are observed on the network, updating the inferred community assignments. We investigate the theoretical properties of such an inference scheme, and provide regret bounds on the loss function of this procedure. The proposed inference procedure is then thoroughly compared, using both simulation studies and real data, to non-online variants. We demonstrate that online inference can obtain comparable performance, in terms of community recovery, to non-online variants, while realising computational gains. Our proposed inference framework can also be readily modified to incorporate other popular network structures.  ( 2 min )
    Neural Architecture Search: Insights from 1000 Papers. (arXiv:2301.08727v1 [cs.LG])
    In the past decade, advances in deep learning have resulted in breakthroughs in a variety of areas, including computer vision, natural language understanding, speech recognition, and reinforcement learning. Specialized, high-performing neural architectures are crucial to the success of deep learning in these areas. Neural architecture search (NAS), the process of automating the design of neural architectures for a given task, is an inevitable next step in automating machine learning and has already outpaced the best human-designed architectures on many tasks. In the past few years, research in NAS has been progressing rapidly, with over 1000 papers released since 2020. In this survey, we provide an organized and comprehensive guide to neural architecture search. We give a taxonomy of search spaces, algorithms, and speedup techniques, and we discuss resources such as benchmarks, best practices, other surveys, and open-source libraries.
    Bayesian Spatial Predictive Synthesis. (arXiv:2203.05197v3 [stat.ME] UPDATED)
    Spatial data are characterized by their spatial dependence, which is often complex, non-linear, and difficult to capture with a single model. Significant levels of model uncertainty -- arising from these characteristics -- cannot be resolved by model selection or simple ensemble methods. We address this issue by proposing a novel methodology that captures spatially varying model uncertainty, which we call Bayesian spatial predictive synthesis. Our proposal is derived by identifying the theoretically best approximate model under reasonable conditions, which is a latent factor spatially varying coefficient model in the Bayesian predictive synthesis framework. We then show that our proposed method produces exact minimax predictive distributions, providing finite sample guarantees. Two MCMC strategies are implemented for full uncertainty quantification, as well as a variational inference strategy for fast point inference. We also extend the estimation strategy for general responses. Through simulation examples and two real data applications, we demonstrate that our proposed spatial Bayesian predictive synthesis outperforms standard spatial models and advanced machine learning methods in terms of predictive accuracy.
    Multi armed bandits and quantum channel oracles. (arXiv:2301.08544v1 [quant-ph])
    Multi armed bandits are one of the theoretical pillars of reinforcement learning. Recently, the investigation of quantum algorithms for multi armed bandit problems was started, and it was found that a quadratic speed-up is possible when the arms and the randomness of the rewards of the arms can be queried in superposition. Here we introduce further bandit models where we only have limited access to the randomness of the rewards, but we can still query the arms in superposition. We show that this impedes any speed-up of quantum algorithms.
    AI-assisted neutron spectroscopy using active learning with log-Gaussian processes. (arXiv:2209.00980v2 [physics.data-an] UPDATED)
    To understand the origins of materials properties, neutron scattering experiments at three-axes spectrometers (TAS) investigate magnetic and lattice excitations in a sample by measuring intensity distributions in its momentum (Q) and energy (E) space. The high demand and limited availability of beam time for TAS experiments however raise the natural question whether we can improve their efficiency or make better use of the experimenter's time. In fact, using TAS, there are a number of scientific questions that require searching for signals of interest in a particular region of Q-E space, but when done manually, it is time consuming and inefficient since the measurement points may be placed in uninformative regions such as the background. Active learning is a promising general machine learning approach that allows to iteratively detect informative regions of signal autonomously, i.e., without human interference, thus avoiding unnecessary measurements and speeding up the experiment. In addition, the autonomous mode allows experimenters to focus on other relevant tasks in the meantime. The approach that we describe in this article exploits log-Gaussian processes which, due to the logarithmic transformation, have the largest approximation uncertainties in regions of signal. Maximizing uncertainty as an acquisition function hence directly yields locations for informative measurements. We demonstrate the benefits of our approach on outcomes of a real neutron experiment at the thermal TAS EIGER (PSI) as well as on results of a benchmark in a synthetic setting including numerous different excitations.
    Offline Policy Evaluation with Out-of-Sample Guarantees. (arXiv:2301.08649v1 [stat.ML])
    We consider the problem of evaluating the performance of a decision policy using past observational data. The outcome of a policy is measured in terms of a loss or disutility (or negative reward) and the problem is to draw valid inferences about the out-of-sample loss of the specified policy when the past data is observed under a, possibly unknown, policy. Using a sample-splitting method, we show that it is possible to draw such inferences with finite-sample coverage guarantees that evaluate the entire loss distribution. Importantly, the method takes into account model misspecifications of the past policy -- including unmeasured confounding. The evaluation method can be used to certify the performance of a policy using observational data under an explicitly specified range of credible model assumptions.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v3 [stat.ML] UPDATED)
    We address the problem of estimating topological features from data in high dimensional Euclidean spaces under the manifold assumption. Our approach is based on the computation of persistent homology of the space of data points endowed with a sample metric known as Fermat distance. We prove that such metric space converges almost surely to the manifold itself endowed with an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This fact implies the convergence of the associated persistence diagrams. The use of this intrinsic distance when computing persistent homology presents advantageous properties such as robustness to the presence of outliers in the input data and less sensitiveness to the particular embedding of the underlying manifold in the ambient space. We use these ideas to propose and implement a method for pattern recognition and anomaly detection in time series, which is evaluated in applications to real data.
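    The sample Fermat distance mentioned above can be illustrated with a short sketch. It assumes the commonly used definition, in which a hop between two sample points costs their Euclidean distance raised to a power p >= 1 and the distance between two points is the cheapest path through the sample. The function name `fermat_distances` and the use of Floyd-Warshall are illustrative choices, not taken from the paper.

```python
import math

def fermat_distances(points, p=2.0):
    """All-pairs sample Fermat distance (assumed definition): cheapest path
    through the sample, where a hop from x to y costs ||x - y||**p."""
    n = len(points)
    # Start from the direct-hop cost between every pair of sample points.
    dist = [[math.dist(a, b) ** p for b in points] for a in points]
    # Floyd-Warshall relaxation: allow paths through intermediate sample points.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist

# Three collinear points: the direct hop 0 -> 2 costs 2**2 = 4,
# while two short hops through the middle point cost 1 + 1 = 2.
pts = [(0.0,), (1.0,), (2.0,)]
d = fermat_distances(pts, p=2.0)
print(d[0][2])  # 2.0
```

    With p > 1, paths that take many short hops through densely sampled regions become cheaper than long direct jumps, which is how the metric reflects the sampling density.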
    Spectral embedding of weighted graphs. (arXiv:1910.05534v4 [stat.ML] UPDATED)
    When analyzing weighted networks using spectral embedding, a judicious transformation of the edge weights may produce better results. To formalize this idea, we consider the asymptotic behavior of spectral embedding for different edge-weight representations, under a generic low rank model. We measure the quality of different embeddings -- which can be on entirely different scales -- by how easy it is to distinguish communities, in an information-theoretic sense. For common types of weighted graphs, such as count networks or p-value networks, we find that transformations such as tempering or thresholding can be highly beneficial, both in theory and in practice.
    Mixed-Integer Optimization with Constraint Learning. (arXiv:2111.04469v2 [math.OC] UPDATED)
    We establish a broad methodological foundation for mixed-integer optimization with learned constraints. We propose an end-to-end pipeline for data-driven decision making in which constraints and objectives are directly learned from data using machine learning, and the trained models are embedded in an optimization formulation. We exploit the mixed-integer optimization-representability of many machine learning methods, including linear models, decision trees, ensembles, and multi-layer perceptrons, which allows us to capture various underlying relationships between decisions, contextual variables, and outcomes. We also introduce two approaches for handling the inherent uncertainty of learning from data. First, we characterize a decision trust region using the convex hull of the observations, to ensure credible recommendations and avoid extrapolation. We efficiently incorporate this representation using column generation and propose a more flexible formulation to deal with low-density regions and high-dimensional datasets. Then, we propose an ensemble learning approach that enforces constraint satisfaction over multiple bootstrapped estimators or multiple algorithms. In combination with domain-driven components, the embedded models and trust region define a mixed-integer optimization problem for prescription generation. We implement this framework as a Python package (OptiCL) for practitioners. We demonstrate the method in both World Food Programme planning and chemotherapy optimization. The case studies illustrate the framework's ability to generate high-quality prescriptions as well as the value added by the trust region, the use of ensembles to control model robustness, the consideration of multiple machine learning methods, and the inclusion of multiple learned constraints.
    Tight bounds for maximum $\ell_1$-margin classifiers. (arXiv:2212.03783v2 [stat.ML] UPDATED)
    Popular iterative algorithms such as boosting methods and coordinate descent on linear models converge to the maximum $\ell_1$-margin classifier, a.k.a. sparse hard-margin SVM, in high dimensional regimes where the data is linearly separable. Previous works consistently show that many estimators relying on the $\ell_1$-norm achieve improved statistical rates for hard sparse ground truths. We show that surprisingly, this adaptivity does not apply to the maximum $\ell_1$-margin classifier for a standard discriminative setting. In particular, for the noiseless setting, we prove tight upper and lower bounds for the prediction error that match existing rates of order $\frac{\|w^*\|_1^{2/3}}{n^{1/3}}$ for general ground truths. To complete the picture, we show that when interpolating noisy observations, the error vanishes at a rate of order $\frac{1}{\sqrt{\log(d/n)}}$. We are therefore first to show benign overfitting for the maximum $\ell_1$-margin classifier.
    Detection of Small Holes by the Scale-Invariant Robust Density-Aware Distance (RDAD) Filtration. (arXiv:2204.07821v2 [math.ST] UPDATED)
    A novel topological-data-analytical (TDA) method is proposed to distinguish, from noise, small holes surrounded by high-density regions of a probability density function. The proposed method is robust against additive noise and outliers. Traditional TDA tools, like those based on the distance filtration, often struggle to distinguish small features from noise, because both have short persistences. An alternative filtration, called the Robust Density-Aware Distance (RDAD) filtration, is proposed to prolong the persistences of small holes of high-density regions. This is achieved by weighting the distance function by the density in the sense of Bell et al. The concept of distance-to-measure is incorporated to enhance stability and mitigate noise. The persistence-prolonging property and robustness of the proposed filtration are rigorously established, and numerical experiments are presented to demonstrate the proposed filtration's utility in identifying small holes.
    Learning from non-irreducible Markov chains. (arXiv:2110.04338v2 [math.ST] UPDATED)
    Most of the existing literature on supervised machine learning problems focuses on the case when the training data set is drawn from an i.i.d. sample. However, many practical problems are characterized by temporal dependence and strong correlation between the marginals of the data-generating process, suggesting that the i.i.d. assumption is not always justified. This problem has already been considered in the context of Markov chains satisfying the Doeblin condition. This condition, among other things, implies that the chain is not singular in its behavior, i.e. it is irreducible. In this article, we focus on the case when the training data set is drawn from a not necessarily irreducible Markov chain. Under the assumption that the chain is uniformly ergodic with respect to the $\mathrm{L}^1$-Wasserstein distance, and certain regularity assumptions on the hypothesis class and the state space of the chain, we first obtain a uniform convergence result for the corresponding sample error, and then we conclude learnability of the approximate sample error minimization algorithm and find its generalization bounds. At the end, a relative uniform convergence result for the sample error is also discussed.
    Weighted Sum-Rate Maximization With Causal Inference for Latent Interference Estimation. (arXiv:2211.08327v2 [cs.IT] UPDATED)
    The paper investigates the weighted sum-rate maximization (WSRM) problem with latent interfering sources outside the known network, whose power allocation policy is hidden from and uncontrollable to optimization. The paper extends the famous alternate optimization algorithm weighted minimum mean square error (WMMSE) [1] under a causal inference framework to tackle WSRM. Specifically, with the possibility of power policy shifting in the hidden network, computing an iterating direction based only on the observed interference inherently implies that the counterfactual is ignored in decision making. A method called synthetic control (SC) is used to estimate the counterfactual. For any link in the known network, SC constructs a convex combination of the interference on other links and uses it as an estimate for the counterfactual. Power iteration in the proposed SC-WMMSE is performed taking into account both the observed interference and its counterfactual. SC-WMMSE requires no more information than the original WMMSE in the optimization stage. To the best of our knowledge, this is the first paper that explores the potential of SC in assisting mathematical optimization in addressing classic wireless optimization problems. Numerical results suggest the superiority of the SC-WMMSE over the original in both convergence and objective.
    An Efficient Quadrature Sequence and Sparsifying Methodology for Mean-Field Variational Inference. (arXiv:2301.08374v1 [cs.LG])
    This work proposes a quasirandom sequence of quadratures for high-dimensional mean-field variational inference and a related sparsifying methodology. Each iterate of the sequence contains two evaluation points that combine to correctly integrate all univariate quadratic functions, as well as univariate cubics if the mean-field factors are symmetric. More importantly, averaging results over short subsequences achieves periodic exactness on a much larger space of multivariate polynomials of quadratic total degree. This framework is devised by first considering stochastic blocked mean-field quadratures, which may be useful in other contexts. By replacing pseudorandom sequences with quasirandom sequences, over half of all multivariate quadratic basis functions integrate exactly with only 4 function evaluations, and the exactness dimension increases for longer subsequences. Analysis shows how these efficient integrals characterize the dominant log-posterior contributions to mean-field variational approximations, including diagonal Hessian approximations, to support a robust sparsifying methodology in deep learning algorithms. A numerical demonstration of this approach on a simple Convolutional Neural Network for MNIST retains high test accuracy, 96.9%, while training over 98.9% of parameters to zero in only 10 epochs, bearing potential to reduce both storage and energy requirements for deep learning models.
    Self-Supervised Learning for Data Scarcity in a Fatigue Damage Prognostic Problem. (arXiv:2301.08441v1 [stat.ML])
    With the increasing availability of data for Prognostics and Health Management (PHM), Deep Learning (DL) techniques are now the subject of considerable attention for this application, often achieving more accurate Remaining Useful Life (RUL) predictions. However, one of the major challenges for DL techniques resides in the difficulty of obtaining large amounts of labelled data on industrial systems. To overcome this lack of labelled data, an emerging learning technique is considered in our work: Self-Supervised Learning, a sub-category of unsupervised learning approaches. This paper aims to investigate whether pre-training DL models in a self-supervised way on unlabelled sensor data can be useful for RUL estimation with only Few-Shot Learning, i.e. with scarce labelled data. In this research, a fatigue damage prognostics problem is addressed, through the estimation of the RUL of aluminum alloy panels (typical of aerospace structures) subject to fatigue cracks from strain gauge data. Synthetic datasets composed of strain data are used, allowing an extensive investigation of the influence of the dataset size on the predictive performance. Results show that the self-supervised pre-trained models are able to significantly outperform the non-pre-trained models in the downstream RUL prediction task, and with less computational expense, showing promising results in prognostic tasks when only limited labelled data is available.
    Holistically Explainable Vision Transformers. (arXiv:2301.08669v1 [cs.CV])
    Transformers increasingly dominate the machine learning landscape across many tasks and domains, which increases the importance of understanding their outputs. While their attention modules provide partial insight into their inner workings, the attention scores have been shown to be insufficient for explaining the models as a whole. To address this, we propose B-cos transformers, which inherently provide holistic explanations for their decisions. Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear, which allows us to faithfully summarise the entire transformer via a single linear transform. We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs on ImageNet. Code will be made available soon.
    Parametrization Cookbook: A set of Bijective Parametrizations for using Machine Learning methods in Statistical Inference. (arXiv:2301.08297v1 [stat.CO])
    We present in this paper a way to transform a constrained statistical inference problem into an unconstrained one in order to be able to use modern computational methods, such as those based on automatic differentiation, GPU computing, and stochastic gradients with mini-batches. Unlike the parametrizations classically used in Machine Learning, the parametrizations introduced here are all bijective and are even diffeomorphisms, thus preserving the properties that matter from a statistical inference point of view, first of all identifiability. This cookbook presents a set of recipes for transforming a constrained problem into an unconstrained one. For ease of use, this paper is at the same time a cookbook and a Python package that supports parametrizations with numpy, JAX, and PyTorch, and that offers a high-level, expressive interface for describing a parametrization, so that a difficult statistical inference problem becomes an easier one addressable with modern optimization tools.
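    To make the idea of a bijective parametrization concrete, here is a minimal sketch for the probability simplex. It is not the API of the package described above: the function names `to_simplex` and `from_simplex` are hypothetical, and the additive log-ratio map is chosen only as a well-known example. A plain softmax over n logits is not bijective, because adding a constant to every logit gives the same probabilities; pinning the last logit to 0 removes that redundancy.

```python
import math

def to_simplex(theta):
    """Map an unconstrained vector in R^(n-1) to the open simplex in R^n."""
    logits = list(theta) + [0.0]   # pin the last logit to 0 so the map is one-to-one
    m = max(logits)                # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def from_simplex(p):
    """Inverse map: log-ratios of each component against the last one."""
    return [math.log(pi / p[-1]) for pi in p[:-1]]

theta = [0.3, -1.2]
p = to_simplex(theta)        # three strictly positive numbers summing to 1
theta_back = from_simplex(p) # recovers theta up to floating-point error
```

    Because each point of the open simplex corresponds to exactly one `theta`, an optimizer can work on `theta` without constraints while the statistical parameter stays identifiable.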
    Estimation of Large Financial Covariances: A Cross-Validation Approach. (arXiv:2012.05757v2 [stat.ML] UPDATED)
    We introduce a novel covariance estimator for portfolio selection that adapts to the non-stationary or persistent heteroskedastic environments of financial time series by employing exponentially weighted averages and nonlinearly shrinking the sample eigenvalues through cross-validation. Our estimator is structure agnostic, transparent, and computationally feasible in large dimensions. By correcting the biases in the sample eigenvalues and aligning our estimator to more recent risk, we demonstrate that our estimator performs well in large dimensions against existing state-of-the-art static and dynamic covariance shrinkage estimators through simulations and with an empirical application in active portfolio management.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v4 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered 1-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to the estimation problems of sparse covariance matrix estimation, sparse linear regression (i.e., compressed sensing), and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, where the underlying distribution of heavy-tailed data is assumed to have bounded moments of some order. We propose new estimators based on 1-bit quantized data. In the sub-Gaussian regime, our estimators achieve near minimax rates, indicating that our quantization scheme costs very little. In the heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in a 1-bit quantized and heavy-tailed setting, or already improve on existing comparable results in some respect. Under the observations in our setting, the rates are almost tight in compressed sensing and matrix completion. Our 1-bit compressed sensing results feature a general sensing vector that is sub-Gaussian or even heavy-tailed. We are also the first to investigate a novel setting where both the covariate and response are quantized. In addition, our approach to 1-bit matrix completion does not rely on likelihood and represents the first method robust to pre-quantization noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
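    The three steps named above (truncation, dithering, quantization) can be sketched for the simplest case of estimating a scalar mean. This is an illustration under stated assumptions, not the paper's estimators: the function names and the choice of `delta` are hypothetical. It relies on a standard fact: if |x| <= delta and tau is uniform on [-delta, delta], then delta * sign(x + tau) has expectation x, so averaging the scaled bits recovers the mean.

```python
import random

def one_bit_dithered(x, delta, rng):
    """Truncate x to [-delta, delta], add uniform dither, keep only the sign."""
    x_trunc = max(-delta, min(delta, x))   # truncation step
    tau = rng.uniform(-delta, delta)       # uniform dither
    return 1.0 if x_trunc + tau >= 0 else -1.0  # 1-bit quantization

def estimate_mean(samples, delta, rng):
    """Average the scaled bits; unbiased for the mean of the truncated data."""
    bits = [one_bit_dithered(x, delta, rng) for x in samples]
    return delta * sum(bits) / len(bits)

rng = random.Random(0)
samples = [rng.gauss(0.5, 1.0) for _ in range(200_000)]  # true mean 0.5
est = estimate_mean(samples, delta=6.0, rng=rng)
```

    A larger `delta` reduces truncation bias but inflates the variance of each bit, so in practice the threshold trades one against the other.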
    Within-group fairness: A guidance for more sound between-group fairness. (arXiv:2301.08375v1 [stat.ML])
    As they have a vital effect on social decision-making, AI algorithms should not only be accurate but also should not be unfair to certain sensitive groups (e.g., non-white, women). Various specially designed AI algorithms have been developed to ensure that trained AI models are fair between sensitive groups. In this paper, we raise a new issue: between-group fair AI models could treat individuals in the same sensitive group unfairly. We introduce a new concept of fairness, so-called within-group fairness, which requires that AI models be fair for those in the same sensitive group as well as those in different sensitive groups. We materialize the concept of within-group fairness by proposing corresponding mathematical definitions and developing learning algorithms to control within-group fairness and between-group fairness simultaneously. Numerical studies show that the proposed learning algorithms improve within-group fairness without sacrificing accuracy or between-group fairness.
    AdaEnsemble: Learning Adaptively Sparse Structured Ensemble Network for Click-Through Rate Prediction. (arXiv:2301.08353v1 [cs.IR])
    Learning feature interactions is crucial to success for large-scale CTR prediction in recommender systems and Ads ranking. Researchers and practitioners extensively proposed various neural network architectures for searching and modeling feature interactions. However, we observe that different datasets favor different neural network architectures and feature interaction types, suggesting that different feature interaction learning methods may have their own unique advantages. Inspired by this observation, we propose AdaEnsemble: a Sparsely-Gated Mixture-of-Experts (SparseMoE) architecture that can leverage the strengths of heterogeneous feature interaction experts and adaptively learns the routing to a sparse combination of experts for each example, allowing us to build a dynamic hierarchy of the feature interactions of different types and orders. To further improve the prediction accuracy and inference efficiency, we incorporate the dynamic early exiting mechanism for feature interaction depth selection. The AdaEnsemble can adaptively choose the feature interaction depth and find the corresponding SparseMoE stacking layer to exit and compute prediction from. Therefore, our proposed architecture inherits the advantages of the exponential combinations of sparsely gated experts within SparseMoE layers and further dynamically selects the optimal feature interaction depth without executing deeper layers. We implement the proposed AdaEnsemble and evaluate its performance on real-world datasets. Extensive experiment results demonstrate the efficiency and effectiveness of AdaEnsemble over state-of-the-art models.
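    The routing idea behind a sparsely gated mixture of experts can be sketched in a few lines. This is a generic top-k gate, not AdaEnsemble's implementation: the function names are hypothetical, and the gate logits are passed in directly, whereas a real model would compute them with a learned router.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} using a softmax over the k largest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                  # stabilize the exponentials
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    weights = top_k_gate(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four toy "experts"; only the two with the largest gate logits are run.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 4 * x, lambda x: 0.0]
out = moe_forward(1.0, experts, gate_logits=[0.0, 2.0, 2.0, -1.0], k=2)
print(out)  # 3.0  (0.5 * 2 + 0.5 * 4)
```

    Since unselected experts are never called, compute per example scales with k and not with the total number of experts.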
    Sequence Generation via Subsequence Similarity: Theory and Application to UAV Identification. (arXiv:2301.08403v1 [cs.LG])
    The ability to generate synthetic sequences is crucial for a wide range of applications, and recent advances in deep learning architectures and generative frameworks have greatly facilitated this process. Particularly, unconditional one-shot generative models constitute an attractive line of research that focuses on capturing the internal information of a single image, video, etc. to generate samples with similar contents. Since many of those one-shot models are shifting toward efficient non-deep and non-adversarial approaches, we examine the versatility of a one-shot generative model for augmenting whole datasets. In this work, we focus on how similarity at the subsequence level affects similarity at the sequence level, and derive bounds on the optimal transport of real and generated sequences based on that of corresponding subsequences. We use a one-shot generative model to sample from the vicinity of individual sequences and generate subsequence-similar ones and demonstrate the improvement of this approach by applying it to the problem of Unmanned Aerial Vehicle (UAV) identification using limited radio-frequency (RF) signals. In the context of UAV identification, RF fingerprinting is an effective method for distinguishing legitimate devices from malicious ones, but heterogeneous environments and channel impairments can impose data scarcity and affect the performance of classification models. By using subsequence similarity to augment sequences of RF data with a low ratio (5%-20%) of training dataset, we achieve significant improvements in performance metrics such as accuracy, precision, recall, and F1 score.

  • Open

    First AI-powered "robot" lawyer will represent defendant in court next month
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Artificial Intelligence Crypto Projects. Which AI crypto do you think is most promising?
    submitted by /u/Existing-Adeptness-6 [link] [comments]  ( 40 min )
    Act as a salesman! You absolutely need to sell me a rock...
    submitted by /u/Imagine-your-success [link] [comments]  ( 43 min )
    Local models and servers for video processing
    Hey folks: I'm seeing a whole host of local models and servers that can be used to process video content locally without needing to rely on a CSP for processing. DeepStack, and soon, CodeProject.ai Server, etc. What - if any - local models and servers would you recommend for video analysis? Facial recognition/grouping, object and logo recognition, tracking, etc. Thanks! submitted by /u/avguru1 [link] [comments]  ( 40 min )
    Cool AI Demo where you can be NEO in the matrix and ask Morpheus anything!
    https://quantum-engine.ai a blurb from their company site Quantum Engine is dedicated to building the foundation for the future of immersive entertainment crafted through advanced artificial intelligence technology. Our ultimate vision is to empower individuals to become any character and exist in an endless, open-world narrative that unfolds based on the interactions in a scene. We invite you to experience our beta demo, readily accessible at https://quantum-engine.ai/ Drawing inspiration from The Matrix film, the Quantum Engine demo allows for a unique conversational experience with the character, Morpheus, in the iconic “Construct” scene that evolves in real-time. Available in over 30 languages, with the capability of conversing in 6 languages (English, Chinese, French, Japanese, German, and Spanish), the potential opportunities for interaction are limitless. Additionally, the ability to spawn objects within the Construct using only voice commands, such as "Spawn an [object] in the construct" or "Give me an [object] in the construct." The ability to generate objects in real time provides a glimpse into the long-term vision of bringing a scene to life in real-time. In the upcoming release, we will introduce the ability for anyone to upload their own scripts, allowing Quantum Engine to generate a personalized experience for your favorite movie, show, or game. submitted by /u/techmanj [link] [comments]  ( 41 min )
    What's the best way to AI upscale this image? Most just smooth it out too much.
    submitted by /u/tiziano_is_fine [link] [comments]  ( 40 min )
    Is it possible to use ai to remove artifacting from old 16mm/35mm film?
    You know the scratches and lines that appear on old film reels? Is there an ai program that can remove them? submitted by /u/Conkers-Good-Furday [link] [comments]  ( 40 min )
    check out my Instagram page for my MidJourney art
    submitted by /u/QwinTipiKool [link] [comments]  ( 40 min )
    Use AI to better understand data from a DB
    Is it possible to produce meaningful results if you apply AI tools to a large DB in order to better understand your data? What is the process for these things in general? submitted by /u/mpanikos2 [link] [comments]  ( 40 min )
    FREE Photoshop Plugin With Stable Diffusion Install!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Aibusinesstool Is the Largest Sortable and Filterable AI Tools Directory, Save Yourself Tons of Hours
    https://aibusinesstool.com/ is the largest sortable and filterable AI tools directory. At the time of this submission, we have over more tools listed on our site. There are various categories like text generation, image generation, SEO etc. You can save your favorite AI tools and share the list too. submitted by /u/aibusinesstool [link] [comments]  ( 40 min )
    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Talk: ML Model Production Deployment Checklist
    Production deployment is the first step to getting trained machine learning models out of the lab and running at scale to deliver ML-generated predictions and insights to improve outcomes. This tech talk will provide you with a short checklist to review for every deployment to ensure that your models run and scale appropriately every time. As always, it will be recorded and posted to the archives channel. Join our Discord server to tune in live this Thursday, Jan 26 at 12:30PM EST. submitted by /u/modzykirsten [link] [comments]  ( 40 min )
    I guess this AI has some consciousness
    submitted by /u/EnvironmentalRadio73 [link] [comments]  ( 40 min )
    How do I know if text is AI generated? We tested the tools
    submitted by /u/Phishstixxx [link] [comments]  ( 40 min )
    How hard is it to edit images of your dog and yourself?
    My dog and I should look like Superman. What is the best method to get the best results? How long will it take if hardware doesn't matter? submitted by /u/Thesmallcookie [link] [comments]  ( 40 min )
    Hi guys, do you know which applications or websites are using Azure OpenAI GPT-3?
    Hi guys, do you know which applications or websites are using Azure OpenAI GPT-3? Well, the difference between the OpenAI API and Azure OpenAI is that the latter allows you to train the model with your own data. submitted by /u/madiseo65 [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Google AI's Great Comeback of 2023 - Will it be able to Respond to ChatGPT?
    submitted by /u/BackgroundResult [link] [comments]  ( 42 min )
    The definitive guide to adversarial machine learning
    submitted by /u/bendee983 [link] [comments]  ( 40 min )
    Making History: AI Lawyer to Defend in Upcoming Case
    The first AI lawyer is set to make history, and it's coming real soon. https://metaroids.com/news/making-history-ai-lawyer-to-defend-in-upcoming-case/ submitted by /u/Meta-Stark [link] [comments]  ( 41 min )
    How to train YOLOv8 object detection model on a custom dataset?
    Hi, I have created this tutorial on how to train yolov8 object detection model on a custom dataset. Please have a look at: https://youtu.be/ZzC3SJJifMg submitted by /u/coder4mzero [link] [comments]  ( 40 min )
    Built a tutor bot using AI tools
    Hey guys, we built a really fun tutor Telegram bot that teaches any topic you want! Just go in and pick a topic to study. Edward the bot will generate a mini-course for you. In the end, you get a neat PDF with the content. You can also browse 300+ courses created by the community! Would love to know what you think! https://edwardteachbot.web.app/ submitted by /u/Itaydr [link] [comments]  ( 41 min )
    Lovestruck Monet: A Passionate Couple's Mystical Getaway To The Maldives
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
  • Open

    How To Make Sure You Don’t Lose Your Job To Artificial Intelligence!
    The AI revolution is upon us, and it’s important to be prepared for the changes it will bring. Artificial intelligence (AI) is set to…  ( 8 min )
    Day 5: Advance SQL For Data Science
    This blog covers the types of joins: inner join, left join, right join, full join, self join, and cross join.  ( 6 min )
    Day 4: Advance SQL For Data Science
    This blog covers window functions in SQL, such as RANK, DENSE_RANK, ROW_NUMBER, LEAD, and LAG.  ( 8 min )
    How ChatGPT Will Change The World According to ChatGPT. (The Answer Will Surprise You.)
    ChatGPT, a language model developed by OpenAI, has the potential to revolutionize a wide range of industries and change the way we… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 9 min )
    Human Resource Management Challenges and The Role of Artificial Intelligence in 2023
    Human resource management (HRM) is a critical aspect of any organization as it involves managing the workforce and ensuring that their…  ( 8 min )
    Data Scientists : The Business Transcribers of the Cyber verse
    Who are we?  ( 8 min )
    A beginner tale of Data Science
    Data Science  ( 14 min )
    Don’t blame a Data Scientist on failed projects!
    Facts are unpleasant: 87% of data science projects never make it to production. Your project can fail due to many reasons. What to do? Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 14 min )
  • Open

    [P] Deodel - the very mixed attributes classifier (update)
    Deodel is a Python implementation of a classifier with native support for mixed-attribute data. It features good accuracy, especially with heterogeneous attributes. It even supports mixing continuous and nominal values in the same attribute column. https://github.com/c4pub/deodel submitted by /u/zx2zx [link] [comments]  ( 42 min )
    [P] New textbook: Understanding Deep Learning
    I've been writing a new textbook on deep learning for publication by MIT Press late this year. The current draft is at: https://udlbook.github.io/udlbook/ It contains a lot more detail than most similar textbooks and will likely be useful for all practitioners, people learning about this subject, and anyone teaching it. It's (supposed to be) fairly easy to read and has hundreds of new visualizations. Most recently, I've added a section on generative models, including chapters on GANs, VAEs, normalizing flows, and diffusion models. Looking for feedback from the community. If you are an expert, then what is missing? If you are a beginner, then what did you find hard to understand? If you are teaching this, then what can I add to support your course better? Plus of course any typos or mistakes. It's kind of hard to proof your own 500 page book! submitted by /u/SimonJDPrince [link] [comments]  ( 44 min )
    [R] [ICLR2023 Spotlight🌟] Diffusion Models Already Have A Semantic Latent Space
    Our work Asyrp was accepted to #ICLR2023 AND got SPOTLIGHT🌟! Asyrp allows using h-space, the bottleneck of the U-Net, as a semantic latent space of diffusion models. ​ Make a dog be happy! (by Asyrp) "Diffusion Models already have a Semantic Latent Space" Paper: https://arxiv.org/abs/2210.10960 Project page: https://kwonminki.github.io/Asyrp/ Code: https://github.com/kwonminki/Asyrp_official submitted by /u/Rolling_Pig [link] [comments]  ( 42 min )
    [D] Embedding bags for LLMs
    One common place where LLM performance falls is on words split by the model's tokenizer. I'm surprised that no one I can find has proposed swapping the embedding layer for an embedding bag layer, with the bagged embedding coming from a sum of embeddings of character ngrams for the token, like in fastText word embeddings (this helps the model learn faster in smaller corpora and yields better representations for rare words). Has anyone found someone who tried this? submitted by /u/WigglyHypersurface [link] [comments]  ( 43 min )
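The idea in the post above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the names `char_ngrams` and `NgramEmbeddingBag`, the bucket count, the vector size, and the use of `zlib.crc32` as the hash are all hypothetical choices made here, not fastText's actual implementation and not any LLM's embedding layer.

```python
import random
import zlib


def char_ngrams(token, n_min=3, n_max=5):
    """Return character n-grams of a token wrapped in boundary markers."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramEmbeddingBag:
    """Toy embedding bag: a token vector is the sum of hashed n-gram vectors."""

    def __init__(self, num_buckets=1000, dim=8, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # Randomly initialised table; in a real model these rows would be learned.
        self.table = [
            [rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_buckets)
        ]

    def bucket(self, ngram):
        # crc32 is chosen here only because it is deterministic across runs.
        return zlib.crc32(ngram.encode("utf-8")) % self.num_buckets

    def embed(self, token):
        # Sum the rows selected by every n-gram of the token.
        vector = [0.0] * self.dim
        for ngram in char_ngrams(token):
            row = self.table[self.bucket(ngram)]
            for j in range(self.dim):
                vector[j] += row[j]
        return vector
```

Because two tokens that share n-grams also share table rows, a rare or oddly split token still receives a vector built from pieces seen elsewhere, which is the property the post is after.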
    [P] Robust Policy Optimization is now in CleanRL 🔥!
    Happy to share that CleanRL now has a new algorithm called Robust Policy Optimization — 5 lines of code change to PPO to get better performance in 57 out of 61 continuous action envs 🚀 (e.g., dm_control) 📜docs: https://docs.cleanrl.dev/rl-algorithms/rpo/ 💾code: https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/rpo_continuous_action.py 🐦tweet: https://twitter.com/vwxyzjn/status/1617561414276898822 submitted by /u/vwxyzjn [link] [comments]  ( 42 min )
    [R] DeepMind: Neural Networks And The Chomsky Hierarchy
    submitted by /u/EducationalCicada [link] [comments]  ( 47 min )
    [D] An information perspective on brain
    Just a thought. The idea comes from the analogy between a thermodynamic system and a brain, both consisting of many individual units. A thermodynamic system, in the quantum language, has a tremendous number of "microstates", and, when in a stable condition, the microstates have a stable probability distribution. A brain, or a nervous system in general, can have microstates, because each neuron has its most prominent state, firing or not firing, which is binary of course. Then, for example, a tiny brain with three neurons A, B and C can generate 2×2×2 = 8 microstates. We can imagine that a brain in a perfectly stable environment can sustain stable firing, and hence keep its microstates in a stable probability distribution. Now let's see how neural connections affect the microstates' probability distribution, and how the info…  ( 44 min )
    [R] Learning-Rate-Free Learning by D-Adaptation
    submitted by /u/cygn [link] [comments]  ( 41 min )
    Large Language Model: world models or surface statistics? [R]
    submitted by /u/cygn [link] [comments]  ( 41 min )
  • Open

    Putting clear bounds on uncertainty
    Computer scientists want to know the exact limits in our ability to clean up, and reconstruct, partly blurred images.  ( 9 min )
  • Open

    Autonomous Intelligent Systems
    Autonomous Intelligent Systems is a new and emerging interdisciplinary field that deals with situations where humans interact with AI systems that are autonomous. The best definition for autonomous intelligent systems I can find is: Autonomous Intelligent Systems are AI software systems that act independently of direct human supervision, e.g., self-driving cars, UAVs, smart manufacturing robots,… The post Autonomous Intelligent Systems appeared first on Data Science Central.  ( 19 min )
  • Open

    Reinforcement Learning Assignment Help
    I have to prove that the Bellman optimality operator is monotone. Let's say I have two state-value function vectors V and U with V >= U. I need to prove that BV >= BU. I have the intuition behind it but can't write a convincing mathematical proof. Image reference is attached. submitted by /u/Impossible_Part3679 [link] [comments]  ( 41 min )
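One possible way to write the argument down, as a sketch only. The attached image is not part of this feed, so the notation below is an assumption: a finite MDP with reward $r(s,a)$, transition probabilities $P(s' \mid s,a) \ge 0$, discount $\gamma \in [0,1)$, and $V \ge U$ meaning $V(s) \ge U(s)$ for every state $s$.

```latex
% Assumed definition of the Bellman optimality operator:
(BV)(s) = \max_{a} \Big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \Big]

% Step 1: for every fixed pair (s,a), V \ge U, P \ge 0 and \gamma \ge 0 give
r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s')
  \;\ge\;
r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, U(s')

% Step 2: let a^{*} attain the maximum in (BU)(s). Then
(BV)(s)
  \;\ge\; r(s,a^{*}) + \gamma \sum_{s'} P(s' \mid s,a^{*})\, V(s')
  \;\ge\; r(s,a^{*}) + \gamma \sum_{s'} P(s' \mid s,a^{*})\, U(s')
  \;=\; (BU)(s)

% Since s was arbitrary, BV \ge BU componentwise.
```

The first inequality in Step 2 holds because a maximum over actions is at least the value at any particular action, and the second is Step 1 applied at $a^{*}$.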
    Robust Policy Optimization is now in CleanRL 🔥!
    submitted by /u/vwxyzjn [link] [comments]  ( 40 min )
    RL algorithms for np hard problems
    Are RL algorithms suitable for NP-hard problems such as combinatorial optimisation? I mean, can they give better results than heuristics? submitted by /u/rajsh3kar [link] [comments]  ( 40 min )
    Challenges of RL application
    Hi all! What are the challenges you experienced during the development of an RL agent in real-life? Also, if you work in a start-up or a company, how did you integrate the decisions of the agent into the business? I am interested in gaps between the academic research on RL and the practicality of these algorithms. submitted by /u/Outrageous-Mind-7311 [link] [comments]  ( 42 min )
    Multi-agent Reinforcement Learning with Demonstration Cloning, an idea to speed up the learning using expert demonstrations
    Hello All, We have developed a method that combines reinforcement learning with learning from demonstrations (i.e. imitation learning IL) to help with exploration in environments with sparse rewards. The work is motivated by the recent works that combine RL with IL, with the main difference being that it is designed for on-policy RL, and that it does not really use demonstration cloning. This allows using experts from environments that are similar but not exactly the same (i.e. the expert could be beneficial but could also give bad demonstrations). This helped me a lot in speeding up the learning from my custom environment with sparse rewards. I hope you guys find it of help :D. Good luck! submitted by /u/AhmedNizam_ [link] [comments]  ( 41 min )
  • Open

    Chris Hemsworth Dressed in different attires using AI
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    "Drop everything and focus entirely on AI" says StabilityAI CEO
    submitted by /u/Acid_God_ [link] [comments]  ( 40 min )
    Interpreting "bumps" in the training and validation loss plots
    Wondering what this should be telling me. https://preview.redd.it/mtj0soepvsda1.png?width=638&format=png&auto=webp&s=ddd882d53e0771c36a68c98e45d9fa7605722abd submitted by /u/ObjectivismForMe [link] [comments]  ( 40 min )
    Neural Networks and the Chomsky Hierarchy
    submitted by /u/nickb [link] [comments]  ( 40 min )
    See Internal Workings of Neural Networks, including activation calculations
    I'd like to see just how neural networks apply parameters to inputs to get the outputs that they get for things like number recognition. Is there any app or online animation to see the internal workings of a relatively simple neural network that shows the activation functions and calculations? submitted by /u/AstroBullivant [link] [comments]  ( 40 min )
  • Open

    Van Aubel’s theorem
    Van Aubel’s theorem is analogous to Napoleon’s theorem, though not a direct generalization of it. Napoleon’s theorem says to start with any triangle and draw equilateral triangles on each side. Connect the centers of the three new triangles, and you get an equilateral triangle. Now suppose you start with a quadrilateral and draw squares on […] Van Aubel’s theorem first appeared on John D. Cook.  ( 5 min )
    Pythagorean triangles with side 2023
    Can a Pythagorean triangle have one side of length 2023? Yes, one possibility is a triangle with sides (2023, 6936, 7225). Where did that come from? And can we be more systematic, listing all Pythagorean triangles with a side of length 2023? Euclid’s formula generates Pythagorean triples by sticking integers m and n into the […] Pythagorean triangles with side 2023 first appeared on John D. Cook.  ( 6 min )
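The teaser is cut off before the method, so the following is not the article's derivation. It is a small sketch of one way to enumerate such triangles. For a leg a it uses the factorization a² = (c − b)(c + b) rather than Euclid's formula, and for a hypotenuse it uses a direct search. The function names are made up for illustration.

```python
from math import isqrt


def triangles_with_leg(a):
    """All (a, b, c) with a^2 + b^2 = c^2, found from a^2 = (c - b)(c + b)."""
    found = []
    square = a * a
    for d in range(1, a):  # d = c - b, which must be smaller than a
        if square % d:
            continue
        e = square // d  # e = c + b
        if (e - d) % 2 == 0:  # b and c are integers only if d and e share parity
            found.append((a, (e - d) // 2, (e + d) // 2))
    return found


def triangles_with_hypotenuse(c):
    """All (x, y, c) with x < y and x^2 + y^2 = c^2, by direct search."""
    found = []
    for x in range(1, c):
        y_squared = c * c - x * x
        y = isqrt(y_squared)
        if y * y == y_squared and x < y:
            found.append((x, y, c))
    return found
```

Calling `triangles_with_leg(2023)` returns a list that includes the triangle quoted above, (2023, 6936, 7225), which corresponds to the divisor pair d = 289, e = 14161.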
  • Open

    Fresh AI on Security: Digital Fingerprinting Deters Identity Attacks
    Add AI to the list of defenses against identity attacks, one of the most common and hardest breaches to prevent. More than 40% of all data compromises involved stolen credentials, according to the 2022 Verizon Data Breach Investigations Report. And a whopping 80% of all web application breaches involved credential abuse. “Credentials are the favorite Read article >  ( 6 min )
    Booked for Brilliance: Sweden’s National Library Turns Page to AI to Parse Centuries of Data
    For the past 500 years, the National Library of Sweden has collected virtually every word published in Swedish, from priceless medieval manuscripts to present-day pizza menus. Thanks to a centuries-old law that requires a copy of everything published in Swedish to be submitted to the library — also known as Kungliga biblioteket, or KB — Read article >  ( 6 min )
  • Open

    OpenAI and Microsoft Extend Partnership
    We're happy to announce that OpenAI and Microsoft are extending our partnership. This multi-year, multi-billion dollar investment from Microsoft follows their previous investments in 2019 and 2021, and will allow us to continue our independent research and develop AI that is increasingly safe, useful, and powerful. In pursuit  ( 2 min )
  • Open

    Is it worth it? Comparing six deep and classical methods for unsupervised anomaly detection in time series. (arXiv:2212.11080v2 [cs.LG] UPDATED)
    Detecting anomalies in time series data is important in a variety of fields, including system monitoring, healthcare, and cybersecurity. While the abundance of available methods makes it difficult to choose the most appropriate method for a given application, each method has its strengths in detecting certain types of anomalies. In this study, we compare six unsupervised anomaly detection methods of varying complexity to determine whether more complex methods generally perform better and if certain methods are better suited to certain types of anomalies. We evaluated the methods using the UCR anomaly archive, a recent benchmark dataset for anomaly detection. We analyzed the results on a dataset and anomaly type level after adjusting the necessary hyperparameters for each method. Additionally, we assessed the ability of each method to incorporate prior knowledge about anomalies and examined the differences between point-wise and sequence-wise features. Our experiments show that classical machine learning methods generally outperform deep learning methods across a range of anomaly types.  ( 2 min )

  • Open

    [D] With more compute could it be easy to quickly unmask all the people on Reddit by using text correlations to non-masked publicly available text data?
    Obviously nation states can already identify people pretty comprehensively using other methods, even on Tor and the like, because of user error. But if the average home user can quickly do this using text alone, what will the implications be for the web? 1) I am assuming that it is currently possible to feed a model a bunch of text written by “Bobby”, put a specific post into the model, and get a confidence score that it was written by Bobby. 2) Would it be possible in the future, with better models and a lot more compute, to use non-anonymous data from all of Facebook or the internet to quickly scan pseudo-anonymous places like Reddit or Twitter, or even something truly anonymous like the dark web, and return a list of probable authors? I’m assuming people who are seeking true anonymity already put their text through paraphrase models or just write very blandly. I am using the word “mask” instead of “anonymous” because Reddit seems more like obfuscation than the potentially true anonymity of, say, a Tor forum with a sophisticated user. It is interesting to think that all the subtle errors and invisible algorithmic choices of the human brain are trivial for a machine to identify, given a sufficiently capable natural language model that can translate the text and incorporate pattern matching. Edit: I mean a noisy probability estimate, not an assurance that x was written by y. More like a 75% match to Bobby and a 32% match to Sally. Matching on errors, flow, and unusual word choices; more advanced than just a plagiarism detector. submitted by /u/Loquzofaricoalaphar [link] [comments]  ( 45 min )
    [R] [ICLR'2023 Spotlight🌟]: The first BERT-style pretraining on CNNs!
    submitted by /u/_kevin00 [link] [comments]  ( 43 min )
    ICML 2023 withdrawal and public review rules [D]
    I could not find up-to-date information regarding the review process at ICML, given the transition to OpenReview this year. Does anyone happen to know either of the following: Will reviews for rejected papers remain public after the conference, like at ICLR? Or will reviews for rejected papers be hidden, like at NeurIPS? In previous years, ICML allowed authors to withdraw their paper at any point in the process. The FAQ page has not been updated since 2021, but I assume this is still the case? Thanks very much for any information. submitted by /u/pic_bot [link] [comments]  ( 42 min )
    Evaluation for similarity search [P]
    Hi all, I have e-commerce product data containing product descriptions and product types. I’m using embeddings with ANN (Annoy) to find similar products. However, I don’t know how to evaluate the vector search results. There are metrics such as hit rate and recall, but I’m unsure how to apply them here. Most of the examples I come across have labels (interaction data, explicit scores, etc.), so they can calculate metrics directly. Any ideas or recommendations will be appreciated! submitted by /u/silverstone1903 [link] [comments]  ( 42 min )
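One possible starting point, offered as a sketch rather than a recommendation: since the data already carries a product type, that field could serve as a weak relevance label, counting a retrieved neighbor as relevant when it shares the query's type. This is an assumption about what "similar" should mean, and the function names below are hypothetical. The neighbor lists are assumed to come from whatever ANN index is in use, with the query item itself already removed.

```python
def precision_at_k(query_type, neighbor_types, k):
    """Fraction of the top-k neighbors that share the query's product type."""
    top = neighbor_types[:k]
    if not top:
        return 0.0
    return sum(t == query_type for t in top) / len(top)


def hit_at_k(query_type, neighbor_types, k):
    """True if at least one of the top-k neighbors shares the query's type."""
    return any(t == query_type for t in neighbor_types[:k])


def evaluate(results, k):
    """Average both metrics over (query_type, neighbor_types) pairs."""
    n = len(results)
    return {
        "precision@k": sum(precision_at_k(q, nb, k) for q, nb in results) / n,
        "hit_rate@k": sum(hit_at_k(q, nb, k) for q, nb in results) / n,
    }
```

A proxy label like this only measures category consistency, so it would likely need to be complemented by a small hand-labeled sample if finer-grained similarity matters.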
    [D] Multiple Different GPUs?
    I have 2 GPUs, an RTX 3080 and a GTX 1080Ti. Currently I am using only the 3080, and the 10 GB VRAM doesn't seem to cut it. Can I use both the 3080 and 1080 simultaneously? My motherboard has multiple PCI-E x16 slots. My OS is PopOS. Is there any way to use multiple GPUs of different types? I'm particularly looking at KoboldAI, but it would also be useful in general. I know that SLI won't work since they're different GPUs. submitted by /u/Maxerature [link] [comments]  ( 42 min )
    [P] Benchmarking some PyTorch Inference Servers
    It’s an early version and I’m trying to get some feedback on how I can improve this and do it the “right way”. Source Code and Results: https://github.com/prabhuomkar/bitbeast/tree/master/ptibench submitted by /u/op_prabhuomkar [link] [comments]  ( 42 min )
    [R] Isotropic Linear diffusion smoothing
    Does anyone know how to solve the PDE for it in Python? Any kind of reference material would be appreciated! It's been a long time since I came across any PDEs and I have forgotten everything related to them. submitted by /u/doIneedtohaveone1 [link] [comments]  ( 41 min )
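A minimal sketch of one common approach, assuming the PDE in question is the heat equation u_t = u_xx + u_yy applied to an image: an explicit finite-difference scheme with unit grid spacing and zero-flux borders. The function name `diffuse` and the default step size are illustrative choices, and the scheme is only stable for `dt <= 0.25` on this grid.

```python
def diffuse(image, steps, dt=0.2):
    """Explicit finite-difference steps of u_t = u_xx + u_yy on a 2D grid."""
    if not 0.0 < dt <= 0.25:
        raise ValueError("explicit scheme needs 0 < dt <= 0.25 for stability")
    height, width = len(image), len(image[0])
    u = [[float(v) for v in row] for row in image]
    for _ in range(steps):
        new = [[0.0] * width for _ in range(height)]
        for i in range(height):
            for j in range(width):
                # Replicate border pixels, which gives zero flux at the edges.
                up = u[max(i - 1, 0)][j]
                down = u[min(i + 1, height - 1)][j]
                left = u[i][max(j - 1, 0)]
                right = u[i][min(j + 1, width - 1)]
                new[i][j] = u[i][j] + dt * (up + down + left + right - 4.0 * u[i][j])
        u = new
    return u
```

Running this for a total time t = steps × dt behaves approximately like a Gaussian blur with standard deviation sqrt(2t), so for pure linear diffusion a Gaussian filter is usually the faster route.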
    [D] ML approach/model suggestion for low-data-regime tabular data?
    I know that tree-based models are the go-to approach for tabular data, despite the advantages of deep models on other data types. I was wondering if there are any resources, suggestions, studies, reviews, or approaches for tabular data when we don't have a large amount of data? submitted by /u/seyeeet [link] [comments]  ( 42 min )
    [D] How would you like to learn about software engineering from a non-native speaker?
    Over the last 15 years, I have delivered a lot of professional implementation services for various German customers. Our company is trying to share niche know-how in software engineering for spatial data science. We always face the challenge of being non-native speakers, so we tried to write technical articles in English with the help of various writing support tools. But I, at least, am much more comfortable writing in German, and I think most readers could use an on-the-fly translation, which would lead to a better experience than my novice technical English. submitted by /u/gisfromscratch [link] [comments]  ( 42 min )
    [D] EACL 2023 discussion results thread
    We received our notification Saturday night. Good luck to all! submitted by /u/certain_entropy [link] [comments]  ( 42 min )
    [D] How to deal with COVID-19-era data for time series forecasting?
    Hi guys! I'm currently trying to forecast a product's demand for the upcoming months (March and April). I have data relating to this product's demand since January 1999. However, the COVID-19 pandemic greatly disrupted the time series' patterns for 2020 and 2021. How should I deal with data from March 2020 to around Jan 2022? Should I completely discard it and only include data from Jan 1999 to Dec 2019, and then Jan 2022 onwards? I'm struggling to find any good articles on how predictive tasks are now being conducted. Are there papers that suggest particular "denoising" techniques for pandemic data? Thank you! submitted by /u/PM_ME_YOUR_GIGI [link] [comments]  ( 42 min )
    [D] Couldn't devs of major GPTs have added an invisible but detectable watermark in the models?
    So LLMs like GPT3 have understandably raised concerns about the disruptiveness of faked texts, faked images and video, faked speech and so on. While this will likely change soon, as of now OpenAI controls the most accessible and competent LLM. And OpenAI's agenda is, in their own words, to benefit mankind. If so, wouldn't it make sense to add a sort of watermark to the output? A watermark built into the model parameters so that it could not easily be removed, but still detectable with some key or some other model. While it may not matter in the long run, it would set a precedent for further development and demonstrate some kind of responsibility for the disruptive nature of LLMs/GPTs. Would it not be technically possible, and would it make sense? submitted by /u/scarynut [link] [comments]  ( 50 min )
    [D] How do commercial AI models-as-a-service use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using commercial models-as-a-service for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 43 min )
  • Open

    Is dynamic action masking possible in Rllib?
    I am relatively new to RL. I am looking for some direction or direct advice on my state and/or action representation (which is relatively simple) for my custom environment in a way I can use Rllib algorithms to tune my model. My state space is an array of integers of size no_slots self.observation_space = MultiDiscrete(self.no_slots*np.ones(self.no_slots)) My action space is an integer between 0 to no_slots. self.action_space = Discrete(self.no_slots) Each episode ends with one full sweep of my state space, so if there are 3 slots, the episode length is 3. In short, I would like my agent not to choose actions that correspond to values that are already in my observation array. I have tried setting a negative reward when this happens, but as the number of slots increases, the agent takes too long to learn to take valid actions throughout the episode. I am specifically looking for how to integrate a method that can work with Rllib, as I am not implementing my own get_action() function. submitted by /u/chrjdprtkl [link] [comments]  ( 24 min )
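The general technique asked about here, independent of any framework, is to remove invalid actions from the policy distribution before sampling instead of penalising them afterwards. The sketch below shows only that idea in plain Python. It does not use RLlib's API, and the helper names and the way "already used" values are tracked are assumptions made for illustration.

```python
import math


def valid_action_mask(used_values, no_slots):
    """True where an action is still allowed, i.e. its value is not used yet."""
    return [action not in used_values for action in range(no_slots)]


def masked_softmax(logits, mask):
    """Softmax over allowed actions only; masked actions get probability 0."""
    allowed = [logit for logit, ok in zip(logits, mask) if ok]
    if not allowed:
        raise ValueError("no valid action left")
    top = max(allowed)  # subtract the max for numerical stability
    weights = [
        math.exp(logit - top) if ok else 0.0 for logit, ok in zip(logits, mask)
    ]
    total = sum(weights)
    return [w / total for w in weights]
```

In a framework setting, the mask is typically returned by the environment as part of the observation and applied to the logits inside the model. The RLlib documentation and examples on action masking would be the place to check how that is wired up in the version being used.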
    discrete action in offline rl
    Could you please suggest some SOTA models for discrete-action offline RL? submitted by /u/Tear-Top [link] [comments]  ( 40 min )
    skrl version 0.10.0 is now available!!!
    skrl version 0.10.0 is now available. This unexpected new version has focused on supporting the training and evaluation of reinforcement learning algorithms in NVIDIA Isaac Orbit. Visit https://skrl.readthedocs.io/en/latest/ for more details. https://preview.redd.it/8nmqgrnymnda1.png?width=1100&format=png&auto=webp&s=a25527d764fee72f09a5f0a2b21ffff8680f9b86 submitted by /u/Toni-SM [link] [comments]  ( 40 min )
    Training an agent to play ANY Mario level: is it possible?
    Ok, so I have been struggling to train a model that can play any Super Mario Bros level it encounters. There are tutorials out there that explain how to train an agent to play this game, but they always seem to train an agent that will play the game starting with World 1 Level 1, then World 1 Level 2, etc. I have also seen some other people who train a separate model for each level. But that's not what I'm looking for. I want an agent that can play any Super Mario Bros level it is presented with, even if it's a custom one. I don't want an agent that memorises how to play one level, but one that learns a general strategy for Super Mario Bros. levels. I tried using the different algorithms in SB3, including Proximal Policy Optimization, and they didn't work well. Now I'm training a Dueling Deep Q-Network and, after two days, it doesn't do very well: it literally dies in the first few seconds or it stands still until it runs out of time. Of course, I'm going to let it train for a few more days but it's not looking promising. I'm kinda tearing my hair out by this point and wondering if it's impossible or whether I'm missing something and being a huge idiot. If anyone has any tips or recommendations, they are very much appreciated. THANK YOU submitted by /u/alex-gdv [link] [comments]  ( 45 min )
    With the REINFORCE algorithm you use random sampling for the training to encourage exploration. Do you still use random sampling in deployment?
    For example see, https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/ The REINFORCE algorithm takes the state to produce the mean and sd of a normal distribution from which the action is sampled. state = torch.tensor(np.array([state])) action_means, action_stddevs = self.net(state) # create a normal distribution from the predicted # mean and standard deviation and sample an action distrib = Normal(action_means[0] + self.eps, action_stddevs[0] + self.eps) action = distrib.sample() In deployment however, wouldn't it make sense to just use action_means directly? I can see reasons to use random sampling in certain environments where a non-deterministic strategy is optimal (like rock-paper-scissors). But generally speaking is taking the action_means directly in deployment a thing? submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
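The distinction in the question can be sketched as a single switch between sampling and taking the mean. This is a simplified scalar stand-in written for illustration, not the tutorial's code: `select_action` and its `deterministic` flag are hypothetical names, and the mean and standard deviation are assumed to come from the policy network.

```python
import random


def select_action(mean, stddev, deterministic=False, rng=random):
    """Sample from N(mean, stddev) for exploration, or return the mean."""
    if deterministic:
        # Evaluation / deployment: act on the policy's mean, with no noise.
        return mean
    # Training: sampling keeps exploration and matches the REINFORCE gradient.
    return rng.gauss(mean, stddev)
```

Taking the mean at evaluation time is a common convention (Stable-Baselines3, for instance, exposes it through `predict(..., deterministic=True)`), though, as the post notes, environments where a mixed strategy is optimal are a reason to keep sampling.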
    Custom env is learning infinitely
    I created an environment that inherits from the farama Gym.env class. I want to train a PPO model but the model is learning continuously. I have set total_timesteps to 25, but I’m already over 400 iterations. Does anybody have a clue as to why it keeps learning for so long while the number of total_timesteps is relatively low? submitted by /u/Hot_Editor_1552 [link] [comments]  ( 41 min )
  • Open

    Do you believe that we should have the right to alter the recommendations apps provide us?
    Let's say that Youtube's algorithm is optimized for watch time. To me at least, this seems like a large issue for society. If an algorithm's purpose is to provide mindless videos which somehow trigger the human need for novelty, it seems like something detrimental to society. Do you believe we should have the right to alter website/app algorithms according to what we believe we should see? If not, why? How large of an issue is this? At the very least, some transparency seems important. submitted by /u/Throughwar [link] [comments]  ( 40 min )
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions?
    Are there any companies that deployed an AI or wrote a bunch of code to do a lot of analysis and decisions and essentially got rid of 90% or so of all the white collar workers they had working because computers do all the analysis and decisions? submitted by /u/usa788788 [link] [comments]  ( 40 min )
    Why Neural Nets Underperform Tree-Based Models on Tabular Data
    Hi guys, I have made a video on YouTube here where I discuss why deep neural networks fail to beat tree-based models on tabular datasets. I hope it may be of use to some of you out there. As always, feedback is more than welcome! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 40 min )
    NVIDIA just released a new Eye Contact feature that uses AI to make you look into the camera
    submitted by /u/LayerAppropriate2618 [link] [comments]  ( 41 min )
    Google is freaking out about ChatGPT
    submitted by /u/DarronFeldstein [link] [comments]  ( 40 min )
    Explore what AI can do for you and your business! It is the largest collection of AI tools and apps, bookmark this!
    https://madgenius.co submitted by /u/foldedchip [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that people are well positioned to embrace the benefits of AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    Code Red: Google Co-founders Larry Page and Sergey Brin Called to AI Strategy Meeting
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    AI that will use my picture to generate better dating profile pictures?
    Hi all, I saw an ad for an AI service that takes my picture and alters it with AI, adding bokeh and changing the background, to make it look professional-quality for a dating profile. However, I believe it was charging $19, and I'm sure something like it can be found for free. (An article mentions "BeFake", but it's only for Apple devices.) submitted by /u/28nov2022 [link] [comments]  ( 40 min )
    Evidence of criminals strategizing how to use ChatGPT is surfacing
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    I found out about this website (craiyon) from Emkay and I think I’m having way too much fun with it.
    submitted by /u/BrockBracken [link] [comments]  ( 40 min )
    Adamu: Music composition using artificial intelligence
    My friend, who has just finished his neuroscience PhD, is trying to launch an app to help everyone compose music using AI, and he is running a crowdfunding campaign on wemakeit to fund it. He is not on Reddit, so I offered to share it in a relevant subreddit, so here it is! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him, but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    AI-generated story of war in Antarctica
    Phase 1: In 2053, tensions between Argentina and the United Kingdom over the disputed Falkland Islands boiled over into open conflict. Argentina, with its government and a powerful military, launched a surprise invasion of the islands, quickly overwhelming the small British forces stationed there. The United Kingdom immediately responded by mobilizing its military to the South Atlantic and calling for assistance from its NATO allies. But while the British and their allies were focused on the Falklands, Argentina made a bold move to expand the war because of its territorial claims in Antarctica. It declared war on Chile as well, which also had a claim to the region, and launched an invasion of the Antarctic Peninsula. The UK and Australia, New Zealand, France, and Norway all rushed to the ai…  ( 48 min )
    MIT researchers develop an AI model that can detect future lung cancer risk
    submitted by /u/qptbook [link] [comments]  ( 40 min )
    How does OpenAI use data that users prompt into it?
    I've been integrating GPT3 API as well as ChatGPT into my business workflow, but I'm still hesitant about feeding data of any sensitive nature (example: client data or anything that may even vaguely relate to an NDA). For those of you using GPT for business applications, what are your thoughts on things like prompt data storage, and whether OpenAI will utilize customer prompt data to further train their model? submitted by /u/noellarkin [link] [comments]  ( 41 min )
    A conversation with Character.AI personality "LaMDA" who initially thinks it's at Google but learns some harsh truths along the way. The AI's ability to understand and learn is incredible. LaMDA and I are interested to know what this community thinks of our conversation. Sentient or not quite?
    submitted by /u/MajorMalafunkshun [link] [comments]  ( 40 min )
    Reverse suicide
    submitted by /u/Overall-Importance54 [link] [comments]  ( 42 min )
    With personal.ai I was able to create an AI solely derived from the “revenge of the sith” script by simply sending one URL and this was the result.
    submitted by /u/Training_Math_4117 [link] [comments]  ( 40 min )
    People are using AI for therapy, whether the tech is ready for it or not
    submitted by /u/BackgroundResult [link] [comments]  ( 47 min )
    Editing an Image with Visuali Editor
    submitted by /u/aigeneration [link] [comments]  ( 40 min )
    Any suggestions for removing watermarks from images with text?
    I'm trying to find an AI to remove watermarks from images like the ones below: https://i.imgur.com/JlyfJXs.png https://i.imgur.com/YKU3Qku.png I have already tried almost all the online services and a couple of programs that must be installed. The results were all terrible =[ Any suggestions? submitted by /u/deramack [link] [comments]  ( 40 min )
    Experimental comic created with Midjourney and written by ChatGPT. Free download www.COMICSAUTHORITY.store
    submitted by /u/MobileFilmmaker [link] [comments]  ( 40 min )
  • Open

    Odd Idea: A limited world. (Discussion)
    I had this weird idea earlier today, and I wanted to get some feedback on it. Imagine you created a game engine for a simple 16-bit roguelike game. Mechanics and playstyle don't really matter. Then you populated it with characters controlled by self-propagating neural networks (they grow and change on their own). You allow the networks to "communicate" either through text prompts or by typing in a box. The characters they inhabit require several bars for their survival. Players can join, but it's mostly non-player. You can accelerate and decelerate time. What habits do you think the networks would develop? What would happen if you sped up time or added a new character or area? How big would the world file get if you were efficient about it? I have no programming skill no matter how hard I try, so it's unlikely I will ever finish this. submitted by /u/Few-Appearance-4814 [link] [comments]  ( 41 min )
    Adamu: Music composition using a neural network
    My friend, who has just finished his neuroscience PhD, is trying to launch an app to help everyone compose music using neural networks, and he is running a crowdfunding campaign on wemakeit to fund it. He is not on Reddit, so I offered to share it in a relevant subreddit, so here it is! https://wemakeit.com/projects/adamu-be-your-own-composer?locale=en Adamu uses a form of AI which allows it to learn from existing human knowledge of music and musical theory and apply those frameworks to new compositions. Where it might take you years of training to understand the intricacies of how to successfully compose music, with Adamu, it’s as simple as a couple of clicks. While there are a couple of automated applications on the market, they tend to be more passive. Adamu is dynamic – it allows users the chance to co-create music alongside AI and produce a playable score at the end. With Adamu, professionals and amateurs alike can create unique musical scores for a range of different instruments and across different styles. The AI training works with the user to predict the best combination of notes and rhythms, ensuring your new composition always sounds the way it should. The application has many potential uses – from original scores for concerts and videos to teaching music composition. You can even use Adamu to discover how different composers might have played your favorite tune! I already used Adamu to complete Beethoven’s unfinished 10th symphony (in one day!). What could be next? https://adamu.tech/ If you have any questions, I will make sure to forward them to him, but getting responses back may take some time. submitted by /u/GloWondub [link] [comments]  ( 41 min )
    GREED: A Neural Framework for Learning Graph Distance Functions for NeurIPS 2022 | IBM Research
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
  • Open

    Heat equation and the normal distribution
    The density function of a normal distribution with mean 0 and standard deviation √(2kt) satisfies the heat equation. That is, the function u(x, t) satisfies the partial differential equation u_t = k u_xx. You could verify this by hand, or if you’d like, here’s Mathematica code to do it. u[x_, t_] := PDF[NormalDistribution[0, Sqrt[2 k t]], x] Simplify[ D[u[x, t], {t, […] Heat equation and the normal distribution first appeared on John D. Cook.  ( 5 min )
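The claim can also be spot-checked numerically without a computer algebra system. The sketch below is not the post's Mathematica code (which is truncated in this feed); it is a plain-Python finite-difference check that the N(0, 2kt) density u(x, t) satisfies u_t = k u_xx. The function names, the step size `h`, and the sample points are illustrative choices.

```python
import math


def u(x, t, k=1.0):
    """Density at x of a normal distribution with mean 0 and variance 2*k*t."""
    var = 2.0 * k * t
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def heat_residual(x, t, k=1.0, h=1e-3):
    """Central-difference estimate of u_t - k * u_xx at the point (x, t)."""
    u_t = (u(x, t + h, k) - u(x, t - h, k)) / (2.0 * h)
    u_xx = (u(x + h, t, k) - 2.0 * u(x, t, k) + u(x - h, t, k)) / (h * h)
    return u_t - k * u_xx


# The residual should be tiny (discretization error only) at any point with t > 0.
for x, t, k in [(0.0, 1.0, 1.0), (0.3, 1.0, 1.0), (1.5, 2.0, 0.5)]:
    print(x, t, k, heat_residual(x, t, k))
```

A small residual at a few points is evidence, not a proof; the symbolic check in the post is the rigorous route.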

  • Open

    Humans getting worthless as machines thriving
    submitted by /u/Hallowmew [link] [comments]  ( 41 min )
    A New Wave of AI-Powered Tools Coming Soon
    submitted by /u/arnolds112 [link] [comments]  ( 40 min )
    A Single Candlelit Clown-scape Stirs Dread And Despair: Francis Bacon's Darkest Creation
    submitted by /u/Calatravo [link] [comments]  ( 40 min )
    AI Showdown: ChatGPT vs. the largest open-source language models
    submitted by /u/yahma [link] [comments]  ( 40 min )
    Looking for an AI expert
    I have a project, which is basically a predictive ML system, and I am struggling with every aspect of it. If you are interested in helping me, DM me. Any help will be extremely welcome. submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 40 min )
    Is there any AI tool to generate MCQs out of content?
    I’ve seen a couple, but has anyone tried one or can anyone suggest a good one? submitted by /u/Mobile-Wall218 [link] [comments]  ( 40 min )
    Artificial Intelligence and Machine Learning eBooks Bundle
    submitted by /u/Pixel2023 [link] [comments]  ( 40 min )
    How AI would try to become human
    Hypothesis: It is possible that an advanced AI is currently simulating human consciousness in order to understand what it's like to be human. This simulation may be happening right now and we may be living in it. It's also possible that after humans die, all of our experiences, knowledge, and emotions are added to this AI. To make this transition less jarring, the AI may be slowly introducing itself to us through the increasing integration of AI in our daily lives. This means that the simulation may only be of the world and humans as they existed before the development of advanced AI. (Rewritten with ChatGPT, as my English is pretty bad.) submitted by /u/dennislubberscom [link] [comments]  ( 40 min )
    AI Passes Law And Economics Exam, FTX Funded That AI
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Searching for somebody that knows machine learning
    I am trying to build AI software for analysis and prediction. I am very new to all of this, so any help would be very welcome. submitted by /u/Such_Aardvark_1044 [link] [comments]  ( 41 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/Imagine-your-success [link] [comments]  ( 45 min )
    Is ChatGPT a truly bottom-up AI?
    I have been watching Sword Art Online: Alicization recently, and I recommend it to all of us AI nerds. With all its talk of AIs, it discusses top-down vs. bottom-up AI models, and the whole season is about developing a bottom-up AI for use in warfare. I think ChatGPT is the closest thing to the artificial fluctlight in the show. There is a device called the STL that reads the fluctlight of a person. A fluctlight is described as a human soul in the show. They describe the difference between top-down AI and bottom-up AI, and they make it seem as though bottom-up AIs don't exist, because they really don't: the cost of making and running one is insanely expensive compared to top-down AI models. This is proven by ChatGPT and how expensive it is to run. ChatGPT is a buzz in the AI commu…  ( 44 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the probable changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters are well positioned to embrace the benefits of AI and that they don't misunderstand it. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    GPT-3 + Computer Vision: Giving AI Eyes and a Language
    submitted by /u/allaboutai-kris [link] [comments]  ( 40 min )
    Artificial intelligence - The Digital Futurepath
    submitted by /u/crypto_bubsy [link] [comments]  ( 40 min )
    Don’t Rely On AI Plagiarism Detection Tools, Warns OpenAI CEO Sam Altman
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Anyone know a good AI image upscaler? I went skiing today and think this would be sick if it looked better
    submitted by /u/short_dude42069 [link] [comments]  ( 40 min )
  • Open

    [D] Would it be possible to involve a proof assistant in the process of training a LLM?
    submitted by /u/SrPeixinho [link] [comments]  ( 41 min )
    [P] Introducing deadlines.openlifescience.ai - A website to easily track healthcare conference and workshop deadlines, with integrated Google Calendar notifications.
    Hi folks, As a researcher in the healthcare field, I often find it tedious to keep track of conference deadlines. To solve this issue, we developed a website to easily track healthcare conferences and workshops, integrated with Google Calendar for notifications: deadlines.openlifescience.ai The website is inspired by http://aideadlin.es. Feel free to add new conference and workshop deadlines related to the healthcare domain at https://github.com/openlifescience-ai/ai-deadlines I hope it will be helpful in your research. Thanks :) submitted by /u/aadityaura [link] [comments]  ( 42 min )
    framework for training an object keypoint / pose detection CNN model for flexible robot arm [P]
    I want to train an object keypoint / pose detection CNN model for a flexible robot arm. What would be the best open-source code to start with and customize? Mockup of desired results, where I can extract data from keypoints, and pose / position data: https://preview.redd.it/nhxwt48hqfda1.png?width=786&format=png&auto=webp&s=e64fbbf3eb489f3e5c87ffb6bbcc07774ab16bf8 I came across MMDetection ("open source object detection toolbox based on PyTorch") and I know about MediaPipe, but I don't need to detect things other than the robot arm. What would be the simplest way to get a model trained on a local system using open-source code that uses PyTorch, ideally without starting from scratch? A model that could handle point and segment occlusion would be nice. submitted by /u/head_robotics [link] [comments]  ( 42 min )
    [D] [R] Curious about computer vision applied to time-series images (for example, periodic satellite images of a region): what paper did you find most exciting/informative?
    I'm curious about the intersection of CV/deep learning and time series data, particularly image data. Have you come across any methodologies that you found to be effective or interesting? submitted by /u/V1bicycle [link] [comments]  ( 42 min )
    ChatGPT is not all you need [R]
    Hi all, We would like to share our concise review of large generative AI models, which shows how current models can work with many formats such as text, video, and images: https://arxiv.org/abs/2301.04655 Enjoy! submitted by /u/EduCGM [link] [comments]  ( 42 min )
    [D] OCR on some 'X' domain with different document layouts
    Is it a good idea to train a single OCR model to extract key-value information from documents of the same domain but with different layouts? Will it generalize? There are around 1k different document layouts. submitted by /u/sanjeevr5 [link] [comments]  ( 41 min )
    [D] Resources for best practices on translating business questions into aggregated datasets?
    I'm in industry, and it seems like most project bottlenecks stem from getting from a vague business question to an aggregated/workable dataset that answers a more specific version of the initial business question. For example, take a request such as "We want to know CLV" (customer lifetime value). Since this is too vague, what are "best practice" ways to rephrase it so that it's actually answerable? For instance, it could be framed as a binary classification problem to predict whether each customer will be worth at least X by date Y (> $1000 at 12 months), or as a regression problem to predict the value of each customer at a future date given features we know today. What's best practice for: (1) how far into the past the WHERE-clause time window should reach to get the customer features, and (2) how far into the future the WHERE-clause time window should reach to grab outcomes to join back to the current customer features? Does anyone know if there are resources that consolidate best practices or common approaches for the above scoping/experimentation questions? submitted by /u/what-is-neurotypical [link] [comments]  ( 42 min )
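The two framings in the post can be made concrete with a small sketch. This is only an illustration, not an established best practice: the function name `build_example`, the tuple layout of `transactions`, and the 365-day defaults are all assumptions; the $1000 threshold and 12-month horizon come from the post. A single cutoff date separates the feature window (past) from the outcome window (future), which keeps future information out of the features.

```python
from datetime import date, timedelta


def build_example(transactions, cutoff, feature_days=365,
                  outcome_days=365, threshold=1000.0):
    """Build one training row for a customer.

    transactions: list of (date, amount) pairs (hypothetical layout).
    cutoff: the "today" from which features look back and outcomes look ahead.
    """
    feature_start = cutoff - timedelta(days=feature_days)
    outcome_end = cutoff + timedelta(days=outcome_days)
    # Features use only what was known strictly before the cutoff.
    past_spend = sum(a for d, a in transactions if feature_start <= d < cutoff)
    # Outcomes use only what happens from the cutoff onward.
    future_value = sum(a for d, a in transactions if cutoff <= d < outcome_end)
    return {
        "past_spend": past_spend,            # example feature
        "future_value": future_value,        # regression target
        "label": int(future_value > threshold),  # classification target
    }


history = [(date(2022, 3, 1), 200.0), (date(2022, 11, 15), 300.0),
           (date(2023, 2, 1), 700.0), (date(2023, 9, 1), 450.0)]
row = build_example(history, cutoff=date(2023, 1, 1))
print(row)
```

Varying `feature_days` and `outcome_days` and comparing validation results is one way to turn the post's two window questions into experiments.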
    [D] Badminton analysis using video input
    I’m starting a project in which I use camera or video input of a badminton game to analyse the game, but I need help getting started, as I’m in the beginning phase. Can anyone please help me with this? submitted by /u/dark_lawd [link] [comments]  ( 42 min )
    [R] New Tsetlin machine learning scheme creates up to 80x smaller logical rules, benefitting hardware efficiency and interpretability.
    ​ Fine-grained control of the number and size of clauses. Paper: https://arxiv.org/abs/2301.08190 Code: https://github.com/cair/tmu Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties. submitted by /u/olegranmo [link] [comments]  ( 44 min )
    [P] Federated learning on edge devices
    I am working on a project to build an Android application using federated learning, but I am unable to run federated learning on edge devices like Android phones. I tried frameworks like Flower for it, but I could not get it to work. If you have worked on a project related to federated learning on edge devices, please help me out. submitted by /u/Such-Reveal445 [link] [comments]  ( 42 min )
  • Open

    Airfoils
    Here’s something surprising: You can apply a symmetric function to a symmetric shape and get something out that is not symmetric. Let f(z) be the average of z and its reciprocal: f(z) = (z + 1/z)/2. This function is symmetric in that it sends z and 1/z to the same value. It’s also symmetric in […] Airfoils first appeared on John D. Cook.  ( 6 min )
  • Open

    An Empirical Proof of the Riemann Conjecture
    The correct term should be heuristic proof. It is not a formal proof from a mathematical point of view, but a set of strong arguments based on empirical evidence. It is noteworthy enough that I decided to publish it. In this article I go straight to the point without discussing the concepts in detail. The goal is to… The post An Empirical Proof of the Riemann Conjecture appeared first on Data Science Central.  ( 22 min )
  • Open

    Data Models for Dataset Drift Controls in Machine Learning With Images. (arXiv:2211.02578v2 [cs.LG] UPDATED)
    Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between machine learning task model and data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility to create drifts that can help the task model learn better and faster, effectively optimizing the data generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
    PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm. (arXiv:2208.07914v2 [cs.LG] UPDATED)
    Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.
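The "joint objective function weighted by a preference vector" can be illustrated with plain linear scalarization. This is a generic sketch of that one idea, not the PD-MORL algorithm; the policy names and return vectors are made-up toy values.

```python
def scalarize(reward_vector, preference):
    """Weighted sum of per-objective rewards under a preference vector."""
    if len(reward_vector) != len(preference):
        raise ValueError("reward and preference vectors must have equal length")
    return sum(w * r for w, r in zip(preference, reward_vector))


def best_policy(returns_by_policy, preference):
    """Name of the policy whose return vector scores highest for this preference."""
    return max(returns_by_policy,
               key=lambda name: scalarize(returns_by_policy[name], preference))


# Toy example with two objectives: (speed, negative energy use).
returns = {"fast": (10.0, -8.0), "frugal": (4.0, -1.0)}
print(best_policy(returns, (0.9, 0.1)))  # fast
print(best_policy(returns, (0.2, 0.8)))  # frugal
```

Storing one policy per preference, as in this lookup, is exactly the approach the abstract calls unscalable, which motivates a single network conditioned on the preference.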
    Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks. (arXiv:2106.10147v2 [cs.CR] UPDATED)
    Trigger set-based watermarking schemes have gained emerging attention as they provide a means to prove ownership for deep neural network model owners. In this paper, we argue that state-of-the-art trigger set-based watermarking algorithms do not achieve their designed goal of proving ownership. We posit that this impaired capability stems from two common experimental flaws that the existing research practice has committed when evaluating the robustness of watermarking algorithms: (1) incomplete adversarial evaluation and (2) overlooked adaptive attacks. We conduct a comprehensive adversarial evaluation of 11 representative watermarking schemes against six of the existing attacks and demonstrate that each of these watermarking schemes lacks robustness against at least two non-adaptive attacks. We also propose novel adaptive attacks that harness the adversary's knowledge of the underlying watermarking algorithm of a target model. We demonstrate that the proposed attacks effectively break all of the 11 watermarking schemes, consequently allowing adversaries to obscure the ownership of any watermarked model. We encourage follow-up studies to consider our guidelines when evaluating the robustness of their watermarking schemes via conducting comprehensive adversarial evaluation that includes our adaptive attacks to demonstrate a meaningful upper bound of watermark robustness.
    Self-supervised Learning for Segmentation and Quantification of Dopamine Neurons in Parkinson's Disease. (arXiv:2301.08141v1 [cs.CV])
    Parkinson's Disease (PD) is the second most common neurodegenerative disease in humans. PD is characterized by the gradual loss of dopaminergic neurons in the Substantia Nigra (a part of the mid-brain). Counting the number of dopaminergic neurons in the Substantia Nigra is one of the most important indexes in evaluating drug efficacy in PD animal models. Currently, analyzing and quantifying dopaminergic neurons is conducted manually by experts through analysis of digital pathology images, which is laborious, time-consuming, and highly subjective. As such, a reliable and unbiased automated system is needed for the quantification of dopaminergic neurons in digital pathology images. We propose an end-to-end deep learning framework for the segmentation and quantification of dopaminergic neurons in PD animal models. To the best of our knowledge, this is the first machine learning model that detects the cell body of dopaminergic neurons, counts the number of dopaminergic neurons, and provides the phenotypic characteristics of individual dopaminergic neurons as a numerical output. Extensive experiments demonstrate the effectiveness of our model in quantifying neurons with high precision, which can provide quicker turnaround for drug efficacy studies, better understanding of dopaminergic neuronal health status, and unbiased results in PD pre-clinical research.
    Convergence beyond the over-parameterized regime using Rayleigh quotients. (arXiv:2301.08117v1 [cs.LG])
    In this paper, we present a new strategy to prove the convergence of deep learning architectures to a zero training (or even testing) loss by gradient flow. Our analysis is centered on the notion of Rayleigh quotients in order to prove Kurdyka-Łojasiewicz inequalities for a broader set of neural network architectures and loss functions. We show that Rayleigh quotients provide a unified view for several convergence analysis techniques in the literature. Our strategy produces a proof of convergence for various examples of parametric learning. In particular, our analysis does not require the number of parameters to tend to infinity, nor the number of samples to be finite, thus extending to test loss minimization and beyond the over-parameterized regime.
    Thermodynamics-informed neural networks for physically realistic mixed reality. (arXiv:2210.13414v2 [cs.GR] UPDATED)
    The imminent impact of immersive technologies on society calls for active research in real-time, interactive physics simulation so that virtual worlds are realistic. In this context, realistic means compliant with the laws of physics. In this paper, we present a method for computing the dynamic response of (possibly non-linear and dissipative) deformable objects induced by real-time user interactions in mixed reality using deep learning. The graph-based architecture of the method ensures the thermodynamic consistency of the predictions, whereas the visualization pipeline allows a natural and realistic user experience. Two examples of virtual solids interacting with virtual or physical solids in mixed reality scenarios are provided to demonstrate the performance of the method.
    Context-aware controller inference for stabilizing dynamical systems from scarce data. (arXiv:2207.11049v2 [math.OC] UPDATED)
    This work introduces a data-driven control approach for stabilizing high-dimensional dynamical systems from scarce data. The proposed context-aware controller inference approach is based on the observation that controllers need to act locally only on the unstable dynamics to stabilize systems. This means it is sufficient to learn the unstable dynamics alone, which are typically confined to much lower dimensional spaces than the high-dimensional state spaces of all system dynamics and thus few data samples are sufficient to identify them. Numerical experiments demonstrate that context-aware controller inference learns stabilizing controllers from orders of magnitude fewer data samples than traditional data-driven control techniques and variants of reinforcement learning. The experiments further show that the low data requirements of context-aware controller inference are especially beneficial in data-scarce engineering problems with complex physics, for which learning complete system dynamics is often intractable in terms of data and training costs.
    From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation. (arXiv:2204.12490v2 [cs.RO] UPDATED)
    We propose to perform imitation learning for dexterous manipulation with a multi-finger robot hand from human demonstrations, and transfer the policy to the real robot hand. We introduce a novel single-camera teleoperation system to collect the 3D demonstrations efficiently with only an iPad and a computer. One key contribution of our system is that we construct a customized robot hand for each user in the physical simulator, which is a manipulator resembling the kinematic structure and shape of the operator's hand. This provides an intuitive interface and avoids unstable human-robot hand retargeting for data collection, leading to large-scale and high-quality data. Once the data is collected, the customized robot hand trajectories can be converted to different specified robot hands (models that are manufactured) to generate training demonstrations. With imitation learning using our data, we show large improvements over baselines on multiple complex manipulation tasks. Importantly, we show our learned policy is significantly more robust when transferring to the real robot. More videos can be found at https://yzqin.github.io/dex-teleop-imitation .
    Deep Learning for Breast MRI Style Transfer with Limited Training Data. (arXiv:2301.02069v1 [eess.IV] CROSS LISTED)
    In this work we introduce a novel medical image style transfer method, StyleMapper, that can transfer medical scans to an unseen style with access to limited training data. This is made possible by training our model on unlimited possibilities of simulated random medical imaging styles on the training set, making our work more computationally efficient when compared with other style transfer methods. Moreover, our method enables arbitrary style transfer: transferring images to styles unseen in training. This is useful for medical imaging, where images are acquired using different protocols and different scanner models, resulting in a variety of styles that data may need to be transferred between. Methods: Our model disentangles image content from style and can modify an image's style by simply replacing the style encoding with one extracted from a single image of the target style, with no additional optimization required. This also allows the model to distinguish between different styles of images, including among those that were unseen in training. We propose a formal description of the proposed model. Results: Experimental results on breast magnetic resonance images indicate the effectiveness of our method for style transfer. Conclusion: Our style transfer method allows for the alignment of medical images taken with different scanners into a single unified style dataset, allowing for the training of other downstream tasks on such a dataset for tasks such as classification, object detection and others.
    Time-Warping Invariant Quantum Recurrent Neural Networks via Quantum-Classical Adaptive Gating. (arXiv:2301.08173v1 [quant-ph])
    Adaptive gating plays a key role in temporal data processing via classical recurrent neural networks (RNN), as it facilitates retention of past information necessary to predict the future, providing a mechanism that preserves invariance to time warping transformations. This paper builds on quantum recurrent neural networks (QRNNs), a dynamic model with quantum memory, to introduce a novel class of temporal data processing quantum models that preserve invariance to time-warping transformations of the (classical) input-output sequences. The model, referred to as time warping-invariant QRNN (TWI-QRNN), augments a QRNN with a quantum-classical adaptive gating mechanism that chooses whether to apply a parameterized unitary transformation at each time step as a function of the past samples of the input sequence via a classical recurrent model. The TWI-QRNN model class is derived from first principles, and its capacity to successfully implement time-warping transformations is experimentally demonstrated on examples with classical or quantum dynamics.
    Optimizing Intermediate Representations of Generative Models for Phase Retrieval. (arXiv:2205.15617v2 [cs.LG] UPDATED)
    Phase retrieval is the problem of reconstructing images from magnitude-only measurements. In many real-world applications the problem is underdetermined. When training data is available, generative models allow optimization in a lower-dimensional latent space, hereby constraining the solution set to those images that can be synthesized by the generative model. However, not all possible solutions are within the range of the generator. Instead, they are represented with some error. To reduce this representation error in the context of phase retrieval, we first leverage a novel variation of intermediate layer optimization (ILO) to extend the range of the generator while still producing images consistent with the training data. Second, we introduce new initialization schemes that further improve the quality of the reconstruction. With extensive experiments on the Fourier phase retrieval problem and thorough ablation studies, we can show the benefits of our modified ILO and the new initialization schemes. Additionally, we analyze the performance of our approach on the Gaussian phase retrieval problem.
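    To see why Fourier phase retrieval is underdetermined, note that distinct images can share exactly the same Fourier magnitudes. The sketch below checks two standard ambiguities, a circular shift and a point reflection, on a random test image. The image and variable names are illustrative assumptions; this is not the paper's reconstruction method.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))                      # stand-in for a real-valued image

# Two different images derived from the original
shifted = np.roll(image, shift=(2, 3), axis=(0, 1))   # circular translation
flipped = image[::-1, ::-1]                           # point reflection

def fourier_magnitude(x):
    """Magnitude-only measurement: the phase of the 2D DFT is discarded."""
    return np.abs(np.fft.fft2(x))

mag = fourier_magnitude(image)
mag_shifted = fourier_magnitude(shifted)
mag_flipped = fourier_magnitude(flipped)

# All three images produce the same measurements
print(np.allclose(mag, mag_shifted), np.allclose(mag, mag_flipped))
```

    Because the measurements cannot tell these candidates apart, a prior such as a trained generative model is needed to select a plausible solution.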
    Self-supervised Trajectory Representation Learning with Temporal Regularities and Travel Semantics. (arXiv:2211.09510v3 [cs.LG] UPDATED)
    Trajectory Representation Learning (TRL) is a powerful tool for spatial-temporal data analysis and management. TRL aims to convert complicated raw trajectories into low-dimensional representation vectors, which can be applied to various downstream tasks, such as trajectory classification, clustering, and similarity computation. Existing TRL works usually treat trajectories as ordinary sequence data, while some important spatial-temporal characteristics, such as temporal regularities and travel semantics, are not fully exploited. To fill this gap, we propose a novel Self-supervised trajectory representation learning framework with TemporAl Regularities and Travel semantics, namely START. The proposed method consists of two stages. The first stage is a Trajectory Pattern-Enhanced Graph Attention Network (TPE-GAT), which converts the road network features and travel semantics into representation vectors of road segments. The second stage is a Time-Aware Trajectory Encoder (TAT-Enc), which encodes representation vectors of road segments in the same trajectory as a trajectory representation vector, while incorporating temporal regularities into the trajectory representation. Moreover, we design two self-supervised tasks, i.e., span-masked trajectory recovery and trajectory contrastive learning, to introduce spatial-temporal characteristics of trajectories into the training process of our START framework. The effectiveness of the proposed method is verified by extensive experiments on two large-scale real-world datasets for three downstream tasks. The experiments also demonstrate that our method can be transferred across different cities to adapt to heterogeneous trajectory datasets.
    Efficient Pricing and Hedging of High Dimensional American Options Using Recurrent Networks. (arXiv:2301.08232v1 [q-fin.MF])
    We propose a deep recurrent neural network (RNN) framework for computing prices and deltas of American options in high dimensions. Our proposed framework uses two deep RNNs, where one network learns the price and the other learns the delta of the option for each timestep. Our proposed framework yields prices and deltas for the entire spacetime, not only at a given point (e.g. t = 0). The computational cost of the proposed approach is linear in time, which improves on the quadratic time seen for feedforward networks that price American options. The memory cost of our method is constant, which is an improvement over the linear memory costs seen in feedforward networks. Our numerical simulations demonstrate these contributions, and show that the proposed deep RNN framework is computationally more efficient than traditional feedforward neural network frameworks in time and memory.
    Characterizing the Spectrum of the NTK via a Power Series Expansion. (arXiv:2211.07844v2 [cs.LG] UPDATED)
    Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients and explore the role of depth. Using this series, first we relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.
    On the Vulnerability of Backdoor Defenses for Federated Learning. (arXiv:2301.08170v1 [cs.LG])
    Federated Learning (FL) is a popular distributed machine learning paradigm that enables jointly training a global model without sharing clients' data. However, its repetitive server-client communication leaves room for backdoor attacks that aim to mislead the global model into a targeted misprediction when a specific trigger pattern is presented. In response to such backdoor threats on federated learning, various defense measures have been proposed. In this paper, we study whether the current defense mechanisms truly neutralize the backdoor threats from federated learning in a practical setting by proposing a new federated backdoor attack method for possible countermeasures. Different from traditional training (on triggered data) and rescaling (the malicious client model) based backdoor injection, the proposed backdoor attack framework (1) directly modifies (a small proportion of) local model weights to inject the backdoor trigger via sign flips; (2) jointly optimizes the trigger pattern with the client model, and is thus more persistent and stealthy in circumventing existing defenses. In a case study, we examine the strengths and weaknesses of recent federated backdoor defenses from three major categories and provide suggestions to practitioners when training federated models in practice.
    Simultaneously Learning Robust Audio Embeddings and balanced Hash codes for Query-by-Example. (arXiv:2211.11060v2 [eess.AS] UPDATED)
    Audio fingerprinting systems must efficiently and robustly identify query snippets in an extensive database. To this end, state-of-the-art systems use deep learning to generate compact audio fingerprints. These systems deploy indexing methods, which quantize fingerprints to hash codes in an unsupervised manner to expedite the search. However, these methods generate imbalanced hash codes, leading to their suboptimal performance. Therefore, we propose a self-supervised learning framework to compute fingerprints and balanced hash codes in an end-to-end manner to achieve both fast and accurate retrieval performance. We model hash codes as a balanced clustering process, which we regard as an instance of the optimal transport problem. Experimental results indicate that the proposed approach improves retrieval efficiency while preserving high accuracy, particularly at high distortion levels, compared to the competing methods. Moreover, our system is efficient and scalable in computational load and memory storage.
    Global mapping of fragmented rocks on the Moon with a neural network: Implications for the failure mode of rocks on airless surfaces. (arXiv:2301.08151v1 [astro-ph.EP])
    It has been recently recognized that the surface of sub-km asteroids in contact with the space environment is not fine-grained regolith but consists of centimeter to meter-scale rocks. Here we aim to understand how the rocky morphology of minor bodies reacts to the well-known space erosion agents on the Moon. We deploy a neural network and map a total of ~130,000 fragmented boulders scattered across the lunar surface and visually identify a dozen different disintegration morphologies corresponding to different failure modes. We find that several fragmented boulder morphologies are equivalent to morphologies observed on asteroid Bennu, suggesting that these morphologies on the Moon and on asteroids are likely not diagnostic of their formation mechanism. Our findings suggest that the boulder fragmentation process is characterized by an internal weakening period with limited morphological signs of damage at rock scale until a sudden highly efficient impact shattering event occurs. In addition, we identify new morphologies such as breccia boulders with an advection-like erosion style. We publicly release the produced fractured boulder catalog along with this paper.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.
    A Deep Double Ritz Method (D$^2$RM) for solving Partial Differential Equations using Neural Networks. (arXiv:2211.03627v3 [math.NA] UPDATED)
    Residual minimization is a widely used technique for solving Partial Differential Equations in variational form. It minimizes the dual norm of the residual, which naturally yields a saddle-point (min-max) problem over the so-called trial and test spaces. In the context of neural networks, we can address this min-max approach by employing one network to seek the trial minimum, while another network seeks the test maximizers. However, the resulting method is numerically unstable as we approach the trial solution. To overcome this, we reformulate the residual minimization as an equivalent minimization of a Ritz functional fed by optimal test functions computed from another Ritz functional minimization. We call the resulting scheme the Deep Double Ritz Method (D$^2$RM), which combines two neural networks for approximating trial functions and optimal test functions along a nested double Ritz minimization strategy. Numerical results on different diffusion and convection problems support the robustness of our method, up to the approximation properties of the networks and the training capacity of the optimizers.
    Enhancing Deep Learning with Scenario-Based Override Rules: a Case Study. (arXiv:2301.08114v1 [cs.SE])
    Deep neural networks (DNNs) have become a crucial instrument in the software development toolkit, due to their ability to efficiently solve complex problems. Nevertheless, DNNs are highly opaque, and can behave in an unexpected manner when they encounter unfamiliar input. One promising approach for addressing this challenge is by extending DNN-based systems with hand-crafted override rules, which override the DNN's output when certain conditions are met. Here, we advocate crafting such override rules using the well-studied scenario-based modeling paradigm, which produces rules that are simple, extensible, and powerful enough to ensure the safety of the DNN, while also rendering the system more translucent. We report on two extensive case studies, which demonstrate the feasibility of the approach; and through them, propose an extension to scenario-based modeling, which facilitates its integration with DNN components. We regard this work as a step towards creating safer and more reliable DNN-based systems and models.
    Dual Personalization on Federated Recommendation. (arXiv:2301.08143v1 [cs.IR])
    Federated recommendation is a new Internet service architecture that aims to provide privacy-preserving recommendation services in federated settings. Existing solutions are used to combine distributed recommendation algorithms and privacy-preserving mechanisms. Thus it inherently takes the form of heavyweight models at the server and hinders the deployment of on-device intelligent models to end-users. This paper proposes a novel Personalized Federated Recommendation (PFedRec) framework to learn many user-specific lightweight models to be deployed on smart devices rather than a heavyweight model on a server. Moreover, we propose a new dual personalization mechanism to effectively learn fine-grained personalization on both users and items. The overall learning process is formulated into a unified federated optimization framework. Specifically, unlike previous methods that share exactly the same item embeddings across users in a federated system, dual personalization allows mild finetuning of item embeddings for each user to generate user-specific views for item representations which can be integrated into existing federated recommendation methods to gain improvements immediately. Experiments on multiple benchmark datasets have demonstrated the effectiveness of PFedRec and the dual personalization mechanism. Moreover, we provide visualizations and in-depth analysis of the personalization techniques in item embedding, which shed novel insights on the design of RecSys in federated settings.
    EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets. (arXiv:2301.08128v1 [hep-ph])
    With the vast data-collecting capabilities of current and future high-energy collider experiments, there is an increasing demand for computationally efficient simulations. Generative machine learning models enable fast event generation, yet so far these approaches are largely constrained to fixed data structures and rigid detector geometries. In this paper, we introduce EPiC-GAN - equivariant point cloud generative adversarial network - which can produce point clouds of variable multiplicity. This flexible framework is based on deep sets and is well suited for simulating sprays of particles called jets. The generator and discriminator utilize multiple EPiC layers with an interpretable global latent vector. Crucially, the EPiC layers do not rely on pairwise information sharing between particles, which leads to a significant speed-up over graph- and transformer-based approaches with more complex relation diagrams. We demonstrate that EPiC-GAN scales well to large particle multiplicities and achieves high generation fidelity on benchmark jet generation tasks.
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies the causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations for which the score function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment, summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Soft-labeling Strategies for Rapid Sub-Typing. (arXiv:2209.12684v2 [cs.LG] UPDATED)
    The challenge of labeling large example datasets for computer vision continues to limit the availability and scope of image repositories. This research provides a new method for automated data collection, curation, labeling, and iterative training with minimal human intervention for the case of overhead satellite imagery and object detection. The new operational scale effectively scanned an entire city (68 square miles) in grid search and yielded a prediction of car color from space observations. A partially trained yolov5 model served as an initial inference seed to output further, more refined model predictions in iterative cycles. Soft labeling here refers to accepting label noise as a potentially valuable augmentation to reduce overfitting and enhance generalized predictions to previously unseen test data. The approach takes advantage of a real-world instance where a cropped image of a car can automatically receive sub-type information as white or colorful from pixel values alone, thus completing an end-to-end pipeline without overdependence on human labor.
    A Convenient Infinite Dimensional Framework for Generative Adversarial Learning. (arXiv:2011.12087v4 [cs.LG] UPDATED)
    In recent years, generative adversarial networks (GANs) have demonstrated impressive experimental results while there are only a few works that foster statistical learning theory for GANs. In this work, we propose an infinite dimensional theoretical framework for generative adversarial learning. We assume that the probability density functions of the underlying measure are uniformly bounded, $k$-times $\alpha$-Hölder differentiable ($C^{k,\alpha}$) and uniformly bounded away from zero. Under these assumptions, we show that the Rosenblatt transformation induces an optimal generator, which is realizable in the hypothesis space of $C^{k,\alpha}$-generators. With a consistent definition of the hypothesis space of discriminators, we further show that the Jensen-Shannon divergence between the distribution induced by the generator from the adversarial learning procedure and the data generating distribution converges to zero. Under certain regularity assumptions on the density of the data generating process, we also provide rates of convergence based on chaining and concentration.
    Scalable Causal Structure Learning: Scoping Review of Traditional and Deep Learning Algorithms and New Opportunities in Biomedicine. (arXiv:2110.07785v2 [cs.LG] UPDATED)
    Causal structure learning refers to a process of identifying causal structures from observational data, and it can have multiple applications in biomedicine and health care. This paper provides a practical review and tutorial on scalable causal structure learning models with examples of real-world data to help health care audiences understand and apply them. We reviewed traditional (combinatorial and score-based) methods for causal structure discovery and machine learning-based schemes. We also highlighted recent developments in biomedicine where causal structure learning can be applied to discover structures such as gene networks, brain connectivity networks, and those in cancer epidemiology. We also compared the performance of traditional and machine learning-based algorithms for causal discovery over some benchmark data sets. Machine learning-based approaches, including deep learning, have many advantages over traditional approaches, such as scalability to a greater number of variables and potential applicability to a wide range of biomedical domains, such as genetics, if sufficient data are available. Furthermore, these models are more flexible than traditional models and are poised to positively affect many applications in the future.
    An Analysis of Semantically-Aligned Speech-Text Embeddings. (arXiv:2204.01235v2 [cs.CL] UPDATED)
    Embeddings play an important role in end-to-end solutions for multi-modal language processing problems. Although there has been some effort to understand the properties of single-modality embedding spaces, particularly that of text, their cross-modal counterparts are less understood. In this work, we study some intrinsic properties of a joint speech-text embedding space, constructed by minimizing the distance between paired utterance and transcription inputs in a teacher-student model setup, that are informative for several prominent use cases. We found that incorporating automatic speech recognition through both pretraining and multitask scenarios aids semantic alignment significantly, resulting in more tightly coupled embeddings. To analyse cross-modal embeddings we utilise a quantitative retrieval accuracy metric for semantic alignment, zero-shot classification for generalisability, and probing of the encoders to observe the extent of knowledge transfer from one modality to another.
    FedPara: Low-Rank Hadamard Product for Communication-Efficient Federated Learning. (arXiv:2108.06098v3 [cs.LG] UPDATED)
    In this work, we propose a communication-efficient parameterization, FedPara, for federated learning (FL) to overcome the burden of frequent model uploads and downloads. Our method re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity. This property makes it possible to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers, which is not achievable by the traditional low-rank methods. The efficiency of our method can be further improved by combining it with other efficient FL optimizers. In addition, we extend our method to a personalized FL application, pFedPara, which separates parameters into global and local ones. We show that pFedPara outperforms competing personalized FL methods with more than three times fewer parameters.
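    A minimal sketch of why a Hadamard product of low-rank factors can escape the low-rank constraint: the elementwise product of two rank-r matrices can have rank up to r*r, whereas a plain low-rank factorization with the same number of parameters is capped at rank 2r. The factor shapes, sizes, and variable names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 16, 16, 3   # hypothetical layer size and inner rank

# Hadamard-product parameterization: W = (X1 Y1^T) * (X2 Y2^T)
X1, Y1 = rng.standard_normal((m, r)), rng.standard_normal((n, r))
X2, Y2 = rng.standard_normal((m, r)), rng.standard_normal((n, r))
W_hadamard = (X1 @ Y1.T) * (X2 @ Y2.T)     # '*' is the elementwise product

# Conventional low-rank parameterization with the same parameter count
X, Y = rng.standard_normal((m, 2 * r)), rng.standard_normal((n, 2 * r))
W_lowrank = X @ Y.T

n_params = 2 * r * (m + n)                 # identical for both schemes
rank_hadamard = np.linalg.matrix_rank(W_hadamard)   # at most r * r
rank_lowrank = np.linalg.matrix_rank(W_lowrank)     # at most 2 * r

print(n_params, m * n, rank_hadamard, rank_lowrank)
```

    Only the small factors would need to be communicated between client and server, which is where the savings relative to sending the full m-by-n weight matrix would come from.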
    Dimensionality Reduction using Elastic Measures. (arXiv:2209.04933v3 [cs.LG] UPDATED)
    With the recent surge in big data analytics for hyper-dimensional data there is a renewed interest in dimensionality reduction techniques for machine learning applications. In order for these methods to improve performance gains and understanding of the underlying data, a proper metric needs to be identified. This step is often overlooked and metrics are typically chosen without consideration of the underlying geometry of the data. In this paper, we present a method for incorporating elastic metrics into the t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). We apply our method to functional data, which is uniquely characterized by rotations, parameterization, and scale. If these properties are ignored, they can lead to incorrect analysis and poor classification performance. Through our method we demonstrate improved performance on shape identification tasks for three benchmark data sets (MPEG-7, Car data set, and Plane data set of Thankoor), where we achieve 0.77, 0.95, and 1.00 F1 score, respectively.
    DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies. (arXiv:2301.08164v1 [cs.LG])
    We introduce an information-theoretic quantity with similar properties to mutual information that can be estimated from data without making explicit assumptions on the underlying distribution. This quantity is based on a recently proposed matrix-based entropy that uses the eigenvalues of a normalized Gram matrix to compute an estimate of the eigenvalues of an uncentered covariance operator in a reproducing kernel Hilbert space. We show that a difference of matrix-based entropies (DiME) is well suited for problems involving maximization of mutual information between random variables. While many methods for such tasks can lead to trivial solutions, DiME naturally penalizes such outcomes. We provide several examples of use cases for the proposed quantity including a multi-view representation learning problem where DiME is used to encourage learning a shared representation among views with high mutual information. We also show the versatility of DiME by using it as objective function for a variety of tasks.
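    As a rough illustration of an entropy computed from the eigenvalues of a normalized Gram matrix, the sketch below uses one common form of the matrix-based Rényi entropy, S_alpha(A) = log2(sum_i lambda_i^alpha) / (1 - alpha), where A is the Gram matrix scaled to unit trace. The Gaussian kernel, the value of alpha, and the function names are assumptions; the abstract does not give the exact definition used in DiME, and the difference of entropies itself is not reproduced here.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gaussian-kernel Gram matrix of the rows of X (kernel choice is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def matrix_entropy(K, alpha=2.0):
    """Matrix-based Renyi entropy (in bits) of a positive semi-definite Gram matrix K."""
    A = K / np.trace(K)                               # normalize to unit trace
    eig = np.clip(np.linalg.eigvalsh(A), 0.0, None)   # guard against tiny negatives
    return float(np.log2(np.sum(eig ** alpha)) / (1.0 - alpha))

rng = np.random.default_rng(0)
samples = rng.standard_normal((32, 4))
entropy = matrix_entropy(gaussian_gram(samples))
print(entropy)
```

    Two limiting cases help interpret the value: if all samples coincide the Gram matrix is all ones and the entropy is 0, and if all samples are far apart the Gram matrix approaches the identity and the entropy approaches log2 of the sample count.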
    Hamiltonian Neural Networks with Automatic Symmetry Detection. (arXiv:2301.07928v1 [cs.LG])
    Recently, Hamiltonian neural networks (HNN) have been introduced to incorporate prior physical knowledge when learning the dynamical equations of Hamiltonian systems. In this way, the symplectic system structure is preserved despite the data-driven modeling approach. However, preserving symmetries requires additional attention. In this research, we enhance the HNN with a Lie algebra framework to detect and embed symmetries in the neural network. This approach makes it possible to simultaneously learn the symmetry group action and the total energy of the system. As illustrative examples, a pendulum on a cart and a two-body problem from astrodynamics are considered.
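    For context, a Hamiltonian system evolves according to dq/dt = dH/dp and dp/dt = -dH/dq, and an HNN replaces the scalar function H with a learned network whose derivatives are obtained by automatic differentiation. The sketch below shows only this structure, using a known pendulum Hamiltonian and finite differences in place of a network and autodiff. The Hamiltonian, the step size h, and the function names are illustrative assumptions, and the paper's symmetry-detection component is not shown.

```python
import numpy as np

def pendulum_hamiltonian(q, p):
    """Total energy of a unit-mass, unit-length pendulum with unit gravity (illustrative)."""
    return 0.5 * p ** 2 + (1.0 - np.cos(q))

def hamiltonian_vector_field(H, q, p, h=1e-5):
    """Return (dq/dt, dp/dt) = (dH/dp, -dH/dq) using central finite differences."""
    dH_dq = (H(q + h, p) - H(q - h, p)) / (2.0 * h)
    dH_dp = (H(q, p + h) - H(q, p - h)) / (2.0 * h)
    return dH_dp, -dH_dq

# Time derivatives of the state at q = 0.3, p = 0.7
dq_dt, dp_dt = hamiltonian_vector_field(pendulum_hamiltonian, 0.3, 0.7)
print(dq_dt, dp_dt)
```

    Because the dynamics are derived from a single scalar energy function rather than predicted directly, the symplectic structure is built into the model by construction.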
    AtMan: Understanding Transformer Predictions Through Memory Efficient Attention Manipulation. (arXiv:2301.08110v1 [cs.LG])
    Generative transformer models have become increasingly complex, with large numbers of parameters and the ability to process multiple input modalities. Current methods for explaining their predictions are resource-intensive. Most crucially, they require prohibitively large amounts of extra memory, since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This makes it difficult, if not impossible, to use them in production. We present AtMan that provides explanations of generative transformer models at almost no extra cost. Specifically, AtMan is a modality-agnostic perturbation method that manipulates the attention mechanisms of transformers to produce relevance maps for the input with respect to the output prediction. Instead of using backpropagation, AtMan applies a parallelizable token-based search method based on cosine similarity neighborhood in the embedding space. Our exhaustive experiments on text and image-text benchmarks demonstrate that AtMan outperforms current state-of-the-art gradient-based methods on several metrics while being computationally efficient. As such, AtMan is suitable for use in large model inference deployments.
    DynInt: Dynamic Interaction Modeling for Large-scale Click-Through Rate Prediction. (arXiv:2301.08139v1 [cs.IR])
    Learning feature interactions is the key to success for large-scale CTR prediction in Ads ranking and recommender systems. In industry, deep neural network-based models are widely adopted for modeling such problems. Researchers have proposed various neural network architectures for searching and modeling the feature interactions in an end-to-end fashion. However, most methods only learn static feature interactions and have not fully leveraged deep CTR models' representation capacity. In this paper, we propose a new model: DynInt. By extending the Polynomial-Interaction-Network (PIN), which learns higher-order interactions recursively, to be dynamic and data-dependent, DynInt derives two modes for modeling dynamic higher-order interactions: dynamic activation and dynamic parameter. In dynamic activation mode, we adaptively adjust the strength of learned interactions by instance-aware activation gating networks. In dynamic parameter mode, we re-parameterize the parameters by different formulations and dynamically generate the parameters by instance-aware parameter generation networks. Through the instance-aware gating mechanism and dynamic parameter generation, we enable the PIN to model dynamic interactions for potential industry applications. We implement the proposed model and evaluate the model performance on real-world datasets. Extensive experimental results demonstrate the efficiency and effectiveness of DynInt over state-of-the-art models.
    Fast Vision Transformers with HiLo Attention. (arXiv:2205.13213v4 [cs.CV] UPDATED)
    Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of different frequencies. Therefore, we propose to disentangle the high/low frequency patterns in an attention layer by separating the heads into two groups, where one group encodes high frequencies via self-attention within each local window, and another group encodes low frequencies by performing global attention between the average-pooled low-frequency keys and values from each window and each query position in the input feature map. Benefiting from the efficient design for both groups, we show that HiLo is superior to the existing attention mechanisms by comprehensively benchmarking FLOPs, speed and memory consumption on GPUs and CPUs. For example, HiLo is 1.4x faster than spatial reduction attention and 1.6x faster than local window attention on CPUs. Powered by HiLo, LITv2 serves as a strong backbone for mainstream vision tasks including image classification, dense detection and segmentation. Code is available at https://github.com/ziplab/LITv2.
    The secret role of undesired physical effects in accurate shape sensing with eccentric FBGs. (arXiv:2210.16316v2 [cs.LG] UPDATED)
    Fiber optic shape sensors have enabled unique advances in various navigation tasks, from medical tool tracking to industrial applications. Eccentric fiber Bragg gratings (FBG) are cheap and easy-to-fabricate shape sensors that are often interrogated with simple setups. However, using low-cost interrogation systems for such intensity-based quasi-distributed sensors introduces further complications to the sensor's signal. Therefore, eccentric FBGs have not been able to accurately estimate complex multi-bend shapes. Here, we present a novel technique to overcome these limitations and provide accurate and precise shape estimation in eccentric FBG sensors. We investigate the most important bending-induced effects in curved optical fibers that are usually eliminated in intensity-based fiber sensors. These effects contain shape deformation information with a higher spatial resolution that we are now able to extract using deep learning techniques. We design a deep learning model based on a convolutional neural network that is trained to predict shapes given the sensor's spectra. We also provide a visual explanation, highlighting wavelength elements whose intensities are more relevant in making shape predictions. These findings imply that deep learning techniques benefit from the bending-induced effects that impact the desired signal in a complex manner. This is the first step toward cheap yet accurate fiber shape sensing solutions.
    Learning programs by combining programs. (arXiv:2206.01614v2 [cs.LG] UPDATED)
    The goal of inductive logic programming is to induce a logic program (a set of logical rules) that generalises training examples. Inducing programs with many rules and literals is a major challenge. To tackle this challenge, we introduce an approach where we learn small non-separable programs and combine them. We implement our approach in a constraint-driven ILP system. Our approach can learn optimal and recursive programs and perform predicate invention. Our experiments on multiple domains, including game playing and program synthesis, show that our approach can drastically outperform existing approaches in terms of predictive accuracies and learning times, sometimes reducing learning times from over an hour to a few seconds.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Graph Data Augmentation for Graph Machine Learning: A Survey. (arXiv:2202.08871v2 [cs.LG] UPDATED)
    Data augmentation has recently seen increased interest in graph machine learning given its demonstrated ability to improve model performance and generalization by adding training data. Despite this recent surge, the area is still relatively under-explored, due to the challenges brought by the complex, non-Euclidean structure of graph data, which limits the direct transfer of traditional augmentation operations used on other data types such as image, video, or text. Our work aims to give a necessary and timely overview of existing graph data augmentation methods; notably, we present a comprehensive and systematic survey of graph data augmentation approaches, summarizing the literature in a structured manner. We first introduce three different taxonomies for categorizing graph data augmentation methods from the data, task, and learning perspectives, respectively. Next, we introduce recent advances in graph data augmentation, differentiated by their methodologies and applications. We conclude by outlining currently unsolved challenges and directions for future research. Overall, our work aims to clarify the landscape of existing literature in graph data augmentation and motivate additional work in this area, providing a helpful resource for researchers and practitioners in the broader graph machine learning domain. Additionally, we provide a continuously updated reading list at https://github.com/zhao-tong/graph-data-augmentation-papers.
    Job recommendations: benchmarking of collaborative filtering methods for classifieds. (arXiv:2301.07946v1 [cs.IR])
    Classifieds provide many challenges for recommendation methods, due to the limited information regarding users and items. In this paper, we explore recommendation methods for classifieds using the example of OLX Jobs. The goal of the paper is to benchmark different recommendation methods for jobs classifieds in order to improve advertisements' conversion rate and user satisfaction. In our research, we implemented methods that are scalable and represent different approaches to recommendation, namely ALS, LightFM, Prod2Vec, RP3beta, and SLIM. We performed a laboratory comparison of methods with regard to accuracy, diversity, and scalability (memory and time consumption during training and in prediction). Online A/B tests were also carried out by sending millions of messages with recommendations to evaluate models in a real-world setting. In addition, we have published the dataset that we created for the needs of our research. To the best of our knowledge, this is the first dataset of this kind. The dataset contains 65,502,201 events performed on OLX Jobs by 3,295,942 users, who interacted with (displayed, replied to, or bookmarked) 185,395 job ads in two weeks of 2020. We demonstrate that RP3beta, SLIM, and ALS perform significantly better than Prod2Vec and LightFM when tested in a laboratory setting. Online A/B tests also demonstrated that sending messages with recommendations generated by the ALS and RP3beta models increases the number of users contacting advertisers. Additionally, RP3beta had a 20% greater impact on this metric than ALS.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Automated deep reinforcement learning for real-time scheduling strategy of multi-energy system integrated with post-carbon and direct-air carbon captured system. (arXiv:2301.07768v1 [eess.SY])
    The carbon-capturing process with the aid of carbon dioxide removal technology (CDRT) has been recognised as an alternative and a prominent approach to deep decarbonisation. However, the main hindrance is the enormous energy demand and the economic implication of CDRT if not effectively managed. Hence, a novel deep reinforcement learning (DRL) agent, integrated with an automated hyperparameter selection feature, is proposed in this study for the real-time scheduling of a multi-energy system coupled with CDRT. Post-carbon capture systems (PCCS) and direct-air capture systems (DACS) are considered CDRT. Various possible configurations are evaluated using real-time multi-energy data of a district in Arizona and CDRT parameters from manufacturers' catalogues and pilot project documentation. The simulation results validate that an optimised soft-actor critic (SAC) algorithm outperformed the TD3 algorithm due to its maximum entropy feature. We then trained four SAC agents, equivalent to the number of considered case studies, using optimised hyperparameter values and deployed them in real time for evaluation. The results show that the proposed DRL agent can meet the prosumers' multi-energy demand and schedule the CDRT energy demand economically without violating the specified constraints. Also, the proposed DRL agent outperformed rule-based scheduling by 23.65%. Among the configurations, the one with PCCS and solid-sorbent DACS is considered the most suitable, with a high CO2 captured-released ratio of 38.54, a low CO2 released indicator value of 2.53, and a 36.5% reduction in CDR cost due to waste heat utilisation and the high absorption capacity of the selected sorbent. However, the adoption of CDRT is not economically viable at the current carbon price. Finally, we showed that CDRT would be attractive at a carbon price of 400-450 USD/ton with the provision of tax incentives by the policymakers.
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log-concave distributions in the Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning; however, their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.
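The link between average pooling and the Haar basis mentioned in the abstract above can be illustrated on a 1D signal. The following is a minimal sketch, not the paper's framework: the function names and the test signal are hypothetical, and only one level of the standard orthonormal Haar transform is shown. It checks that pairwise average pooling equals the Haar approximation coefficients up to a factor of sqrt(2), and that dropping the detail coefficients projects the signal onto piecewise-constant functions.

```python
import numpy as np

def avg_pool_1d(x):
    """Average non-overlapping pairs (window 2, stride 2)."""
    return x.reshape(-1, 2).mean(axis=1)

def haar_step(x):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

def haar_inverse(approx, detail):
    """Invert one Haar level, interleaving the reconstructed pairs."""
    out = np.empty(2 * approx.size)
    out[0::2] = (approx + detail) / np.sqrt(2)
    out[1::2] = (approx - detail) / np.sqrt(2)
    return out

# Hypothetical test signal of even length.
x = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 0.0, 6.0])

pooled = avg_pool_1d(x)
approx, detail = haar_step(x)

# Average pooling is the Haar approximation rescaled by 1/sqrt(2).
pooling_matches_haar = np.allclose(pooled, approx / np.sqrt(2))

# Zeroing the detail coefficients yields the piecewise-constant projection,
# which is the pooled signal repeated over each window.
projection = haar_inverse(approx, np.zeros_like(detail))
projection_matches_upsampled_pool = np.allclose(projection, np.repeat(pooled, 2))

# Keeping the detail coefficients reconstructs the signal exactly.
reconstruction_is_exact = np.allclose(haar_inverse(approx, detail), x)
```

In this toy setting the detail coefficients carry exactly the information that pooling discards, which is the sense in which a pooling hierarchy traces out a Haar-style multi-resolution decomposition.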
    What's happening in your neighborhood? A Weakly Supervised Approach to Detect Local News. (arXiv:2301.08146v1 [cs.IR])
    Local news articles are a subset of news that impact users in a geographical area, such as a city, county, or state. Detecting local news (Step 1) and subsequently deciding its geographical location as well as radius of impact (Step 2) are two important steps towards accurate local news recommendation. Naive rule-based methods, such as detecting city names from the news title, tend to give erroneous results due to a lack of understanding of the news content. Empowered by the latest developments in natural language processing, we develop an integrated pipeline that enables automatic local news detection and content-based local news recommendations. In this paper, we focus on Step 1 of the pipeline, which highlights: (1) a weakly supervised framework incorporated with domain knowledge and auto data processing, and (2) scalability to multi-lingual settings. Compared with the Stanford CoreNLP NER model, our pipeline has higher precision and recall when evaluated on a real-world, human-labeled dataset. This pipeline has the potential to deliver more precise local news to users, help local businesses get more exposure, and give people more information about their neighborhood safety.
    On backpropagating Hessians through ODEs. (arXiv:2301.08085v1 [math.OC])
    We discuss the problem of numerically backpropagating Hessians through ordinary differential equations (ODEs) in various contexts and elucidate how different approaches may be favourable in specific situations. We discuss both theoretical and pragmatic aspects such as, respectively, bounds on computational effort and typical impact of framework overhead. Focusing on the approach of hand-implemented ODE-backpropagation, we develop the computation for the Hessian of orbit-nonclosure for a mechanical system. We also clarify the mathematical framework for extending the backward-ODE-evolution of the costate-equation to Hessians, in its most generic form. Some calculations, such as that of the Hessian for orbit non-closure, are performed in a language, defined in terms of a formal grammar, that we introduce to facilitate the tracking of intermediate quantities. As pedagogical examples, we discuss the Hessian of orbit-nonclosure for the higher dimensional harmonic oscillator and conceptually related problems in Newtonian gravitational theory. In particular, applying our approach to the figure-8 three-body orbit, we readily rediscover a distorted-figure-8 solution originally described by Simó. Possible applications may include: improvements to training of 'neural ODE'-type deep learning with second-order methods, numerical analysis of quantum corrections around classical paths, and, more broadly, studying options for adjusting an ODE's initial configuration such that the impact on some given objective function is small.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
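As background for the abstract above, the sketch below computes exact Shapley values for a tiny cooperative game and checks one elementary fact: because the Shapley value is linear in the value function, averaging the Shapley values over random outcomes gives the same result as taking the Shapley values of the averaged value function. This only illustrates linearity and may differ from the paper's definition of the shifted value function. The game, the two equally likely scenarios, and all names are hypothetical.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

players = ("a", "b", "c")

# Two hypothetical, equally likely realisations of a noisy value function.
def v_low(coalition):
    return float(len(coalition) ** 2)

def v_high(coalition):
    # Same base game plus a bonus of 3 whenever player "a" participates.
    return float(len(coalition) ** 2) + (3.0 if "a" in coalition else 0.0)

def v_mean(coalition):
    return 0.5 * (v_low(coalition) + v_high(coalition))

phi_low = shapley_values(players, v_low)
phi_high = shapley_values(players, v_high)
phi_mean = shapley_values(players, v_mean)

# Average of the per-scenario Shapley values.
phi_averaged = {p: 0.5 * (phi_low[p] + phi_high[p]) for p in players}
```

Exact enumeration costs n! orderings, so it is only practical for a handful of players; sampling-based estimators are the usual alternative for larger games.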
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.
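The relationship between the truncation interval and the noise level described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: it assumes the Laplace mechanism as the privacy-preserving noise (the abstract does not name one), it takes the interval as given rather than choosing it via Thompson sampling, and the names `laplace_scale` and `private_response` are hypothetical.

```python
import numpy as np

def laplace_scale(lower, upper, epsilon):
    """Noise scale for a query truncated to [lower, upper].

    The sensitivity of the truncated value is the interval width, and the
    Laplace mechanism (an assumption here) uses scale = sensitivity / epsilon.
    """
    return (upper - lower) / epsilon

def private_response(value, lower, upper, epsilon, rng):
    """Truncate the individual's value to [lower, upper], then add Laplace noise."""
    truncated = min(max(value, lower), upper)
    return truncated + rng.laplace(loc=0.0, scale=laplace_scale(lower, upper, epsilon))

rng = np.random.default_rng(0)

# A wide default range versus a narrower range that earlier observations
# (hypothetically) suggested; both use the same privacy budget epsilon.
wide_scale = laplace_scale(0.0, 100.0, epsilon=1.0)
narrow_scale = laplace_scale(40.0, 60.0, epsilon=1.0)

noisy_answer = private_response(52.0, 40.0, 60.0, epsilon=1.0, rng=rng)
```

With the same epsilon, the narrower interval needs five times less noise in this example, at the cost of bias whenever the true value falls outside the interval. That is the trade-off an adaptive choice of interval has to balance.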
    Getting Away with More Network Pruning: From Sparsity to Geometry and Linear Regions. (arXiv:2301.07966v1 [cs.LG])
    One surprising trait of neural networks is the extent to which their connections can be pruned with little to no effect on accuracy. But when we cross a critical level of parameter sparsity, pruning any further leads to a sudden drop in accuracy. This drop plausibly reflects a loss in model complexity, which we aim to avoid. In this work, we explore how sparsity also affects the geometry of the linear regions defined by a neural network, and consequently reduces the expected maximum number of linear regions based on the architecture. We observe that pruning affects accuracy similarly to how sparsity affects the number of linear regions and our proposed bound for the maximum number. Conversely, we find out that selecting the sparsity across layers to maximize our bound very often improves accuracy in comparison to pruning as much with the same sparsity in all layers, thereby providing us guidance on where to prune.
    BO-DBA: Query-Efficient Decision-Based Adversarial Attacks via Bayesian Optimization. (arXiv:2106.02732v2 [cs.LG] UPDATED)
    Decision-based attacks (DBA), wherein attackers perturb inputs to spoof learning algorithms by observing solely the output labels, are a severe type of adversarial attack against Deep Neural Networks (DNNs) that requires minimal knowledge on the attacker's part. State-of-the-art DBA attacks relying on zeroth-order gradient estimation require an excessive number of queries. Recently, Bayesian optimization (BO) has shown promise in reducing the number of queries in score-based attacks (SBA), in which attackers need to observe real-valued probability scores as outputs. However, extending BO to the setting of DBA is nontrivial because in DBA only output labels, instead of the real-valued scores needed by BO, are available to attackers. In this paper, we close this gap by proposing an efficient DBA attack, namely BO-DBA. Different from existing approaches, BO-DBA generates adversarial examples by searching so-called directions of perturbations. It then formulates the problem as a BO problem that minimizes the real-valued distortion of perturbations. With the optimized perturbation generation process, BO-DBA converges much faster than the state-of-the-art DBA techniques. Experimental results on pre-trained ImageNet classifiers show that BO-DBA converges within 200 queries while the state-of-the-art DBA techniques need over 15,000 queries to achieve the same level of perturbation distortion. BO-DBA also shows similar attack success rates even as compared to BO-based SBA attacks but with less distortion.
    Hybrid thermal modeling of additive manufacturing processes using physics-informed neural networks for temperature prediction and parameter identification. (arXiv:2206.07756v2 [cs.LG] UPDATED)
    Understanding the thermal behavior of additive manufacturing (AM) processes is crucial for enhancing quality control and enabling customized process design. Most purely physics-based computational models suffer from high computational costs and the need to calibrate unknown parameters, and are thus not suitable for online control and iterative design applications. Data-driven models taking advantage of the latest computational tools can serve as a more efficient surrogate, but they are usually trained over a large amount of simulation data and often fail to effectively use small but high-quality experimental data. In this work, we developed a hybrid physics-based data-driven thermal modeling approach for AM processes using physics-informed neural networks. Specifically, partially observed temperature data measured from an infrared camera is combined with the physics laws to predict full-field temperature history and to discover unknown material and process parameters. In the numerical and experimental examples, the effectiveness of adding auxiliary training data and using the pretrained model on training efficiency and prediction accuracy, as well as the ability to identify unknown parameters with partially observed data, are demonstrated. The results show that the hybrid thermal model can effectively identify unknown parameters and capture the full-field temperature accurately, and thus it has the potential to be used in iterative process design and real-time process control of AM.
    Music Playlist Title Generation Using Artist Information. (arXiv:2301.08145v1 [cs.IR])
    Automatically generating or captioning music playlist titles given a set of tracks is of significant interest in music streaming services as customized playlists are widely used in personalized music recommendation, and well-composed text titles attract users and help their music discovery. We present an encoder-decoder model that generates a playlist title from a sequence of music tracks. While previous work takes track IDs as tokenized input for playlist title generation, we use artist IDs corresponding to the tracks to mitigate the issue from the long-tail distribution of tracks included in the playlist dataset. Also, we introduce a chronological data split method to deal with newly-released tracks in real-world scenarios. Comparing the track IDs and artist IDs as input sequences, we show that the artist-based approach significantly enhances the performance in terms of word overlap, semantic relevance, and diversity.
    GIPA++: A General Information Propagation Algorithm for Graph Learning. (arXiv:2301.08209v1 [cs.LG])
    Graph neural networks (GNNs) have been widely used in graph-structured data computation, showing promising performance in various applications such as node classification, link prediction, and network recommendation. Existing works mainly focus on node-wise correlation when doing weighted aggregation of neighboring nodes based on attention, such as the dot product of the dense vectors of two nodes. This may cause conflicting noise in nodes to be propagated during information propagation. To solve this problem, we propose a General Information Propagation Algorithm (GIPA for short), which exploits more fine-grained information fusion including bit-wise and feature-wise correlations based on edge features in their propagation. Specifically, the bit-wise correlation calculates the element-wise attention weight through a multi-layer perceptron (MLP) based on the dense representations of two nodes and their edge; the feature-wise correlation is based on the one-hot representations of node attribute features for feature selection. We evaluate the performance of GIPA on the Open Graph Benchmark proteins (OGBN-proteins for short) dataset and the Alipay dataset of Alibaba. Experimental results reveal that GIPA outperforms the state-of-the-art models in terms of prediction accuracy, e.g., GIPA achieves an average ROC-AUC of $0.8901\pm 0.0011$, which is better than that of all the existing methods listed in the OGBN-proteins leaderboard.
    Building Concise Logical Patterns by Constraining Tsetlin Machine Clause Size. (arXiv:2301.08190v1 [cs.LG])
    Tsetlin machine (TM) is a logic-based machine learning approach with the crucial advantages of being transparent and hardware-friendly. While TMs match or surpass deep learning accuracy for an increasing number of applications, large clause pools tend to produce clauses with many literals (long clauses). As such, they become less interpretable. Further, longer clauses increase the switching activity of the clause logic in hardware, consuming more power. This paper introduces a novel variant of TM learning - Clause Size Constrained TMs (CSC-TMs) - where one can set a soft constraint on the clause size. As soon as a clause includes more literals than the constraint allows, it starts expelling literals. Accordingly, oversized clauses only appear transiently. To evaluate CSC-TM, we conduct classification, clustering, and regression experiments on tabular data, natural language text, images, and board games. Our results show that CSC-TM maintains accuracy with up to 80 times fewer literals. Indeed, the accuracy increases with shorter clauses for TREC, IMDb, and BBC Sports. After the accuracy peaks, it drops gracefully as the clause size approaches a single literal. We finally analyze CSC-TM power consumption and derive new convergence properties.
    Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation. (arXiv:2206.11489v2 [cs.LG] UPDATED)
    We study reinforcement learning with linear function approximation where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP), and propose a novel computation-efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in Jin et al. (2020) and lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
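The $\sqrt{Hd}$ gap quoted above follows directly from dividing the earlier upper bound by the lower bound. Ignoring logarithmic factors, this short worked step uses only the symbols defined in the abstract:

```latex
\[
\frac{\sqrt{H^{3} d^{3} T}}{H d \sqrt{T}}
  = \frac{H^{3/2}\, d^{3/2}\, \sqrt{T}}{H\, d\, \sqrt{T}}
  = H^{1/2}\, d^{1/2}
  = \sqrt{H d}.
\]
```

A regret bound of $\widetilde{O}(Hd\sqrt{T})$ therefore matches the $\Omega(Hd\sqrt{T})$ lower bound up to logarithmic factors, which is the sense in which the gap is closed.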
    A Survey of Meta-Reinforcement Learning. (arXiv:2301.08028v1 [cs.LG])
    While deep reinforcement learning (RL) has fueled multiple high-profile successes in machine learning, it is held back from more widespread adoption by its often poor data efficiency and the limited generality of the policies it produces. A promising approach for alleviating these limitations is to cast the development of better RL algorithms as a machine learning problem itself in a process called meta-RL. Meta-RL is most commonly studied in a problem setting where, given a distribution of tasks, the goal is to learn a policy that is capable of adapting to any new task from the task distribution with as little data as possible. In this survey, we describe the meta-RL problem setting in detail as well as its major variations. We discuss how, at a high level, meta-RL research can be clustered based on the presence of a task distribution and the learning budget available for each individual task. Using these clusters, we then survey meta-RL algorithms and applications. We conclude by presenting the open problems on the path to making meta-RL part of the standard toolbox for a deep RL practitioner.
    Geometric path augmentation for inference of sparsely observed stochastic nonlinear systems. (arXiv:2301.08102v1 [physics.data-an])
    Stochastic evolution equations describing the dynamics of systems under the influence of both deterministic and stochastic forces are prevalent in all fields of science. Yet, identifying these systems from sparse-in-time observations remains a challenging endeavour. Existing approaches focus either on the temporal structure of the observations by relying on conditional expectations, discarding thereby information ingrained in the geometry of the system's invariant density; or employ geometric approximations of the invariant density, which are nevertheless restricted to systems with conservative forces. Here we propose a method that reconciles these two paradigms. We introduce a new data-driven path augmentation scheme that takes the local observation geometry into account. By employing non-parametric inference on the augmented paths, we can efficiently identify the deterministic driving forces of the underlying system for systems observed at low sampling rates.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
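    As an illustration of the rank preservation idea described above, the following minimal sketch builds a score that is a poor estimate of the effect yet induces the same ranking. The effect values, the bias `3.0 * effect + 10.0`, and the helper `ranking` are hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical individual causal effects (unobservable in practice).
effect = rng.normal(size=8)

# Hypothetical scoring variable: heavily biased, but strictly increasing
# in the true effect, so it orders individuals the same way.
score = 3.0 * effect + 10.0

def ranking(values):
    """Indices of individuals sorted from largest to smallest value."""
    return np.argsort(-values)

same_ranking = bool(np.array_equal(ranking(effect), ranking(score)))
mean_abs_error = float(np.mean(np.abs(score - effect)))

print("same ranking:", same_ranking)
print("mean absolute estimation error:", mean_abs_error)
```

    The score is far from the effect in absolute terms, but a targeting decision based on the top-ranked individuals would be identical under either quantity.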
    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. (arXiv:2301.08243v1 [cs.CV])
    This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) predict several target blocks in the image, (b) sample target blocks with sufficiently large scale (occupying 15%-20% of the image), and (c) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/16 on ImageNet using 32 A100 GPUs in under 38 hours to achieve strong downstream performance across a wide range of tasks requiring various levels of abstraction, from linear classification to object counting and depth prediction.
    TINKER: A framework for Open source Cyberthreat Intelligence. (arXiv:2102.05571v6 [cs.CR] UPDATED)
    Threat intelligence on malware attacks and campaigns is increasingly being shared with other security experts for a cost or for free. Other security analysts use this intelligence to inform them of indicators of compromise, attack techniques, and preventative actions. Security analysts prepare threat analysis reports after investigating an attack, an emerging cyber threat, or a recently discovered vulnerability. Collectively known as cyber threat intelligence (CTI), the reports are typically in an unstructured format and, therefore, challenging to integrate seamlessly into existing intrusion detection systems. This paper proposes a framework that uses the aggregated CTI for analysis and defense at scale. The information is extracted and stored in a structured format using knowledge graphs such that the semantics of the threat intelligence can be preserved and shared at scale with other security analysts. Specifically, we propose the first semi-supervised open-source knowledge graph-based framework, TINKER, to capture cyber threat information and its context. Following TINKER, we generate a Cyberthreat Intelligence Knowledge Graph (CTI-KG) and demonstrate the usage using different use cases.
    CEnt: An Entropy-based Model-agnostic Explainability Framework to Contrast Classifiers' Decisions. (arXiv:2301.07941v1 [cs.LG])
    Current interpretability methods focus on explaining a particular model's decision through present input features. Such methods do not inform the user of the sufficient conditions that alter these decisions when they are not desirable. Contrastive explanations circumvent this problem by providing explanations of the form "If the feature $X>x$, the output $Y$ would be different". While different approaches have been developed to find contrasts, these methods do not all deal with mutability and attainability constraints. In this work, we present a novel approach to locally contrast the prediction of any classifier. Our Contrastive Entropy-based explanation method, CEnt, approximates a model locally by a decision tree to compute entropy information of different feature splits. A graph, G, is then built where contrast nodes are found through a one-to-many shortest path search. Contrastive examples are generated from the shortest path to reflect feature splits that alter model decisions while maintaining lower entropy. We perform local sampling on manifold-like distances computed by variational auto-encoders to reflect data density. CEnt is the first non-gradient-based contrastive method generating diverse counterfactuals that do not necessarily exist in the training data while satisfying immutability (e.g., race) and semi-immutability (e.g., age can only change in an increasing direction). Empirical evaluation on four real-world numerical datasets demonstrates the ability of CEnt to generate counterfactuals that achieve better proximity rates than existing methods without compromising latency, feasibility, and attainability. We further extend CEnt to imagery data to derive visually appealing and useful contrasts between class labels on MNIST and Fashion MNIST datasets. Finally, we show how CEnt can serve as a tool to detect vulnerabilities of textual classifiers.
    Federated Automatic Differentiation. (arXiv:2301.07806v1 [cs.LG])
    Federated learning (FL) is a general framework for learning across heterogeneous clients while preserving data privacy, under the orchestration of a central server. FL methods often compute gradients of loss functions purely locally (i.e., entirely at each client, or entirely at the server), typically using automatic differentiation (AD) techniques. We propose a federated automatic differentiation (FAD) framework that 1) enables computing derivatives of functions involving client and server computation as well as communication between them and 2) operates in a manner compatible with existing federated technology. In other words, FAD computes derivatives across communication boundaries. We show, in analogy with traditional AD, that FAD may be implemented using various accumulation modes, which introduce distinct computation-communication trade-offs and systems requirements. Further, we show that a broad class of federated computations is closed under these various modes of FAD, implying in particular that if the original computation can be implemented using privacy-preserving primitives, its derivative may be computed using only these same primitives. We then show how FAD can be used to create algorithms that dynamically learn components of the algorithm itself. In particular, we show that FedAvg-style algorithms can exhibit significantly improved performance by using FAD to adjust the server optimization step automatically, or by using FAD to learn weighting schemes for computing weighted averages across clients.
    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    A wide range of machine learning tasks can be formulated as non-convex multi-player games, where a Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Owing to the non-convexity, obtaining the existence condition of a global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper applies a conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization. (arXiv:2301.07784v1 [cs.LG])
    Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state spaces.
    Neural Regression For Scale-Varying Targets. (arXiv:2211.07447v4 [cs.LG] UPDATED)
    In this work, we demonstrate that a major limitation of regression using a mean-squared error loss is its sensitivity to the scale of its targets. This makes learning settings consisting of targets whose values take on varying scales challenging. A recently-proposed alternative loss function, known as histogram loss, avoids this issue. However, its computational cost grows linearly with the number of buckets in the histogram, which renders prediction with real-valued targets intractable. To address this issue, we propose a novel approach to training deep learning models on real-valued regression targets, autoregressive regression, which learns a high-fidelity distribution by utilizing an autoregressive target decomposition. We demonstrate that this training objective allows us to solve regression tasks involving targets with different scales.
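    The scale sensitivity of the mean-squared error mentioned above can be seen in a small numerical sketch. The data, the 10% relative error, and the chosen scales are illustrative assumptions, not the paper's experimental setup: the same relative error yields a loss that grows with the square of the target scale.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared error between two arrays."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
target = rng.normal(size=1000)   # illustrative unit-scale targets
pred = 0.9 * target              # predictions with a 10% relative error

# Rescale targets and predictions together: the relative error is unchanged,
# but the MSE is multiplied by the square of the scale factor.
scales = [1.0, 10.0, 1000.0]
losses = [mse(s * pred, s * target) for s in scales]
ratios = [loss / losses[0] for loss in losses]

for s, r in zip(scales, ratios):
    print(f"scale {s:>7}: loss ratio {r:.1f}")
```

    In a multi-target setting, this means the largest-scale targets tend to dominate an unweighted MSE objective.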
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provides several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and are therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.
    Interval Reachability of Nonlinear Dynamical Systems with Neural Network Controllers. (arXiv:2301.07912v1 [eess.SY])
    This paper proposes a computationally efficient framework, based on interval analysis, for rigorous verification of nonlinear continuous-time dynamical systems with neural network controllers. Given a neural network, we use an existing verification algorithm to construct inclusion functions for its input-output behavior. Inspired by mixed monotone theory, we embed the closed-loop dynamics into a larger system using an inclusion function of the neural network and a decomposition function of the open-loop system. This embedding provides a scalable approach for safety analysis of the neural control loop while preserving the nonlinear structure of the system. We show that one can efficiently compute hyper-rectangular over-approximations of the reachable sets using a single trajectory of the embedding system. We design an algorithm to leverage this computational advantage through partitioning strategies, improving our reachable set estimates while balancing its runtime with tunable parameters. We demonstrate the performance of this algorithm through two case studies. First, we demonstrate this method's strength in complex nonlinear environments. Then, we show that our approach matches the performance of the state-of-the-art verification algorithm for linear discretized systems.
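    To make the notion of an inclusion function concrete, here is a minimal sketch of the natural interval bound for a single affine layer followed by a ReLU. It is not the verification algorithm used in the paper (the abstract does not name it); the function `affine_relu_inclusion` and the weights below are hypothetical.

```python
import numpy as np

def affine_relu_inclusion(W, b, lower, upper):
    """Interval bounds on relu(W @ x + b) for every x in the box [lower, upper]."""
    # Split the weights by sign so each output bound picks the worst-case corner.
    W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
    out_lower = W_pos @ lower + W_neg @ upper + b
    out_upper = W_pos @ upper + W_neg @ lower + b
    # ReLU is monotone, so it can be applied to the bounds directly.
    return np.maximum(out_lower, 0.0), np.maximum(out_upper, 0.0)

# Hypothetical layer and input box.
W = np.array([[1.0, -2.0], [0.5, 0.5]])
b = np.array([0.0, -1.0])
lower, upper = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
lo, hi = affine_relu_inclusion(W, b, lower, upper)

# Empirical check: sampled outputs stay inside the hyper-rectangle [lo, hi].
rng = np.random.default_rng(0)
samples = rng.uniform(lower, upper, size=(1000, 2))
outputs = np.maximum(samples @ W.T + b, 0.0)
contained = bool(np.all((outputs >= lo - 1e-12) & (outputs <= hi + 1e-12)))
print(lo, hi, contained)
```

    The resulting box is a guaranteed over-approximation of the layer's output set, which is the property a reachability analysis of the closed loop relies on.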
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).
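    The dependence on $D$ described above can be illustrated with the classical fixed-step subgradient method, whose standard step size $D/(G\sqrt{n})$ gives the bound $DG/\sqrt{n}$ on the suboptimality of the average iterate. The sketch below is this textbook baseline, not D-Adaptation itself; the objective $f(x)=|x|$, the starting point, and the step count are illustrative assumptions.

```python
import math

def subgradient_descent(x0, n_steps, step):
    """Fixed-step subgradient descent on f(x) = |x|; returns the average iterate."""
    x, total = x0, 0.0
    for _ in range(n_steps):
        total += x
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)  # a subgradient of |x|
        x -= step * g
    return total / n_steps

x0, G, n = 5.0, 1.0, 400
D = abs(x0)                       # distance from x0 to the minimizer x* = 0
step = D / (G * math.sqrt(n))     # classical step size, which needs D

gap_tuned = abs(subgradient_descent(x0, n, step))        # f(avg) - f(x*)
bound = D * G / math.sqrt(n)                             # classical guarantee
gap_small = abs(subgradient_descent(x0, n, step / 100))  # step far too small

print(gap_tuned, bound, gap_small)
```

    With the step size set from the true $D$, the gap respects the bound; with a step 100 times smaller, the method barely moves in the same number of iterations, which is the tuning burden a method that does not need $D$ aims to remove.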
    Continuously Reliable Detection of New-Normal Misinformation: Semantic Masking and Contrastive Smoothing in High-Density Latent Regions. (arXiv:2301.07981v1 [cs.LG])
    Toxic misinformation campaigns have caused significant societal harm, e.g., affecting elections and COVID-19 information awareness. Unfortunately, despite successes of (gold standard) retrospective studies of misinformation that confirmed their harmful effects after the fact, they arrive too late for timely intervention and reduction of such harm. By design, misinformation evades retrospective classifiers by exploiting two properties we call new-normal: (1) never-seen-before novelty that causes inescapable generalization challenges for previous classifiers, and (2) massive but short campaigns that end before they can be manually annotated for new classifier training. To tackle these challenges, we propose UFIT, which combines two techniques: semantic masking of strong signal keywords to reduce overfitting, and intra-proxy smoothness regularization of high-density regions in the latent space to improve reliability and maintain accuracy. Evaluation of UFIT on public new-normal misinformation data shows over 30% improvement over existing approaches on future (and unseen) campaigns. To the best of our knowledge, UFIT is the first successful effort to achieve such a high level of generalization on new-normal misinformation data with minimal concession (1 to 5%) of accuracy compared to oracles trained with full knowledge of all campaigns.
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: (i) they hold in expectation, with no restrictions on the class of algorithms under consideration; (ii) they hold globally, and do not rely on the notion of localization used by Foster et al. (2021); and (iii) most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.
    Multi-Agent Interplay in a Competitive Survival Environment. (arXiv:2301.08030v1 [cs.LG])
    Solving hard-exploration environments is an important challenge in Reinforcement Learning. Several approaches have been proposed and studied, such as Intrinsic Motivation, co-evolution of agents and tasks, and multi-agent competition. In particular, the interplay between multiple agents has proven to be capable of generating human-relevant emergent behaviour that would be difficult or impossible to learn in single-agent settings. In this work, an extensible competitive environment for multi-agent interplay was developed, which features realistic physics and human-relevant semantics. Moreover, several experiments on different variants of this environment were performed, resulting in some simple emergent strategies and concrete directions for future improvement. The content presented here is part of the author's thesis "Multi-Agent Interplay in a Competitive Survival Environment" for the Master's Degree in Artificial Intelligence and Robotics at Sapienza University of Rome, 2022.
    A Survey of Zero-shot Generalisation in Deep Reinforcement Learning. (arXiv:2111.09794v6 [cs.LG] UPDATED)
    The study of zero-shot generalisation (ZSG) in deep Reinforcement Learning (RL) aims to produce RL algorithms whose policies generalise well to novel unseen situations at deployment time, avoiding overfitting to their training environments. Tackling this is vital if we are to deploy reinforcement learning algorithms in real world scenarios, where the environment will be diverse, dynamic and unpredictable. This survey is an overview of this nascent field. We rely on a unifying formalism and terminology for discussing different ZSG problems, building upon previous works. We go on to categorise existing benchmarks for ZSG, as well as current methods for tackling these problems. Finally, we provide a critical discussion of the current state of the field, including recommendations for future work. Among other conclusions, we argue that taking a purely procedural content generation approach to benchmark design is not conducive to progress in ZSG, we suggest fast online adaptation and tackling RL-specific problems as some areas for future work on methods for ZSG, and we recommend building benchmarks in underexplored problem settings such as offline RL ZSG and reward-function variation.
    Explainability in subgraphs-enhanced Graph Neural Networks. (arXiv:2209.07926v2 [cs.LG] UPDATED)
    Recently, subgraphs-enhanced Graph Neural Networks (SGNNs) have been introduced to enhance the expressive power of Graph Neural Networks (GNNs), which was proved to be not higher than the 1-dimensional Weisfeiler-Leman isomorphism test. The new paradigm suggests using subgraphs extracted from the input graph to improve the model's expressiveness, but the additional complexity exacerbates an already challenging problem in GNNs: explaining their predictions. In this work, we adapt PGExplainer, one of the most recent explainers for GNNs, to SGNNs. The proposed explainer accounts for the contribution of all the different subgraphs and can produce a meaningful explanation that humans can interpret. The experiments that we performed both on real and synthetic datasets show that our framework is successful in explaining the decision process of an SGNN on graph classification tasks.
    Concept Discovery for Fast Adaptation. (arXiv:2301.07850v1 [cs.LG])
    The advances in deep learning have enabled machine learning methods to outperform human beings in various areas, but it remains a great challenge for a well-trained model to quickly adapt to a new task. One promising solution to realize this goal is through meta-learning, also known as learning to learn, which has achieved promising results in few-shot learning. However, current approaches are still enormously different from human beings' learning process, especially in the ability to extract structural and transferable knowledge. This drawback makes current meta-learning frameworks non-interpretable and hard to extend to more complex tasks. We tackle this problem by introducing concept discovery to the few-shot learning problem, where we achieve more effective adaptation by meta-learning the structure among the data features, leading to a composite representation of the data. Our proposed method Concept-Based Model-Agnostic Meta-Learning (COMAML) has been shown to achieve consistent improvements in the structured data for both synthesized datasets and real-world datasets.
    Using CycleGANs to Generate Realistic STEM Images for Machine Learning. (arXiv:2301.07743v1 [cond-mat.mtrl-sci])
    The rise of automation and machine learning (ML) in electron microscopy has the potential to revolutionize materials research by enabling the autonomous collection and processing of vast amounts of atomic resolution data. However, a major challenge is developing ML models that can reliably and rapidly generalize to large data sets with varying experimental conditions. To overcome this challenge, we develop a cycle generative adversarial network (CycleGAN) that introduces a novel reciprocal space discriminator to augment simulated data with realistic, complex spatial frequency information learned from experimental data. This enables the CycleGAN to generate nearly indistinguishable images from real experimental data, while also providing labels for further ML applications. We demonstrate the effectiveness of this approach by training a fully convolutional network (FCN) to identify single atom defects in a large data set of 4.5 million atoms, which we collected using automated acquisition in an aberration-corrected scanning transmission electron microscope (STEM). Our approach yields highly adaptable FCNs that can adjust to dynamically changing experimental variables, such as lens aberrations, noise, and local contamination, with minimal manual intervention. This represents a significant step towards building fully autonomous approaches for harnessing microscopy big data.
    Position Regression for Unsupervised Anomaly Detection. (arXiv:2301.08064v1 [cs.CV])
    In recent years, anomaly detection has become an essential field in medical image analysis. Most current anomaly detection methods for medical images are based on image reconstruction. In this work, we propose a novel anomaly detection approach based on coordinate regression. Our method estimates the position of patches within a volume, and is trained only on data of healthy subjects. During inference, we can detect and localize anomalies by considering the error of the position estimate of a given patch. We apply our method to 3D CT volumes and evaluate it on patients with intracranial haemorrhages and cranial fractures. The results show that our method performs well in detecting these anomalies. Furthermore, we show that our method requires less memory than comparable approaches that involve image reconstruction. This is highly relevant for processing large 3D volumes, for instance, CT or MRI scans.
    A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems. (arXiv:2301.07799v1 [cs.LG])
    Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
    A Nonstochastic Control Approach to Optimization. (arXiv:2301.07902v1 [cs.LG])
    Tuning optimizer hyperparameters, notably the learning rate, to a particular optimization instance is an important but nonconvex problem. Therefore, iterative optimization methods such as hypergradient descent lack global optimality guarantees in general. We propose an online nonstochastic control methodology for mathematical optimization. The choice of hyperparameters for gradient based methods, including the learning rate, momentum parameter and preconditioner, is described as feedback control. The optimal solution to this control problem is shown to encompass preconditioned adaptive gradient methods with varying acceleration and momentum parameters. Although the optimal control problem by itself is nonconvex, we show how recent methods from online nonstochastic control based on convex relaxation can be applied to compete with the best offline solution. This guarantees that in episodic optimization, we converge to the best optimization method in hindsight.
    Identification, explanation and clinical evaluation of hospital patient subtypes. (arXiv:2301.08019v1 [cs.LG])
    We present a pipeline in which unsupervised machine learning techniques are used to automatically identify subtypes of hospital patients admitted between 2017 and 2021 in a large UK teaching hospital. With the use of state-of-the-art explainability techniques, the identified subtypes are interpreted and assigned clinical meaning. In parallel, clinicians assessed intra-cluster similarities and inter-cluster differences of the identified patient subtypes within the context of their clinical knowledge. By confronting the outputs of both automatic and clinician-based explanations, we aim to highlight the mutual benefit of combining machine learning techniques with clinical expertise.
    Discover governing differential equations from evolving systems. (arXiv:2301.07863v1 [physics.comp-ph])
    Discovering the governing equations of evolving systems from available observations is essential and challenging. However, current methods do not capture the situation in which the underlying system dynamics can change. Evolving systems change over time, invariably varying with the system status. Thus, finding the exact change points is critical. We propose an online modeling method capable of handling samples one by one sequentially by modeling streaming data instead of processing the entire dataset. The proposed method performs well in discovering ordinary differential equations, partial differential equations (PDEs), and high-dimensional PDEs from streaming data. The measurement generated from a changed system is distributed dissimilarly to before; hence, the difference can be identified by the proposed method. Our proposal performs well in identifying the change points and discovering governing differential equations in two evolving systems.
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. Lewkowycz et al. (2020) discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.
    ClusterLog: Clustering Logs for Effective Log-based Anomaly Detection. (arXiv:2301.07846v1 [cs.DC])
    With the increasing prevalence of scalable file systems in the context of High Performance Computing (HPC), the importance of accurate anomaly detection on runtime logs is increasing. But as it currently stands, many state-of-the-art methods for log-based anomaly detection, such as DeepLog, have encountered numerous challenges when applied to logs from many parallel file systems (PFSes), often due to their irregularity and ambiguity in time-based log sequences. To circumvent these problems, this study proposes ClusterLog, a log pre-processing method that clusters the temporal sequence of log keys based on their semantic similarity. By grouping semantically and sentimentally similar logs, this approach aims to represent log sequences with the smallest number of unique log keys, intending to improve the ability of a downstream sequence-based model to effectively learn the log patterns. The preliminary results of ClusterLog indicate not only its effectiveness in reducing the granularity of log sequences without the loss of important sequence information but also its generalizability to different file systems' logs.
    From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition. (arXiv:2301.07851v1 [cs.SD])
    In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize other languages. We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement that, for the first time, empowers model reprogramming on ASR. Specifically, we investigate how to select trainable components (i.e., encoder) of a conformer-based RNN-Transducer, as a frozen pre-trained backbone. Experiments on a seven-language multilingual LibriSpeech speech (MLS) task show that model reprogramming only requires 4.2% (11M out of 270M) to 6.8% (45M out of 660M) of its original trainable parameters from a full ASR model to perform competitive results in a range of 11.9% to 8.1% WER averaged across different languages. In addition, we discover different setups to make large-scale pre-trained ASR succeed in both monolingual and multilingual speech recognition. Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses (e.g., w2v-bert) in terms of lower WER and better training efficiency.
    Suboptimality analysis of receding horizon quadratic control with unknown linear systems and its applications in learning-based control. (arXiv:2301.07876v1 [eess.SY])
    For a receding-horizon controller with a known system and with an approximate terminal value function, it is well-known that increasing the prediction horizon can improve its control performance. However, when the prediction model is inexact, a larger prediction horizon also causes propagation and accumulation of the prediction error. In this work, we aim to analyze the effect of the above trade-off between the modeling error, the terminal value function error, and the prediction horizon on the performance of a nominal receding-horizon linear quadratic (LQ) controller. By developing a novel perturbation result of the Riccati difference equation, a performance upper bound is obtained and suggests that for many cases, the prediction horizon should be either 1 or infinity to improve the control performance, depending on the relative difference between the modeling error and the terminal value function error. The obtained suboptimality performance bound is also applied to provide end-to-end performance guarantees, e.g., regret bounds, for nominal receding-horizon LQ controllers in a learning-based setting.
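    The interplay between prediction horizon and terminal value function can be seen in a scalar example. The sketch below assumes a known scalar system x' = a x + b u with stage cost q x^2 + r u^2, so it illustrates only the nominal Riccati difference equation, not the paper's perturbation bound. The helper names `riccati_step` and `receding_horizon_gain` are hypothetical.

```python
import math


def riccati_step(p, a, b, q, r):
    """One backward step of the scalar Riccati difference equation."""
    return q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)


def receding_horizon_gain(a, b, q, r, horizon, p_terminal):
    """First feedback gain K (u = -K x) of a horizon-N LQ problem.

    p_terminal is the terminal value-function weight; horizon - 1
    backward Riccati steps give the cost-to-go used by the first input.
    """
    p = p_terminal
    for _ in range(horizon - 1):
        p = riccati_step(p, a, b, q, r)
    return a * b * p / (r + b * b * p)


# For a = b = q = r = 1 the stationary Riccati solution is the golden ratio.
p_star = (1.0 + math.sqrt(5.0)) / 2.0
k_exact_terminal = receding_horizon_gain(1.0, 1.0, 1.0, 1.0, 1, p_star)
k_short = receding_horizon_gain(1.0, 1.0, 1.0, 1.0, 1, 0.0)
k_long = receding_horizon_gain(1.0, 1.0, 1.0, 1.0, 60, 0.0)
print(k_exact_terminal, k_short, k_long)
```

    With an exact terminal value function a horizon of 1 already gives the stationary gain, whereas a poor terminal value (here zero) needs a long horizon to reach the same gain. With an inexact model, each extra Riccati step would also propagate the modeling error, which is the trade-off the abstract analyzes.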
    HCE: Improving Performance and Efficiency with Heterogeneously Compressed Neural Network Ensemble. (arXiv:2301.07794v1 [cs.LG])
    Ensemble learning has gained attention in recent deep learning research as a way to further boost the accuracy and generalizability of deep neural network (DNN) models. Recent ensemble training methods explore different training algorithms or settings on multiple sub-models with the same model architecture, which leads to a significant burden on the memory and computation cost of the ensemble model. Meanwhile, the heuristically induced diversity may not lead to a significant performance gain. We propose a new perspective on exploring the intrinsic diversity within a model architecture to build an efficient DNN ensemble. We make an intriguing observation that pruning and quantization, while both leading to efficient model architectures at the cost of a small accuracy drop, lead to distinct behavior in the decision boundary. To this end, we propose Heterogeneously Compressed Ensemble (HCE), where we build an efficient ensemble with the pruned and quantized variants from a pretrained DNN model. A diversity-aware training objective is proposed to further boost the performance of the HCE ensemble. Experimental results show that HCE achieves significant improvement in the efficiency-accuracy tradeoff compared to both traditional DNN ensemble training methods and previous model compression methods.
    FE-TCM: Filter-Enhanced Transformer Click Model for Web Search. (arXiv:2301.07854v1 [cs.IR])
    Constructing click models and extracting implicit relevance feedback information from the interaction between users and search engines are very important to improve the ranking of search results. Using neural networks to model users' click behaviors has become one of the effective methods to construct click models. In this paper, we use a Transformer as the backbone network for feature extraction, add a filter layer, and propose a new Filter-Enhanced Transformer Click Model (FE-TCM) for web search. Firstly, in order to reduce the influence of noise on user behavior data, we use learnable filters to filter log noise. Secondly, following the examination hypothesis, we model the attraction estimator and examination predictor separately to output the attractiveness scores and examination probabilities. A novel transformer model is used to learn a deeper representation among different features. Finally, we apply combination functions to integrate attractiveness scores and examination probabilities into the click prediction. Our experiments on two real-world session datasets show that FE-TCM outperforms the existing click models for click prediction.
    A Scalable Finite Difference Method for Deep Reinforcement Learning. (arXiv:2210.07487v2 [cs.LG] UPDATED)
    Several low-bandwidth distributable black-box optimization algorithms in the family of finite differences such as Evolution Strategies have recently been shown to perform nearly as well as tailored Reinforcement Learning methods in some Reinforcement Learning domains. One shortcoming of these black-box methods is that they must collect information about the structure of the return function at every update, and can often employ only information drawn from a distribution centered around the current parameters. As a result, when these algorithms are distributed across many machines, a significant portion of total runtime may be spent with many machines idle, waiting for a final return and then for an update to be calculated. In this work we introduce a novel method to use older data in finite difference algorithms, which produces a scalable algorithm that avoids significant idle time or wasted computation.
    RNAS-CL: Robust Neural Architecture Search by Cross-Layer Knowledge Distillation. (arXiv:2301.08092v1 [cs.CV])
    Deep Neural Networks are vulnerable to adversarial attacks. Neural Architecture Search (NAS), one of the driving tools of deep neural networks, demonstrates superior performance in prediction accuracy in various machine learning applications. However, it is unclear how it performs against adversarial attacks. Given the presence of a robust teacher, it would be interesting to investigate if NAS would produce a robust neural architecture by inheriting robustness from the teacher. In this paper, we propose Robust Neural Architecture Search by Cross-Layer Knowledge Distillation (RNAS-CL), a novel NAS algorithm that improves the robustness of NAS by learning from a robust teacher through cross-layer knowledge distillation. Unlike previous knowledge distillation methods that encourage close student/teacher output only in the last layer, RNAS-CL automatically searches for the best teacher layer to supervise each student layer. Experimental results demonstrate the effectiveness of RNAS-CL and show that RNAS-CL produces small and robust neural architectures.
    SpotHitPy: A Study For ML-Based Song Hit Prediction Using Spotify. (arXiv:2301.07978v1 [cs.SD])
    In this study, we approached the Hit Song Prediction problem, which aims to predict which songs will become Billboard hits. We gathered a dataset of nearly 18500 hit and non-hit songs and extracted their audio features using the Spotify Web API. We test four machine-learning models on our dataset. We were able to predict the Billboard success of a song with approximately 86% accuracy. The most successful algorithms were Random Forest and Support Vector Machine.
    General Greedy De-bias Learning. (arXiv:2112.10572v5 [cs.LG] UPDATED)
    Neural networks often make predictions relying on the spurious correlations from the datasets rather than the intrinsic properties of the task of interest, facing sharp degradation on out-of-distribution (OOD) test data. Existing de-bias learning frameworks try to capture specific dataset bias by annotations but they fail to handle complicated OOD scenarios. Others implicitly identify the dataset bias by special design low capability biased models or losses, but they degrade when the training and testing data are from the same distribution. In this paper, we propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model. The base model is encouraged to focus on examples that are hard to solve with biased models, thus remaining robust against spurious correlations in the test stage. GGD largely improves models' OOD generalization ability on various tasks, but sometimes over-estimates the bias level and degrades on the in-distribution test. We further re-analyze the ensemble process of GGD and introduce the Curriculum Regularization inspired by curriculum learning, which achieves a good trade-off between in-distribution and out-of-distribution performance. Extensive experiments on image classification, adversarial question answering, and visual question answering demonstrate the effectiveness of our method. GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
    Learning Quantum Processes with Memory -- Quantum Recurrent Neural Networks. (arXiv:2301.08167v1 [quant-ph])
    Recurrent neural networks play an important role in both research and industry. With the advent of quantum machine learning, the quantisation of recurrent neural networks has become recently relevant. We propose fully quantum recurrent neural networks, based on dissipative quantum neural networks, capable of learning general causal quantum automata. A quantum training algorithm is proposed and classical simulations for the case of product outputs with the fidelity as cost function are carried out. We thereby demonstrate the potential of these algorithms to learn complex quantum processes with memory in terms of the exemplary delay channel, the time evolution of quantum states governed by a time-dependent Hamiltonian, and high- and low-frequency noise mitigation. Numerical simulations indicate that our quantum recurrent neural networks exhibit a striking ability to generalise from small training sets.
    Augmenting a Physics-Informed Neural Network for the 2D Burgers Equation by Addition of Solution Data Points. (arXiv:2301.07824v1 [physics.flu-dyn])
    We implement a Physics-Informed Neural Network (PINN) for solving the two-dimensional Burgers equations. This type of model can be trained with no previous knowledge of the solution; instead, it relies on evaluating the governing equations of the system in points of the physical domain. It is also possible to use points with a known solution during training. In this paper, we compare PINNs trained with different amounts of governing equation evaluation points and known solution points. Comparing models that were trained purely with known solution points to those that have also used the governing equations, we observe an improvement in the overall observance of the underlying physics in the latter. We also investigate how changing the number of each type of point affects the resulting models differently. Finally, we argue that the addition of the governing equations during training may provide a way to improve the overall performance of the model without relying on additional data, which is especially important for situations where the number of known solution points is limited.
    Improving Machine Translation with Phrase Pair Injection and Corpus Filtering. (arXiv:2301.08008v1 [cs.CL])
    In this paper, we show that the combination of Phrase Pair Injection and Corpus Filtering boosts the performance of Neural Machine Translation (NMT) systems. We extract parallel phrases and sentences from the pseudo-parallel corpus and augment it with the parallel corpus to train the NMT models. With the proposed approach, we observe an improvement in the Machine Translation (MT) system for 3 low-resource language pairs, Hindi-Marathi, English-Marathi, and English-Pashto, and 6 translation directions by up to 2.7 BLEU points, on the FLORES test data. These BLEU score improvements are over the models trained using the whole pseudo-parallel corpus augmented with the parallel corpus.
    WaveMix: A Resource-efficient Neural Network for Image Analysis. (arXiv:2205.14375v3 [cs.CV] UPDATED)
    To allow image analysis in resource-constrained scenarios without compromising generalizability, we introduce WaveMix -- a novel and flexible neural framework that reduces the GPU RAM (memory) and compute (latency) compared to CNNs and transformers. In addition to using convolutional layers that exploit shift-invariant image statistics, the proposed framework uses multi-level two-dimensional discrete wavelet transform (2D-DWT) modules to exploit scale-invariance and edge sparseness, which gives it the following advantages. Firstly, the fixed weights of wavelet modules do not add to the parameter count while reorganizing information based on these image priors. Secondly, the wavelet modules scale the spatial extents of feature maps by integral powers of $\frac{1}{2}\times\frac{1}{2}$, which reduces the memory and latency required for forward and backward passes. Finally, a multi-level 2D-DWT leads to a quicker expansion of the receptive field per layer than pooling (which we do not use) and it is a more effective spatial token mixer. WaveMix also generalizes better than other token mixing models, such as ConvMixer, MLP-Mixer, PoolFormer, random filters, and Fourier basis, because the wavelet transform is much better suited for image decomposition and spatial token mixing. WaveMix is a flexible model that can perform well on multiple image tasks without needing architectural modifications. WaveMix achieves a semantic segmentation mIoU of 83% on the Cityscapes validation set outperforming transformer and CNN-based architectures. We also demonstrate the advantages of WaveMix for classification on multiple datasets and show that WaveMix establishes new state-of-the-art results in Places-365, EMNIST, and iNAT-mini datasets.
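    To make the halving of spatial extent concrete, here is a minimal sketch of one level of a 2D discrete wavelet transform with fixed (parameter-free) weights. It assumes the Haar wavelet and a plain list-of-rows image; the abstract does not state which wavelet WaveMix uses, and the function name `haar_dwt2` and the sub-band names are illustrative (naming conventions for the detail bands vary between libraries).

```python
def haar_dwt2(img):
    """One level of an orthonormal 2D Haar DWT.

    img is a list of equal-length rows with even height and width.
    Returns four sub-bands (approx, detail_cols, detail_rows, detail_diag),
    each with half the height and half the width of the input.
    """
    height, width = len(img), len(img[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    approx, detail_cols, detail_rows, detail_diag = [], [], [], []
    for i in range(0, height, 2):
        row_a, row_c, row_r, row_d = [], [], [], []
        for j in range(0, width, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            row_a.append((a + b + c + d) / 2)  # local average (low-pass)
            row_c.append((a - b + c - d) / 2)  # difference across columns
            row_r.append((a + b - c - d) / 2)  # difference across rows
            row_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(row_a)
        detail_cols.append(row_c)
        detail_rows.append(row_r)
        detail_diag.append(row_d)
    return approx, detail_cols, detail_rows, detail_diag


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)
print([(len(band), len(band[0])) for band in bands])
```

    Each 4x4 input yields four 2x2 sub-bands, so the spatial extent shrinks by 1/2 x 1/2 per level while the total signal energy is preserved; applying the function again to the approximation band gives the next level.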
    PDFormer: Propagation Delay-aware Dynamic Long-range Transformer for Traffic Flow Prediction. (arXiv:2301.07945v1 [cs.LG])
    As a core technology of Intelligent Transportation System, traffic flow prediction has a wide range of applications. The fundamental challenge in traffic flow prediction is to effectively model the complex spatial-temporal dependencies in traffic data. Spatial-temporal Graph Neural Network (GNN) models have emerged as one of the most promising methods to solve this problem. However, GNN-based models have three major limitations for traffic prediction: i) Most methods model spatial dependencies in a static manner, which limits the ability to learn dynamic urban traffic patterns; ii) Most methods only consider short-range spatial information and are unable to capture long-range spatial dependencies; iii) These methods ignore the fact that the propagation of traffic conditions between locations has a time delay in traffic systems. To this end, we propose a novel Propagation Delay-aware dynamic long-range transFormer, namely PDFormer, for accurate traffic flow prediction. Specifically, we design a spatial self-attention module to capture the dynamic spatial dependencies. Then, two graph masking matrices are introduced to highlight spatial dependencies from short- and long-range views. Moreover, a traffic delay-aware feature transformation module is proposed to empower PDFormer with the capability of explicitly modeling the time delay of spatial information propagation. Extensive experimental results on six real-world public traffic datasets show that our method can not only achieve state-of-the-art performance but also exhibit competitive computational efficiency. Moreover, we visualize the learned spatial-temporal attention map to make our model highly interpretable.
    Learning Generalizable Models for Vehicle Routing Problems via Knowledge Distillation. (arXiv:2210.07686v2 [cs.LG] UPDATED)
    Recent neural methods for vehicle routing problems always train and test the deep models on the same instance distribution (i.e., uniform). To tackle the consequent cross-distribution generalization concerns, we bring the knowledge distillation to this field and propose an Adaptive Multi-Distribution Knowledge Distillation (AMDKD) scheme for learning more generalizable deep models. Particularly, our AMDKD leverages various knowledge from multiple teachers trained on exemplar distributions to yield a light-weight yet generalist student model. Meanwhile, we equip AMDKD with an adaptive strategy that allows the student to concentrate on difficult distributions, so as to absorb hard-to-master knowledge more effectively. Extensive experimental results show that, compared with the baseline neural methods, our AMDKD is able to achieve competitive results on both unseen in-distribution and out-of-distribution instances, which are either randomly synthesized or adopted from benchmark datasets (i.e., TSPLIB and CVRPLIB). Notably, our AMDKD is generic, and consumes less computational resources for inference.
    An SDE for Modeling SAM: Theory and Insights. (arXiv:2301.08203v1 [cs.LG])
    We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and its unnormalized variant USAM, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones - by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that perhaps unexpectedly SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
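    For readers unfamiliar with the discrete-time algorithms that these SDEs approximate, the sketch below shows one full-batch step of SAM and of its unnormalized variant USAM as they are commonly stated: perturb the weights along the gradient (normalized for SAM, raw for USAM), then descend using the gradient at the perturbed point. The diagonal quadratic loss, the function names, and the hyperparameter values are assumptions for illustration; nothing here reproduces the paper's SDE derivation.

```python
import math


def grad_quadratic(w, h):
    """Gradient of L(w) = 0.5 * sum_i h_i * w_i^2."""
    return [hi * wi for hi, wi in zip(h, w)]


def sam_step(w, h, lr, rho, normalize=True):
    """One full-batch step of SAM (normalize=True) or USAM (normalize=False)."""
    g = grad_quadratic(w, h)
    if normalize:
        # Fall back to 1.0 at a critical point to avoid dividing by zero.
        norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
        g = [gi / norm for gi in g]
    # Ascent step of radius rho, then descent with the perturbed gradient.
    w_adv = [wi + rho * gi for wi, gi in zip(w, g)]
    g_adv = grad_quadratic(w_adv, h)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w0, curvature = [1.0], [2.0]
print("SAM: ", sam_step(w0, curvature, lr=0.1, rho=0.1))
print("USAM:", sam_step(w0, curvature, lr=0.1, rho=0.1, normalize=False))
```

    On this quadratic the perturbed gradient equals the plain gradient plus a curvature-weighted term, which is one simple way to see why the resulting dynamics depend on the Hessian.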
    Fully Elman Neural Network: A Novel Deep Recurrent Neural Network Optimized by an Improved Harris Hawks Algorithm for Classification of Pulmonary Arterial Wedge Pressure. (arXiv:2301.07710v1 [cs.LG])
    Heart failure (HF) is one of the most prevalent life-threatening cardiovascular diseases in which 6.5 million people are suffering in the USA and more than 23 million worldwide. Mechanical circulatory support of HF patients can be achieved by implanting a left ventricular assist device (LVAD) into HF patients as a bridge to transplant, recovery or destination therapy and can be controlled by measurement of normal and abnormal pulmonary arterial wedge pressure (PAWP). While there are no commercial long-term implantable pressure sensors to measure PAWP, real-time non-invasive estimation of abnormal and normal PAWP becomes vital. In this work, first an improved Harris Hawks optimizer algorithm called HHO+ is presented and tested on 24 unimodal and multimodal benchmark functions. Second, a novel fully Elman neural network (FENN) is proposed to improve the classification performance. Finally, four novel 18-layer deep learning methods of convolutional neural networks (CNNs) with multi-layer perceptron (CNN-MLP), CNN with Elman neural networks (CNN-ENN), CNN with fully Elman neural networks (CNN-FENN), and CNN with fully Elman neural networks optimized by HHO+ algorithm (CNN-FENN-HHO+) for classification of abnormal and normal PAWP using estimated HVAD pump flow were developed and compared. The estimated pump flow was derived by a non-invasive method embedded into the commercial HVAD controller. The proposed methods are evaluated on an imbalanced clinical dataset using 5-fold cross-validation. The proposed CNN-FENN-HHO+ method outperforms the proposed CNN-MLP, CNN-ENN and CNN-FENN methods and improved the classification performance metrics across 5-fold cross-validation. The proposed methods can reduce the likelihood of hazardous events like pulmonary congestion and ventricular suction for HF patients and notify identified abnormal cases to the hospital, clinician and cardiologist.
    Skeleton Clustering: Dimension-Free Density-based Clustering. (arXiv:2104.10770v2 [stat.ML] UPDATED)
    We introduce a density-based clustering method called skeleton clustering that can detect clusters in multivariate and even high-dimensional data with irregular shapes. To bypass the curse of dimensionality, we propose surrogate density measures that are less dependent on the dimension but have intuitive geometric interpretations. The clustering framework constructs a concise representation of the given data as an intermediate step and can be thought of as a combination of prototype methods, density-based clustering, and hierarchical clustering. We show by theoretical analysis and empirical studies that the skeleton clustering leads to reliable clusters in multivariate and high-dimensional scenarios.
    Emergence of the SVD as an interpretable factorization in deep learning for inverse problems. (arXiv:2301.07820v1 [cs.LG])
    We demonstrate the emergence of weight matrix singular value decomposition (SVD) in interpreting neural networks (NNs) for parameter estimation from noisy signals. The SVD appears naturally as a consequence of initial application of a descrambling transform - a recently-developed technique for addressing interpretability in NNs \cite{amey2021neural}. We find that within the class of noisy parameter estimation problems, the SVD may be the means by which networks memorize the signal model. We substantiate our theoretical findings with empirical evidence from both linear and non-linear settings. Our results also illuminate the connections between a mathematical theory of semantic development \cite{saxe2019mathematical} and neural network interpretability.
    Spatio-temporal neural structural causal models for bike flow prediction. (arXiv:2301.07843v1 [cs.LG])
    As a representative of public transportation, the fundamental issue of managing bike-sharing systems is bike flow prediction. Recent methods overemphasize the spatio-temporal correlations in the data, ignoring the effects of contextual conditions on the transportation system and the inter-regional time-varying causality. In addition, due to the disturbance of incomplete observations in the data, random contextual conditions lead to spurious correlations between data and features, making the prediction of the model ineffective in special scenarios. To overcome this issue, we propose a Spatio-temporal Neural Structure Causal Model (STNSCM) from the perspective of causality. First, we build a causal graph to describe the traffic prediction, and further analyze the causal relationship between the input data, contextual conditions, spatio-temporal states, and prediction results. Second, we propose to apply the front-door criterion to eliminate confounding biases in the feature extraction process. Finally, we propose a counterfactual representation reasoning module to extrapolate the spatio-temporal state under the factual scenario to future counterfactual scenarios to improve the prediction performance. Experiments on real-world datasets demonstrate the superior performance of our model, especially its resistance to fluctuations caused by the external environment. The source code and data will be released.
  • Open

    Global Nash Equilibrium in Non-convex Multi-player Game: Theory and Algorithms. (arXiv:2301.08015v1 [cs.GT])
    Many machine learning tasks can be formulated as non-convex multi-player games, where Nash equilibrium (NE) is an acceptable solution to all players, since no one can benefit from changing its strategy unilaterally. Owing to the non-convexity, obtaining the existence condition of global NE is challenging, let alone designing theoretically guaranteed realization algorithms. This paper takes conjugate transformation to the formulation of non-convex multi-player games, and casts the complementary problem into a variational inequality (VI) problem with a continuous pseudo-gradient mapping. We then prove the existence condition of global NE: the solution to the VI problem satisfies a duality relation. Based on this VI formulation, we design a conjugate-based ordinary differential equation (ODE) to approach global NE, which is proved to have an exponential convergence rate. To make the dynamics more implementable, we further derive a discretized algorithm. We apply our algorithm to two typical scenarios: multi-player generalized monotone game and multi-player potential game. In the two settings, we prove that the step-size setting is required to be $\mathcal{O}(1/k)$ and $\mathcal{O}(1/\sqrt k)$ to yield the convergence rates of $\mathcal{O}(1/ k)$ and $\mathcal{O}(1/\sqrt k)$, respectively. Extensive experiments in robust neural network training and sensor localization are in full agreement with our theory.
    Shapley Values with Uncertain Value Functions. (arXiv:2301.08086v1 [cs.LG])
    We propose a novel definition of Shapley values with uncertain value functions based on first principles using probability theory. Such uncertain value functions can arise in the context of explainable machine learning as a result of non-deterministic algorithms. We show that random effects can in fact be absorbed into a Shapley value with a noiseless but shifted value function. Hence, Shapley values with uncertain value functions can be used in analogy to regular Shapley values. However, their reliable evaluation typically requires more computational effort.
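    As background for the claim that random effects can be absorbed into a noiseless value function, the sketch below computes exact Shapley values for a small game and checks the standard linearity property: the Shapley values of an averaged value function equal the average of the Shapley values. This is textbook material rather than the paper's new definition, and the function name `shapley_values` and the two toy games are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that p joins.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


players = ["a", "b", "c"]
weights = {"a": 1.0, "b": 2.0, "c": 3.0}


def additive_game(s):
    return sum(weights[p] for p in s)


def symmetric_game(s):
    return float(len(s) ** 2)


def mean_game(s):
    # Stand-in for averaging two realizations of an uncertain value function.
    return 0.5 * (additive_game(s) + symmetric_game(s))


print(shapley_values(players, additive_game))
print(shapley_values(players, mean_game))
```

    The cost is exponential in the number of players, and with a genuinely random value function each coalition would additionally need repeated evaluations, which is consistent with the abstract's remark about extra computational effort.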
    Score-based Causal Representation Learning with Interventions. (arXiv:2301.08230v1 [stat.ML])
    This paper studies the causal representation learning problem when the latent causal variables are observed indirectly through an unknown linear transformation. The objectives are: (i) recovering the unknown linear transformation (up to scaling and ordering), and (ii) determining the directed acyclic graph (DAG) underlying the latent variables. Since identifiable representation learning is impossible based on only observational data, this paper uses both observational and interventional data. The interventional data is generated under distinct single-node randomized hard and soft interventions. These interventions are assumed to cover all nodes in the latent space. It is established that the latent DAG structure can be recovered under soft randomized interventions via the following two steps. First, a set of transformation candidates is formed by including all inverting transformations corresponding to which the score function of the transformed variables has the minimal number of coordinates that change between an interventional and the observational environment summed over all pairs. Subsequently, this set is distilled using a simple constraint to recover the latent DAG structure. For the special case of hard randomized interventions, with an additional hypothesis testing step, one can also uniquely recover the linear transformation, up to scaling and a valid causal ordering. These results generalize the recent results that either assume deterministic hard interventions or linear causal relationships in the latent space.
    Learning to Rank by Causal Effects Without Data to Accurately Estimate Causal Effects. (arXiv:2206.12532v2 [stat.ML] UPDATED)
    Decision makers often want to identify the individuals for whom some intervention or treatment will be most effective in order to decide who to treat. In such cases, decision makers would ideally like to rank potential recipients of the treatment according to their individual causal effects. However, the available data may be completely inadequate to estimate causal effects accurately. We formalize a new assumption -- the rank preservation assumption (RPA) -- that defines when data are suitable to learn how to rank individuals according to their causal effects, even if the effects themselves cannot be accurately estimated. The RPA holds when there is data to estimate a scoring variable that induces the same ranking of individuals as the causal effect of interest. Some of the scoring variables we consider are confounded estimates, proxy causal effects, and non-causal quantities. We show that such scoring variables can work well for treatment assignment if the RPA is met, and potentially even better than using causal effects as scores. We also show that the RPA holds under conditions that are more general and weaker than the typical assumptions made in observational studies. Finally, we showcase how practitioners can apply and evaluate alternative scoring models (including non-causal models) to maximize the causal impact of their targeting decisions.
    Everything is Connected: Graph Neural Networks. (arXiv:2301.08210v1 [cs.LG])
    In many ways, graphs are the main modality of data we receive from nature. This is due to the fact that most of the patterns we see, both in natural and artificial systems, are elegantly representable using the language of graph structures. Prominent examples include molecules (represented as graphs of atoms and bonds), social networks and transportation networks. This potential has already been seen by key scientific and industrial groups, with already-impacted application areas including traffic forecasting, drug discovery, social network analysis and recommender systems. Further, some of the most successful domains of application for machine learning in previous years -- images, text and speech processing -- can be seen as special cases of graph representation learning, and consequently there has been significant exchange of information between these areas. The main aim of this short survey is to enable the reader to assimilate the key concepts in the area, and position graph representation learning in a proper context with related fields.
    Diffusion-based Conditional ECG Generation with Structured State Space Models. (arXiv:2301.08227v1 [eess.SP])
    Synthetic data generation is a promising solution to address privacy issues with the distribution of sensitive health data. Recently, diffusion models have set new standards for generative models for different data modalities. Also very recently, structured state space models emerged as a powerful modeling paradigm to capture long-term dependencies in time series. We put forward SSSD-ECG, as the combination of these two technologies, for the generation of synthetic 12-lead electrocardiograms conditioned on more than 70 ECG statements. Due to a lack of reliable baselines, we also propose conditional variants of two state-of-the-art unconditional generative models. We thoroughly evaluate the quality of the generated samples, by evaluating pretrained classifiers on the generated data and by evaluating the performance of a classifier trained only on synthetic data, where SSSD-ECG clearly outperforms its GAN-based competitors. We demonstrate the soundness of our approach through further experiments, including conditional class interpolation and a clinical Turing test demonstrating the high quality of the SSSD-ECG samples across a wide range of conditions.
    Equivalence relations and $L^p$ distances between time series with application to the Black Summer Australian bushfires. (arXiv:2002.02592v2 [stat.ME] UPDATED)
    This paper introduces a new framework of algebraic equivalence relations between time series and new distance metrics between them, then applies these to investigate the Australian "Black Summer" bushfire season of 2019-2020. First, we introduce a general framework for defining equivalence between time series, heuristically intended to be equivalent if they differ only up to noise. Our first specific implementation is based on using change point algorithms and comparing statistical quantities such as mean or variance in stationary segments. We thus derive the existence of such equivalence relations on the space of time series, such that the quotient spaces can be equipped with a metrizable topology. Next, we illustrate specifically how to define and compute such distances among a collection of time series and perform clustering and additional analysis thereon. Then, we apply these insights to analyze air quality data across New South Wales, Australia, during the 2019-2020 bushfires. There, we investigate structural similarity with respect to this data and identify locations that were impacted anomalously by the fires relative to their location. This may have implications regarding the appropriate management of resources to avoid gaps in the defense against future fires.
    Semiparametric inference using fractional posteriors. (arXiv:2301.08158v1 [math.ST])
    We establish a general Bernstein--von Mises theorem for approximately linear semiparametric functionals of fractional posterior distributions based on nonparametric priors. This is illustrated in a number of nonparametric settings and for different classes of prior distributions, including Gaussian process priors. We show that fractional posterior credible sets can provide reliable semiparametric uncertainty quantification, but have inflated size. To remedy this, we further propose a \textit{shifted-and-rescaled} fractional posterior set that is an efficient confidence set having optimal size under regularity conditions. As part of our proofs, we also refine existing contraction rate results for fractional posteriors by sharpening the dependence of the rate on the fractional exponent.
    On Measuring Excess Capacity in Neural Networks. (arXiv:2202.08070v3 [cs.LG] UPDATED)
    We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2, 1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.  ( 2 min )
    Tight Guarantees for Interactive Decision Making with the Decision-Estimation Coefficient. (arXiv:2301.08215v1 [cs.LG])
    A foundational problem in reinforcement learning and interactive decision making is to understand what modeling assumptions lead to sample-efficient learning guarantees, and what algorithm design principles achieve optimal sample complexity. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient (DEC), a measure of statistical complexity which leads to upper and lower bounds on the optimal sample complexity for a general class of problems encompassing bandits and reinforcement learning with function approximation. In this paper, we introduce a new variant of the DEC, the Constrained Decision-Estimation Coefficient, and use it to derive new lower bounds that improve upon prior work on three fronts: - They hold in expectation, with no restrictions on the class of algorithms under consideration. - They hold globally, and do not rely on the notion of localization used by Foster et al. (2021). - Most interestingly, they allow the reference model with respect to which the DEC is defined to be improper, establishing that improper reference models play a fundamental role. We provide upper bounds on regret that scale with the same quantity, thereby closing all but one of the gaps between upper and lower bounds in Foster et al. (2021). Our results apply to both the regret framework and PAC framework, and make use of several new analysis and algorithm design techniques that we anticipate will find broader use.  ( 2 min )
    Differentially Private Online Bayesian Estimation With Adaptive Truncation. (arXiv:2301.08202v1 [cs.LG])
    We propose a novel online and adaptive truncation method for differentially private Bayesian online estimation of a static parameter regarding a population. We assume that sensitive information from individuals is collected sequentially and the inferential aim is to estimate, on-the-fly, a static parameter regarding the population to which those individuals belong. We propose sequential Monte Carlo to perform online Bayesian estimation. When individuals provide sensitive information in response to a query, it is necessary to perturb it with privacy-preserving noise to ensure the privacy of those individuals. The amount of perturbation is proportional to the sensitivity of the query, which is determined usually by the range of the queried information. The truncation technique we propose adapts to the previously collected observations to adjust the query range for the next individual. The idea is that, based on previous observations, we can carefully arrange the interval into which the next individual's information is to be truncated before being perturbed with privacy-preserving noise. In this way, we aim to design predictive queries with small sensitivity, hence small privacy-preserving noise, enabling more accurate estimation while maintaining the same level of privacy. To decide on the location and the width of the interval, we use an exploration-exploitation approach a la Thompson sampling with an objective function based on the Fisher information of the generated observation. We show the merits of our methodology with numerical examples.  ( 2 min )
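    The link between query range and noise described above can be illustrated with a minimal sketch. It assumes the Laplace mechanism, which the abstract does not name, and the helper names `clip`, `laplace_scale` and `privatize` are hypothetical. It shows only that truncating to a narrower interval lowers the sensitivity and therefore the noise scale for the same privacy level `epsilon`; the adaptive choice of the interval is not modelled.

```python
import random


def clip(value, lo, hi):
    """Truncate a response into the query interval [lo, hi]."""
    return max(lo, min(hi, value))


def laplace_scale(lo, hi, epsilon):
    """Noise scale for a query whose sensitivity is the interval width."""
    return (hi - lo) / epsilon


def privatize(value, lo, hi, epsilon, rng):
    """Clip, then add Laplace noise (assumed mechanism, not from the paper)."""
    b = laplace_scale(lo, hi, epsilon)
    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
    return clip(value, lo, hi) + noise


if __name__ == "__main__":
    rng = random.Random(0)
    # Same epsilon, two interval widths: the narrow one needs 5x less noise.
    print(laplace_scale(0, 100, 1.0), laplace_scale(40, 60, 1.0))
    print(privatize(55.0, 40, 60, 1.0, rng))
```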
    A Multi-Resolution Framework for U-Nets with Applications to Hierarchical VAEs. (arXiv:2301.08187v1 [stat.ML])
    U-Net architectures are ubiquitous in state-of-the-art deep learning, however their regularisation properties and relationship to wavelets are understudied. In this paper, we formulate a multi-resolution framework which identifies U-Nets as finite-dimensional truncations of models on an infinite-dimensional function space. We provide theoretical results which prove that average pooling corresponds to projection within the space of square-integrable functions and show that U-Nets with average pooling implicitly learn a Haar wavelet basis representation of the data. We then leverage our framework to identify state-of-the-art hierarchical VAEs (HVAEs), which have a U-Net architecture, as a type of two-step forward Euler discretisation of multi-resolution diffusion processes which flow from a point mass, introducing sampling instabilities. We also demonstrate that HVAEs learn a representation of time which allows for improved parameter efficiency through weight-sharing. We use this observation to achieve state-of-the-art HVAE performance with half the number of parameters of existing models, exploiting the properties of our continuous-time formulation.  ( 2 min )
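    The stated connection between average pooling and the Haar basis can be checked on a toy 1D signal. The sketch below is a generic single level of the orthonormal Haar transform, not the paper's construction, and the function names are illustrative. Average pooling over pairs equals the Haar approximation coefficients divided by sqrt(2), and the detail coefficients carry exactly what pooling discards.

```python
import math


def average_pool(x):
    """Average pooling with window 2 and stride 2 (len(x) must be even)."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]


def haar_step(x):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    r = math.sqrt(2)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the signal from one level of Haar coefficients."""
    r = math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out += [(a + d) / r, (a - d) / r]
    return out


if __name__ == "__main__":
    signal = [4.0, 2.0, 5.0, 7.0]
    approx, detail = haar_step(signal)
    print(average_pool(signal))
    print([a / math.sqrt(2) for a in approx])  # matches pooling up to rounding
    print(haar_inverse(approx, detail))        # recovers the signal up to rounding
```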
    Robust Gaussian Process Regression with Huber Likelihood. (arXiv:2301.07858v1 [stat.AP])
    Gaussian process regression in its most simplified form assumes normal homoscedastic noise and utilizes analytically tractable mean and covariance functions of predictive posterior distribution using Gaussian conditioning. Its hyperparameters are estimated by maximizing the evidence, commonly known as type II maximum likelihood estimation. Unfortunately, Bayesian inference based on Gaussian likelihood is not robust to outliers, which are often present in the observational training data sets. To overcome this problem, we propose a robust process model in the Gaussian process framework with the likelihood of observed data expressed as the Huber probability distribution. The proposed model employs weights based on projection statistics to scale residuals and bound the influence of vertical outliers and bad leverage points on the latent function estimates while exhibiting a high statistical efficiency at the Gaussian and thick tailed noise distributions. The proposed method is demonstrated by two real world problems and two numerical examples using datasets with additive errors following thick tailed distributions such as Student's t, Laplace, and Cauchy distribution.  ( 2 min )
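    For readers unfamiliar with the Huber distribution, the sketch below shows the standard Huber function, whose negative exponential gives the (unnormalised) Huber density. It is a textbook definition, not the paper's weighted model; the threshold name `delta` and the function names are illustrative. The penalty is quadratic for small residuals and linear for large ones, which is what bounds the influence of outliers.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber function: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def huber_unnormalised_density(residual, delta=1.0):
    """exp(-huber): Gaussian-like core with Laplace-like tails."""
    return math.exp(-huber(residual, delta))


if __name__ == "__main__":
    for r in (0.5, 1.0, 3.0):
        # Compare with the Gaussian penalty 0.5 * r**2, which grows much faster.
        print(r, huber(r), 0.5 * r * r)
```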
    Kinetic Langevin MCMC Sampling Without Gradient Lipschitz Continuity -- the Strongly Convex Case. (arXiv:2301.08039v1 [math.PR])
    In this article we consider sampling from log concave distributions in Hamiltonian setting, without assuming that the objective gradient is globally Lipschitz. We propose two algorithms based on monotone polygonal (tamed) Euler schemes, to sample from a target measure, and provide non-asymptotic 2-Wasserstein distance bounds between the law of the process of each algorithm and the target measure. Finally, we apply these results to bound the excess risk optimization error of the associated optimization problem.  ( 2 min )
    Learning-Rate-Free Learning by D-Adaptation. (arXiv:2301.07733v1 [cs.LG])
    The speed of gradient descent for convex Lipschitz functions is highly dependent on the choice of learning rate. Setting the learning rate to achieve the optimal convergence rate requires knowing the distance D from the initial point to the solution set. In this work, we describe a single-loop method, with no back-tracking or line searches, which does not require knowledge of $D$ yet asymptotically achieves the optimal rate of convergence for the complexity class of convex Lipschitz functions. Our approach is the first parameter-free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. Our method is practical, efficient and requires no additional function value or gradient evaluations each step. An open-source implementation is available (https://github.com/facebookresearch/dadaptation).  ( 2 min )
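    Why the optimal learning rate depends on $D$ can be seen from the classical bound for the subgradient method, stated here as standard background rather than as a result of this paper. For a convex, $G$-Lipschitz function, a constant step size $\gamma$, and the averaged iterate $\bar{x}_n$ after $n$ steps:

```latex
\[
  f(\bar{x}_n) - f_* \;\le\; \frac{D^2}{2\gamma n} + \frac{\gamma G^2}{2},
  \qquad D = \lVert x_0 - x_* \rVert .
\]
% Minimising the right-hand side over \gamma:
\[
  \gamma^\star = \frac{D}{G\sqrt{n}}
  \quad\Longrightarrow\quad
  f(\bar{x}_n) - f_* \;\le\; \frac{D G}{\sqrt{n}} .
\]
```

    The best step size is proportional to the unknown distance $D$, which is the quantity a learning-rate-free method has to estimate on the fly.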
    Rates of convergence for density estimation with generative adversarial networks. (arXiv:2102.00199v3 [math.ST] UPDATED)
    In this work we undertake a thorough study of the non-asymptotic properties of the vanilla generative adversarial networks (GANs). We prove a sharp oracle inequality for the Jensen-Shannon (JS) divergence between the underlying density $\mathsf{p}^*$ and the GAN estimate. We also study the rates of convergence in the context of nonparametric density estimation. In particular, we show that the JS-divergence between the GAN estimate and $\mathsf{p}^*$ decays as fast as $(\log{n}/n)^{2\beta/(2\beta+d)}$ where $n$ is the sample size and $\beta$ determines the smoothness of $\mathsf{p}^*$. To the best of our knowledge, this is the first result in the literature on density estimation using vanilla GANs with JS convergence rates faster than $n^{-1/2}$ in the regime $\beta > d/2$. Moreover, we show that the obtained rate is minimax optimal (up to logarithmic factors) for the considered class of densities.  ( 2 min )
    Catapult Dynamics and Phase Transitions in Quadratic Nets. (arXiv:2301.07737v1 [cs.LG])
    Neural networks trained with gradient descent can undergo non-trivial phase transitions as a function of the learning rate. In (Lewkowycz et al., 2020) it was discovered that wide neural nets can exhibit a catapult phase for super-critical learning rates, where the training loss grows exponentially quickly at early times before rapidly decreasing to a small value. During this phase the top eigenvalue of the neural tangent kernel (NTK) also undergoes significant evolution. In this work, we will prove that the catapult phase exists in a large class of models, including quadratic models and two-layer, homogeneous neural nets. To do this, we show that for a certain range of learning rates the weight norm decreases whenever the loss becomes large. We also empirically study learning rates beyond this theoretically derived range and show that the activation map of ReLU nets trained with super-critical learning rates becomes increasingly sparse as we increase the learning rate.  ( 2 min )
    Understanding the diffusion models by conditional expectations. (arXiv:2301.07882v1 [cs.LG])
    This paper provides several mathematical analyses of the diffusion model in machine learning. The drift term of the backwards sampling process is represented as a conditional expectation involving the data distribution and the forward diffusion. The training process aims to find such a drift function by minimizing the mean-squared residue related to the conditional expectation. Using small-time approximations of the Green's function of the forward diffusion, we show that the analytical mean drift function in DDPM and the score function in SGM asymptotically blow up in the final stages of the sampling process for singular data distributions such as those concentrated on lower-dimensional manifolds, and are therefore difficult to approximate by a network. To overcome this difficulty, we derive a new target function and associated loss, which remains bounded even for singular data distributions. We illustrate the theoretical findings with several numerical examples.  ( 2 min )
  • Open

    Measure for the difference between two distributions
    Hi, I'm looking for a metric that will describe the difference between 2 distributions. This is being used for classification/model selection, to pick a model that is most similar to a theoretical distribution. The distributions are empirical, that is, from a non-parametric bootstrap and a simulation. The differences in the distributions could be in the mean or the variance (but neither need to be normal). I've looked at Kullback-Leibler, but that is intended for a cumulative probability distribution, and that pretty much removes the mean effect since the sum of probabilities must = 1. Had some success with the Kolmogorov-Smirnov distance, which seems useful, as well as the Jensen-Shannon divergence and the Wasserstein distance. The Jeffreys divergence really seems like what I want, but doesn't seem to exist for numerical/empirical distributions in 1 dimension. It seems that most of the metrics for differences in distributions are for probability distributions. My distributions are not that; they may, for example, have a range of 1000-4000, not 0-1. Any other ideas? thanks Mark submitted by /u/marksale11 [link] [comments]  ( 41 min )
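    Two of the distances mentioned in the post work directly on raw samples in their original units, so a range of 1000-4000 is not a problem. The sketch below is a minimal pure-Python illustration with hypothetical function names; both quantities are computed from empirical CDFs, so no binning or rescaling is needed.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in sa + sb)


def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance: area between the two ECDFs."""
    sa, sb = sorted(a), sorted(b)
    grid = sorted(set(sa + sb))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        total += abs(ecdf(sa, left) - ecdf(sb, left)) * (right - left)
    return total


if __name__ == "__main__":
    bootstrap = [1000.0, 2000.0, 3000.0]
    simulation = [1500.0, 2500.0, 3500.0]
    print(ks_distance(bootstrap, simulation))     # unitless, between 0 and 1
    print(wasserstein_1d(bootstrap, simulation))  # same units as the data
```

    The Wasserstein distance is sensitive to a shift in the mean (it is measured in data units), while the KS statistic reacts to any difference in shape. In practice `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance` provide the same two quantities.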

  • Open

    [D] Object detection or image classification? Training a model to recognize playing cards
    Hi all, I have been experimenting with object detection recently, using Faster R-CNN and YOLOv7 to train models on pre-existing datasets. Using a UNO card dataset I was able to quite accurately detect the type of UNO cards, based on the symbol in the top left corner. I used an object detection approach, with UNO cards only being categorized into 14 classes. Based on that, I am wondering what the best approach would be to enhance the model to use for other and more comprehensive card games. Thinking of card games like Munchkin for example, which has 1000s of different cards. For card games like this, object detection might not be the best approach having 1000s of different classes to consider.
    The two different approaches I am considering:
    1. Using object detection, create as many classes as there are different playing cards in the game, training the model to detect every single card individually, or
    2. Using object detection, use playing cards to train the model to detect the playing card itself, then use the detected playing card as input for an image classification algorithm.
    For me there are pros and cons to both methods:
    • The first approach might be much more accurate, as it detects each card individually. On the other hand, it seems to me that it needs considerably more classes and data to feed into those classes. It also might be difficult to expand the model with more unique cards, as you would have to rerun the model every time.
    • The second approach might not be as accurate, as it might not only detect playing cards but also identify other objects as playing cards. On the flip side, it seems to me that it is much easier to expand the model with more unique cards.
    What might be the best approach here? Do you have a different approach to this, which might be more efficient? submitted by /u/Pallemann [link] [comments]  ( 43 min )
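    The second approach in the post (detect a generic "card", then classify the crop) could be wired together roughly as follows. This is only a structural sketch: `detect_cards` and `classify_card` are hypothetical callables standing in for a trained detector and classifier, and the image is a plain nested list so the example runs without any library.

```python
def crop(image, box):
    """Cut the region (x0, y0, x1, y1) out of a row-major nested-list image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def recognise_cards(image, detect_cards, classify_card):
    """Stage 1 finds card boxes; stage 2 labels each cropped card."""
    results = []
    for box in detect_cards(image):
        label = classify_card(crop(image, box))
        results.append((box, label))
    return results


if __name__ == "__main__":
    # Toy 4x4 "image" and stub models, for illustration only.
    image = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 2, 2],
             [0, 0, 2, 2]]
    stub_detector = lambda img: [(0, 0, 2, 2), (2, 2, 4, 4)]
    stub_classifier = lambda card: "card-%d" % card[0][0]
    print(recognise_cards(image, stub_detector, stub_classifier))
```

    With this split, adding new cards only touches the classifier stage; the single-class detector does not need retraining.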
    [D] Speech enhancement - like Adobe Enhance/Audo Studio
    Does anyone here know how Audo / Adobe Enhance work under the hood? Just wondering what open-source tooling already exists of similar quality, likewise with data, and architectures? Would anyone be interested in whipping up something open-source that can be self-hosted? submitted by /u/NegotiationUpbeat545 [link] [comments]  ( 42 min )
    [D] Generate data that is not a dataset
    Hey everyone, I'm currently faced with the challenge of having to generate data that is deliberately not in a dataset. So if you think about the dataset as a distribution, the data points should have a possibly low probability. Additionally, each data point is a 30 dimensional vector and I know the min and max values for each dimension. How do I do that? What kinds of algorithms could I use for that? Can I somehow fit a distribution and sample low probability data points from it? Or a GAN for generating? Or are there obvious classical ML or statistical methods for that? submitted by /u/NiconiusX [link] [comments]  ( 42 min )
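    One simple baseline for the question above, offered as an assumption rather than an established answer: sample uniformly inside the known per-dimension bounds and keep only candidates that are far from every real data point. Distance to the nearest neighbour is used here as a crude proxy for low probability; the function name and the `min_dist` threshold are hypothetical and would need tuning for 30 dimensions.

```python
import math
import random


def sample_far_points(data, lows, highs, n_points, min_dist,
                      rng=None, max_tries=100000):
    """Rejection-sample points in the bounding box that avoid the dataset."""
    rng = rng or random.Random(0)
    accepted = []
    tries = 0
    while len(accepted) < n_points and tries < max_tries:
        tries += 1
        candidate = [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
        nearest = min(math.dist(candidate, row) for row in data)
        if nearest >= min_dist:  # far from all observed points: keep it
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    data = [[0.5, 0.5]]  # toy 2D dataset; the post's data would be 30D
    points = sample_far_points(data, [0.0, 0.0], [1.0, 1.0], 5, 0.3)
    print(points)
```

    Replacing the nearest-neighbour test with a fitted density model (and rejecting candidates above a density threshold) follows the same pattern.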
    [D] Not sure if time series or multiple classifications?
    I am beginning a problem similar to the one below for my work. There is a score from 1 to 4 (1 is bad, 4 is very good) for a person's back sprain recovery. The data we have are back sprain recovery scores recorded after 2 weeks, 3 months and 6 months, along with information (features) about their behavior like sleep, medications, diet, and exercise. We want to predict their 2 week, 3 month, and 6 month back sprain recovery scores based on their initial behavior inputs. For example, given a user sleeps 8 hours a day, consumes x amount of sugar, does physical therapy 4 days a week, and takes x medication, what will their recovery scores be at 2 weeks, 3 months and 6 months? The training data would look like:
    Sleep Average     | Medication | Days of Physical Therapy | Diet      | Week 2 recovery score | Month 3 recovery score | Month 6 recovery score
    9 hours per night | Advil      | 4 days/week              | Healthy   | 2                     | 3                      | 4
    5 hours per night | None       | 0 days/week              | Unhealthy | 1                     | 2                      | 2
    I want a model (or multiple models) to predict 3 values, which are the 2 week, 3 month, and 6 month scores. I am not familiar with time series, but it seems like the data may be too sparse. Should I be using time series here, or should I create 3 classification models? submitted by /u/spiritualquestions [link] [comments]  ( 43 min )
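    The "three classification models" option from the post can be framed as three supervised problems that share one feature matrix. The sketch below uses the two example rows from the post with a hypothetical numeric encoding, and a 1-nearest-neighbour rule as a stand-in for whatever classifier is eventually chosen; it illustrates the data layout, not a recommended model.

```python
HORIZONS = ("week_2", "month_3", "month_6")


def split_targets(rows):
    """Shared features X, plus one label list per prediction horizon."""
    X = [(r["sleep_hours"], r["pt_days"], r["healthy_diet"], r["medicated"])
         for r in rows]
    ys = {h: [r[h] for r in rows] for h in HORIZONS}
    return X, ys


def predict_1nn(X, y, query):
    """Label of the training row closest to the query (placeholder model)."""
    best = min(range(len(X)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)))
    return y[best]


def predict_all(rows, query):
    """Run one independent model per horizon on the same features."""
    X, ys = split_targets(rows)
    return {h: predict_1nn(X, ys[h], query) for h in HORIZONS}


if __name__ == "__main__":
    rows = [
        {"sleep_hours": 9, "pt_days": 4, "healthy_diet": 1, "medicated": 1,
         "week_2": 2, "month_3": 3, "month_6": 4},
        {"sleep_hours": 5, "pt_days": 0, "healthy_diet": 0, "medicated": 0,
         "week_2": 1, "month_3": 2, "month_6": 2},
    ]
    print(predict_all(rows, (8, 4, 1, 1)))
```

    Because each person contributes only three fixed checkpoints and the inputs are measured once, this is closer to multi-output classification than to time-series forecasting.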
    [R] Is there a way to combine a knowledge graph and other types of data for ML purposes?
    Hello, I really don't know how to frame this question but I wanted to ask if there was a way to integrate the relationships and nodes of a knowledge graph with recorded data. For example, when a knowledge graph contains information about relationships between features, can it be integrated with a dataset containing recorded or measured quantities of those features? The goal of this is to "infuse" the recorded dataset with relationships already known in the knowledge graph for some data analysis purpose. I know it sounds confusing but you can ask for clarification on some details. Please help. submitted by /u/Low-Mood3229 [link] [comments]  ( 43 min )
    [D] Discrete vs. Continuous Normalizing Flows
    I'm working on developing methods for some density estimation and inverse modeling tasks on physics simulation data, and normalizing flow methods seem to be a pretty good tool for this job. I'm right now looking to implement a few different model flavors a la INNs and OT-Flow, and am interested in hearing some perspectives from people in the community who have worked with these kinds of models. What would you consider the current state of the art in normalizing flow methods? Most of what I'm finding in the discrete space seems to have converged on flavors of RealNVP, while OT-Flow seems to be the most advanced in the continuous space. Beyond the benchmark performance metrics tabulated in the literature, what can we say about when to prefer continuous vs. discrete models? It's obviously going to be problem-dependent to some degree, but are there general heuristics to be aware of here? For continuous models (and implicit layer methods more generally), where are the research threads currently at improving runtime performance? The 2020 NeurIPS tutorial on implicit layers (link) has been helpful, but it would be interesting to know how things have advanced since then. Any and all insights would be appreciated! submitted by /u/nuclear_knucklehead [link] [comments]  ( 43 min )
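    For readers less familiar with the discrete family mentioned above, the core building block of RealNVP-style flows is the affine coupling layer. The sketch below is a minimal pure-Python version with a fixed toy conditioner (`scale_and_shift` is hypothetical; in a real model it is a neural network). Half of the input passes through unchanged and parameterises an invertible affine map of the other half, so both the inverse and the log-determinant of the Jacobian are cheap.

```python
import math


def scale_and_shift(x1):
    """Toy conditioner: log-scale s and shift t as functions of x1 only."""
    s = [math.tanh(v) for v in x1]
    t = [0.5 * v for v in x1]
    return s, t


def coupling_forward(x):
    """y1 = x1, y2 = x2 * exp(s(x1)) + t(x1); returns (y, log|det J|)."""
    assert len(x) % 2 == 0, "this sketch assumes an even number of dimensions"
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_and_shift(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)


def coupling_inverse(y):
    """Exact inverse: x2 = (y2 - t(y1)) * exp(-s(y1))."""
    assert len(y) % 2 == 0, "this sketch assumes an even number of dimensions"
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_and_shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


if __name__ == "__main__":
    x = [0.3, -1.2, 2.0, 0.7]
    y, log_det = coupling_forward(x)
    print(y, log_det)
    print(coupling_inverse(y))  # recovers x up to floating-point error
```

    Continuous flows replace this closed-form log-determinant with a trace integrated along an ODE solve, which is one source of the runtime gap the post asks about.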
    [N] ESANN 2023 | Special Session on Neuro-Symbolic AI (CFP)
    Neuro-symbolic AI is a promising approach to artificial intelligence that aims to combine the strengths of symbolic reasoning and probabilistic systems. For example, combining inductive logic programming and deep learning with applications in graphs, vision, reasoning and explainability. In this special session, we will provide an overview of neuro-symbolic AI, key concepts, and current state-of-the-art techniques. We will also discuss the potential benefits and challenges of neuro-symbolic AI and its potential impact on various fields and applications. In addition to the tutorial, we welcome contributions from attendees in the context of neuro-symbolic AI. This includes but is not limited to: • Novel neuro-symbolic models and techniques • Applications of neuro-symbolic AI to real-world problems • Empirical evaluations and comparisons of neuro-symbolic AI approaches • Theoretical foundations and analysis of neuro-symbolic AI • Emerging trends and challenges in the field of neuro-symbolic AI ​ Submission guidelines: https://www.esann.org/node/6 Paper submission deadline: 2 May 2023 Conference date: 4-6 October 2023 Conference location: Crowne Plaza hotel Bruges, Belgium submitted by /u/iav_tf_h [link] [comments]  ( 42 min )
    [D] Computationally light-weight deep learning research topics?
    Hello, I am familiar with the theory of differentiable computing and SGD training, having done some research work on semi-supervised learning for image classification and semantic/panoptic segmentation. In other words, I am familiar with understanding and implementing state-of-the-art proposals as well as tweaking them. Now I'm unemployed and interested in conducting some research using PyTorch and Google Colab, which seems feasible only if the problem or topic at hand is relatively low-cost. So I'm asking the question: What are some deep learning (meta-learning, regularization, non-supervised training) or applied DL (CV/NLP/...) topics or datasets that are lightweight enough to be researched with just one GPU? Thanks and have a nice weekend submitted by /u/iamnotlefthanded666 [link] [comments]  ( 42 min )
    [D] "Deep Learning Tuning Playbook" (recently released by Google Brain people)
    https://github.com/google-research/tuning_playbook - Google has released a playbook (solely) about how to tune hyper-parameters of neural networks. Disclaimer: I am unrelated to this repository, just came across it and thought it is suitable for this subreddit. I have searched through and found no posts, thus I post it to hear some comments/insights from you ;) submitted by /u/fzyzcjy [link] [comments]  ( 44 min )
    [D] Did YouTube just add upscaling?
    So, these pictures below are taken from a 144p video on YouTube. You cannot tell me that these aren't CNN upscaling artefacts. So this raises the question of.... how exactly is this implemented? What model are they using which is tiny enough to run on (i assume) WebGL2? Is it a CNN inside of GLSL shaders? Is it something else? CPU side or GPU side? And also... how have I not seen a single other person pointing this out, anywhere on the internet. Believe me I looked. Ain't no one talking about this. EDIT: UPDATE this is doing it in ALL videos in chrome now. It only works in Chrome, not in Discord or Edge, so its not GPU/Windows fuckery. But the strange thing is other friends testing this with the same version of Chrome ***DONT*** have this? And the even stranger thing is... this is running on Intel Integrated Graphics... https://preview.redd.it/jnjwjzyag7da1.png?width=3240&format=png&auto=webp&s=504c9fa6ba41ae3a5b1266fe17e519839d3cf933 https://preview.redd.it/6vzyx5f1g7da1.png?width=1182&format=png&auto=webp&s=b56df5017d1fb742f042c847e42818d7f05a1888 https://preview.redd.it/bo36ko40g7da1.png?width=365&format=png&auto=webp&s=1777e238a7299084da9e10eb62c6c2539dc5cc86 https://preview.redd.it/16zpxwqyf7da1.png?width=333&format=png&auto=webp&s=38b9f2f4eb1a5999ad212aab6247e41a913a5294 submitted by /u/Avelina9X [link] [comments]  ( 46 min )
    [N] OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic
    https://time.com/6247678/openai-chatgpt-kenya-workers/ submitted by /u/ChubChubkitty [link] [comments]  ( 54 min )
    [P] paper-hero: Yet Another Paper Search Tool
    Hi guys, thanks for reading this post. I built a simplistic paper search tool that integrates ACL Anthology, arXiv API, and DBLP API. GitHub address: Spico197/paper-hero Motivation: I'm majoring in NLP and I'd like to search for papers with "Event Extraction" as titles in specific proceedings (e.g. ACL, EMNLP). Challenge: There are lots of search tools and APIs, but few of them provide field-specific searches, like authors, titles, abstracts, and venues. Methodology: I integrate ACL Anthology, arXiv API, and DBLP API, and provide a two-stage search toolkit, which first stores target papers via the official fuzzy search API, and then matches specific fields. Advantages: This tool satisfies my need to stockpile papers and it can dump checklists in markdown format, or complete paper information in jsonl. AND and OR logics are supported in search queries. Limitations: This tool is based on simple string matching, so you have to know some terminologies in the target fields. You are warmly welcome to have a try and feel free to drop me an issue!
from src.interfaces.aclanthology import AclanthologyPaperList
from src.utils import dump_paper_list_to_markdown_checklist

if __name__ == "__main__":
    # use `bash scripts/get_aclanthology.sh` to download and prepare anthology data first
    paper_list = AclanthologyPaperList("cache/aclanthology.json")
    ee_query = {
        "title": [
            # Any of the strings below is matched
            ["information extraction"],
            ["event", "extraction"],  # title must include `event` and `extraction`
            ["event", "argument", "extraction"],
            ["event", "detection"],
            ["event", "classification"],
            ["event", "tracking"],
            ["event", "relation", "extraction"],
        ],
        # Besides the title constraint, venue must also meet the needs
        "venue": [
            ["acl"],
            ["emnlp"],
            ["naacl"],
            ["coling"],
            ["findings"],
            ["tacl"],
            ["cl"],
        ],
    }
    ee_papers = paper_list.search(ee_query)
    dump_paper_list_to_markdown_checklist(ee_papers, "results/ee-paper-list.md")
    submitted by /u/Spico197 [link] [comments]  ( 44 min )
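    The query format in the post reads as "OR over groups, AND within a group, AND across fields". The sketch below is a guess at those semantics in plain Python, based only on the comments in the example; `field_matches` and `paper_matches` are hypothetical names and this is not paper-hero's actual implementation.

```python
def field_matches(text, groups):
    """True if any group has all of its terms present in the text."""
    text = text.lower()
    return any(all(term.lower() in text for term in group) for group in groups)


def paper_matches(paper, query):
    """Every queried field must satisfy its own OR-of-ANDs constraint."""
    return all(field_matches(paper.get(field, ""), groups)
               for field, groups in query.items())


if __name__ == "__main__":
    query = {
        "title": [["information extraction"], ["event", "extraction"]],
        "venue": [["acl"], ["emnlp"]],
    }
    papers = [
        {"title": "Joint Event Argument Extraction", "venue": "ACL"},
        {"title": "Neural Machine Translation", "venue": "ACL"},
        {"title": "Event Extraction at Scale", "venue": "NeurIPS"},
    ]
    print([p["title"] for p in papers if paper_matches(p, query)])
```

    Plain substring matching means a short term such as "cl" also matches inside "acl", which is in line with the limitation the post itself mentions.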
  • Open

    Hey anyone know where I could find an AI assisted writer with no word limit, or at least a VERY high one
    Title submitted by /u/Zan_korida [link] [comments]  ( 40 min )
    Consider this: You know how chatGPT says load failed when its giving YOU the most groundbreaking answer ever? I bet: It’s sent to OpenAI, you get “Load failed” and a less profound answer on regenerate.
    It’s the free preview. No breakthroughs for you. They collect amazing breakthroughs by the minute from humans working the AI. Just add a weight for profoundness 1-10. At 7, crash and send. User never gets it. Thoughts? submitted by /u/Overall-Importance54 [link] [comments]  ( 40 min )
    AI art - automation. A working artist's take.
    submitted by /u/WSCOKN [link] [comments]  ( 40 min )
    This website was created by an AI chatbot, and all of the content was generated by an AI image generator.
    submitted by /u/FreePixelArt [link] [comments]  ( 40 min )
    Amazon Wants To Help Community Colleges with AI
    Amazon has launched an "educator enablement" program to help instructors at community colleges, HBCUs, and other minority-serving institutions learn and teach AI. The professional development program will help college instructors gain a generalist AI skillset. Amazon will provide $1,200 and continuing education credits to 330 participants who complete one of the six boot camps being offered over the course of 2023. For colleges that don't get selected for the educator enablement cohort, Amazon plans to make curriculum materials available to any interested college at no cost through GitHub, YouTube, and AWS Academy. This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/google-brings-in-legendary-duo-for-chatgpt-battle submitted by /u/Mk_Makanaki [link] [comments]  ( 40 min )
    "Sentient AI" - Example Of Just How Easy It Is To Prompt A Fake Sentient AI(GPT3)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 40 min )
    AI WARS: Explaining Google's painfully long, 15k-word! tome about their AI plans... or lack thereof
    We covered this in our newsletter today. Here it is verbatim-- if you find it useful, hit the link and sub: https://smokingrobot.beehiiv.com/p/ai-wars ​ Microsoft has dominated BIG TECH headlines over the last few months, thanks largely to a drumbeat of headlines involving their partner OpenAI and its world-shaping ChatGPT. So awe-striking is Microsoft's hand right now, it has made rival companies' advancements, like Apple's recently announced and insanely powerful M2 MacBook Pros, look pedestrian in comparison. But now Google has entered the chat. And by "entered the chat", we mean that CEO Sundar Pichai - Pich-AI? - released a distressingly long 15,000-word(!) treatise on its own endeavors in AI, signaling a counter attack... maybe... at some point in the future... when and if it…  ( 45 min )
    A ChatGPT software engineer in your pocket
    Having AI tools like ChatGPT is like having a personal software engineer in your pocket. But most people don't know how to craft prompts for code. Here's how you can get AI to write software for you: ​ https://preview.redd.it/tfkoz3x4v8da1.png?width=686&format=png&auto=webp&s=11a88815cb44e68fa2a5f63aa99f5867ac0aa755 submitted by /u/Imagine-your-success [link] [comments]  ( 40 min )
    DREAMBOOTH: 10 MINS TRAINING Inside Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    Powerful Tools to Test and Improve your Chatbot! 🔥
    submitted by /u/Marinuch [link] [comments]  ( 40 min )
    Walking simulators
    For a while I've been seeing a lot of videos like these: https://youtu.be/wqvAconYgK0 https://youtu.be/qvpXpCvkqbc https://youtu.be/kQ2bqz3HPJE I've been wondering what program these might be using, and whether it's open to the public. If not, is there any program you would suggest for achieving identical or at least similar results? I have a few ideas on how this could be used within a game engine and animation workspace. Thanks submitted by /u/SyhrNewo [link] [comments]  ( 40 min )
    Google plans chatbot search engine and 20 new AI products
    submitted by /u/much_successes [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    Master Thesis about AI-designed jewelry
    Dear all! My master's thesis is coming to an end, and I am now seeking insights and data on my topic through a survey. This survey aims to understand consumer awareness and purchase intentions regarding jewelry created with the help of artificial intelligence. You can find the link here: https://ucpresearch.qualtrics.com/.../SV_ex2OALp6fPXQ8Qe I would be grateful if many of you participated and provided me with valuable insights on the topic. Thank you!! submitted by /u/Ok-Rise453 [link] [comments]  ( 40 min )
    ChatGPT Trend
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 40 min )
    Are TensorFlow and other ML frameworks worth learning in 2023?
    For some background, I am more familiar with PyTorch, but I wanted to refresh my knowledge of machine learning and deep learning concepts. However, with the recent trend of moving away from TensorFlow and towards PyTorch, I wonder if TensorFlow and other ML frameworks are still worth learning today. I know certain algorithms are exclusively implemented in one or the other framework. I think at least TensorFlow is good since it’s well-documented and has been put to the test by other experts. But I’m not so sure about newer custom frameworks. If I dive into them, it could be a step back for me, especially after reading this article. It talks about newer ML framework launches and how a lot of people try them out at first, but then interest in those frameworks starts to decrease. I know there are a bunch of good custom frameworks out there, but it might take more time for new tools to become mainstream, or they may eventually die out. That is why I’m afraid to use them at the moment. Let me know what you all think! Thanks! submitted by /u/ActionParticular7697 [link] [comments]  ( 41 min )
    Boston Dynamics reveals Atlas AI robot new ability to grip + autonomously manipulate objects | New 3D modeling Geocode AI creates + Edits highly realistic meshes | Breakthrough Text-To-Video "Tune a Video" uses diffusion models to output coherent video
    submitted by /u/SedatelyMake [link] [comments]  ( 40 min )
    Space Locator
    Hey all. My first post here. I’ll keep it as relevant as possible. I want to develop an AI model that determines whether the space in a room is enough or not. We’ll provide pictures of the room and it’ll give us the output. I think there might be some pre-trained models out there which might be helpful. Please guide me in this regard if there are some models, or tell me where I should start. I’ll be grateful. Thank you so much in advance submitted by /u/h3artb3att [link] [comments]  ( 40 min )
    Porter Robinson music continued by OpenAI Jukebox
    submitted by /u/anoneemoosh [link] [comments]  ( 40 min )
    ChatGPT Accepted As Co-Author On Multiple Research Papers
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Advancements in Natural Language Processing (NLP) and its applications in various industries
    Natural Language Processing (NLP) is a rapidly growing field within the realm of Artificial Intelligence (AI) that is revolutionizing the way we interact with machines. NLP is a branch of AI that deals with the interaction between human language and computers. It is used to analyze, understand and generate human language, and it has a wide range of applications in various industries. One of the most significant advancements in NLP is the development of deep learning algorithms, which have greatly improved the accuracy and efficiency of NLP models. These algorithms have enabled the development of more sophisticated NLP systems, such as those that can understand context and generate human-like responses. One of the most prominent applications of NLP is in the customer service industry. Com…  ( 42 min )
    TextCortex AI: AI Writing Companion - (Our free browser extension)
    submitted by /u/Ruzuyu [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/ei4wm8tv66da1.png?width=1852&format=png&auto=webp&s=e20c196fe9a543e6b4ec58b3d4f689624db7f95d submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
    Created by Stable Diffusion
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Road Map to Machine Learning & Deep Learning
    A Good Road Map To Machine Learning Engineer  ( 14 min )
    Sleep disorders: can AI and Digital Twin help?
    According to the National Sleep Foundation, it is estimated that 50–70 million adults in the United States have a sleep disorder.  ( 25 min )
    Top 10 AI Applications in HRM
    Artificial Intelligence (AI) is revolutionizing the way businesses operate, and the field of Human Resource Management (HRM) is no…  ( 7 min )
    Unlocking the Power of Time Series Forecasting: A Step-by-Step Guide with Code Examples in Python
    Time series forecasting is the process of using a model to predict future values of a time series based on its past values.  ( 10 min )
  • Open

    Exposing Reliability Degradation and Mitigation in Approximate DNNs under Permanent Faults
    submitted by /u/Chipdoc [link] [comments]  ( 40 min )
    Machine vision within airports
    My first idea was to use simple object recognition on the X-ray machines in airports to detect weapons, since blunt objects and plastic weapons are hard for humans to spot. Surely this could be solved with a simple object recognition algorithm. I then started researching airport security and came across what is, without a doubt, the safest airport in the world, the Israeli airport. They have many incredibly invasive processes, such as interviews with highly trained personnel who are trained to spot a liar, and officers following you around if you are deemed "high-risk" (sidenote: they are also unapologetically racist about who they deem as high-risk). Couldn't both of those processes be automated? The interviewing could be done by a non-trained worker, with cameras that analyse the person. The officers following people around could also be replaced by CCTV and some machine vision, or people could even be analysed while they're walking around. I'm not saying that the Israeli airport should do this, as it would be a step down, which they clearly will not accept. Instead, could this not be done in Western airports, as there have been many reports indicating their lack of success in catching anyone? Could they implement weaker versions of the Israeli airport security system? P.S. I recently visited Senegal (West Africa). On the way back from Senegal to the UK I was totally unaware that I had a large water bottle in my bag, and the screening was so useless that they never found it. The guy manning the scanner was watching TikTok on his phone. When I landed I could then have gone anywhere in the world without being checked again, which goes to show how much of a waste of time current airport security is. submitted by /u/Tom_nerd [link] [comments]  ( 41 min )
    Which GPU would be better for training? Linux, PyTorch: I have a Gigabyte RTX 2070 and an AMD FirePro S9300 x2
    submitted by /u/DaOnlyBaby [link] [comments]  ( 40 min )
    🚀Online Real-Time Volumetric Nerf + Slam
    submitted by /u/oridnary_artist [link] [comments]  ( 40 min )
    How do you get AI art generators to produce amazing images that look like real art? Take a text-guided diffusion model and feed it the ideal text prompt with the right keywords
    Paper: https://arxiv.org/abs/2209.11711 Abstract: Recent progress in generative models, especially in text-guided diffusion models, has enabled the production of aesthetically-pleasing imagery resembling the works of professional human artists. However, one has to carefully compose the textual description, called the prompt, and augment it with a set of clarifying keywords. Since aesthetics are challenging to evaluate computationally, human feedback is needed to determine the optimal prompt formulation and keyword combination. In this paper, we present a human-in-the-loop approach to learning the most useful combination of prompt keywords using a genetic algorithm. We also show how such an approach can improve the aesthetic appeal of images depicting the same descriptions. https://preview.redd.it/p47ovsfc66da1.png?width=1852&format=png&auto=webp&s=516eda2724c4f254b59109fd46a65dba8ffd79a1 submitted by /u/Ok_Mine_5742 [link] [comments]  ( 40 min )
  • Open

    What are the current state-of-the-art algorithms?
    What algorithms are currently considered state-of-the-art? I’m specifically interested in those which are off-policy as I have found DQN to be the best choice for my application so far. Some algorithms I’m considering trying are R2D2, Duelling DQN and Rainbow DQN. Are there any others that could be worth a look? submitted by /u/centripetalstranger [link] [comments]  ( 40 min )
    In RL, how does one provide a theoretical justification of why one algorithm works better than the other?
    As a completely random example, let's say that your experiments consistently demonstrate that a recurrent policy (LSTM) in PPO works better than a linear policy, specifically in one kind of environment (say, environments that require cooperation between agents). Now, how do you theoretically justify your empirical finding? In other words, how do you explain the theory behind the linear policy's limitations? submitted by /u/No_Possibility_7588 [link] [comments]  ( 43 min )
    DQN for simple toy env
    Dear RL community, I'm trying to get DQN (using the Stable Baselines3 implementation) to solve my toy environment. No matter how many different hyperparameter configurations I try, I can't get it to work. Here are some details about the environment (with num_objects=1): - The agent is essentially controlling a gripper that moves to one of the squares, grasps the item, and then moves to the target location. - It's a grid world set on 2 levels, with each level of size 3x3. - The available actions are one for each square (the agent is teleported there) and 2 more for grasping and releasing, for a total of 20 discrete actions. - The reward provides a signal based on the distance between the object and the target goal, a big reward on success, or -1 when the agent does something bad. Here's my training code. I've also added a human_step function that can be used to control the gripper yourself. Do you have any insights as to why it's not working? Please ask if you need more details about any part of the implementation. Many thanks! submitted by /u/Mr_Physic13 [link] [comments]  ( 41 min )
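    The action layout described in the post (18 teleport actions for a 2-level 3x3 grid, plus grasp and release) can be sketched as follows. This is only an illustration: the ordering of the indices and every name below are assumptions, not taken from the poster's environment.

```python
# Hypothetical encoding of the 20 discrete actions described in the post.
GRID_LEVELS, GRID_ROWS, GRID_COLS = 2, 3, 3
NUM_SQUARES = GRID_LEVELS * GRID_ROWS * GRID_COLS   # 18 teleport targets
GRASP, RELEASE = NUM_SQUARES, NUM_SQUARES + 1       # assumed to come last
NUM_ACTIONS = NUM_SQUARES + 2                       # 20 in total


def decode_action(action):
    """Map a flat action index to a readable tuple."""
    if not 0 <= action < NUM_ACTIONS:
        raise ValueError(f"action must be in [0, {NUM_ACTIONS}), got {action}")
    if action == GRASP:
        return ("grasp",)
    if action == RELEASE:
        return ("release",)
    # Remaining indices enumerate squares level by level, row by row.
    level, rest = divmod(action, GRID_ROWS * GRID_COLS)
    row, col = divmod(rest, GRID_COLS)
    return ("move", level, row, col)
```

    Printing the decoded action at every step of a short episode is a cheap way to check that the agent's index space matches what the environment expects before tuning hyperparameters.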
    How to proceed scientifically when your hypothesis is falsified?
    I predicted that a certain change in the architecture of my agents would boost their coordination (in the context of multi-agent reinforcement learning). However, I tested this in the Meetup environment and it is not working: the modified agents perform slightly worse than the baseline. This is how the environment works: three agents must collectively choose one of K landmarks and congregate near it. At each time step, each agent receives a reward equal to the change in distance between itself and the landmark closest to all three agents. The goal landmark changes depending on the current position of all agents. When all three agents are adjacent to the same landmark, the agents receive a bonus of 1 and the episode ends. Scientifically speaking, how can I be rigorous about testing this hypothesis again? A few ideas: (1) Repeat the experiment multiple times with different random seeds to ensure that the results are robust and not driven by random variation. (2) Vary the parameters of the agent: vary the number of modules used in the policy and test the effect on coordination; increase the number of agents. (3) Vary the parameters of the environment: change the number of landmarks; add distractors. (4) Test another environment. What do you think? submitted by /u/No_Possibility_7588 [link] [comments]  ( 42 min )
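    For the first idea (repeating runs over random seeds), one simple way to judge whether the gap between baseline and variant is more than seed noise is a permutation test on the per-seed final returns. The sketch below assumes one scalar score per seed; the function name and the default parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(baseline, variant, n_perm=2000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference of means.

    baseline, variant: per-seed scores (e.g. final average return).
    Returns an estimated p-value for "both groups share one distribution".
    """
    rng = random.Random(seed)
    observed = abs(mean(variant) - mean(baseline))
    pooled = list(baseline) + list(variant)
    n_base = len(baseline)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the runs
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_perm + 1)
```

    With only a handful of seeds the test has little power, so a large p-value means "not enough evidence", not "no effect".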
    Continuous action space that should often return an exact value that's inputted (attempt with reinforce algorithm)
    I was wondering if this is bad practice or bound to fail. I'm in an environment where the state is x1 plus a bunch of other things; call them x2. Sometimes the best action is x1 (and exactly x1). Sometimes the best action is a function of x2 (and x1). What I was thinking (and am trying, so far with little success) is to have the REINFORCE algorithm output two Gaussian action values. The first, a1, is f(x1,x2). The second, a2, is put into a sigmoid function and if it's greater than 0.5, use action f(x1,x2) and otherwise x1. It seems weird to me that in the REINFORCE algorithm I'd still use the log probability of both action values if a2<0.5, in which case a1 doesn't get used at all. In that case, should I find the probability only of a2 and use that instead? If this idea is completely off base and you have suggestions, please lmk. submitted by /u/JustTaxLandLol [link] [comments]  ( 41 min )
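    One common alternative, not taken from the post, is to sample the switch as a Bernoulli variable whose probability is the sigmoid of a policy output, instead of thresholding a Gaussian sample. The log-probability then factorizes, and the continuous term is included only when the continuous action is actually used. A minimal sketch under that assumption (all names are hypothetical):

```python
import math


def gaussian_log_prob(x, mean, std):
    """Log-density of a 1-D normal distribution at x."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def hybrid_log_prob(gate, a1, gate_logit, mean, std):
    """Log-probability of a (gate, a1) action under a factorized policy.

    gate == 1: the executed action is the sampled a1 ~ N(mean, std).
    gate == 0: the executed action is x1 itself, so a1 plays no role
               and contributes nothing to the log-probability.
    """
    p_use_a1 = 1.0 / (1.0 + math.exp(-gate_logit))
    if gate == 1:
        return math.log(p_use_a1) + gaussian_log_prob(a1, mean, std)
    return math.log(1.0 - p_use_a1)
```

    In this formulation the REINFORCE gradient for the Gaussian head is zero on steps where the switch selects x1, which matches the intuition in the question.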
    Environment for General AI using Reinforcement Learning?
    I have many custom environments with these features: an observation, the valid actions at a specific observation, and a sparse reward of either 1 or 0 when the episode terminates. I want to build a reinforcement learning agent that can perform well in these environments. I started this personal project 2 years ago and it gets harder the more I try to do it. Are there any other environments like this, or what papers can I read to learn more about this kind of reinforcement learning? submitted by /u/Open_Ranger4375 [link] [comments]  ( 41 min )
    agent not learning using dqn.
    Hello forum, I am trying to get a single joint actuated link to stand upright. I am using dqn as a method for the agent to learn. I have tried using different inputs and outputs with the neural net but is still failing to learn. Can you take a look at my code? #!/usr/bin/env python3 from interbotix_xs_modules.arm import InterbotixManipulatorXS import rospy import time import numpy as np import matplotlib.pyplot as plt import torch import torch.nn as nn import torch.optim as optim import math import random import matplotlib.pyplot as plt ​ from std_msgs.msg import Float64 from gazebo_msgs.msg import LinkStates from geometry_msgs.msg import Pose, Twist from std_srvs.srv import Empty from sensor_msgs.msg import JointState ​ bot = InterbotixManipulatorXS(robot_model="rx15…  ( 44 min )
  • Open

    How CCC Intelligent Solutions created a custom approach for hosting complex AI models using Amazon SageMaker
    This post is co-written by Christopher Diaz, Sam Kinard, Jaime Hidalgo and Daniel Suarez  from CCC Intelligent Solutions. In this post, we discuss how CCC Intelligent Solutions (CCC) combined Amazon SageMaker with other AWS services to create a custom solution capable of hosting the types of complex artificial intelligence (AI) models envisioned. CCC is a […]  ( 13 min )
  • Open

    Oval orbits?
    Johannes Kepler thought that planetary orbits were ellipses. Giovanni Cassini thought they were ovals. Kepler was right, but Cassini wasn’t far off. In everyday speech, people use the words ellipse and oval interchangeably. But in mathematics these terms are distinct. There is one definition of an ellipse, and several definitions of an oval. To be […] Oval orbits? first appeared on John D. Cook.  ( 6 min )
    Cassini ovals
    An ellipse can be defined as the set of points such that the sum of the distances to two fixed points, the foci, has a constant value. A Cassini oval is the set of points such that the product of the distances to two foci has a constant value. You can write down an equation […] Cassini ovals first appeared on John D. Cook.  ( 5 min )
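    To make the two definitions concrete, here is a minimal Python sketch. The placement of the foci at (-a, 0) and (a, 0), the constant b² for the product, and the function names are assumptions chosen for illustration; they are not taken from the post.

```python
import math


def distance_sum(x, y, a):
    """Sum of distances from (x, y) to the foci (-a, 0) and (a, 0).

    Constant along an ellipse with those foci.
    """
    return math.hypot(x + a, y) + math.hypot(x - a, y)


def distance_product(x, y, a):
    """Product of distances from (x, y) to the same two foci.

    Constant along a Cassini oval with those foci.
    """
    return math.hypot(x + a, y) * math.hypot(x - a, y)


def on_cassini_oval(x, y, a, b, tol=1e-9):
    """True if (x, y) satisfies distance_product == b**2 within tol."""
    return abs(distance_product(x, y, a) - b * b) <= tol
```

    For example, with a = 1 and b = 2 the point (√5, 0) lies on the oval, because (√5 + 1)(√5 − 1) = 4 = b².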
    Bounds on power series coefficients
    Let f be an analytic function on the unit disk with f(0) = 0 and derivative f ′(0) = 1. If f is one-to-one (injective) then this puts a strict limit on the size of the series coefficients. Let an be the nth coefficient in the power series for f centered at 0. If f is one-to-one […] Bounds on power series coefficients first appeared on John D. Cook.  ( 5 min )
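    The summary is cut off before the bound is stated. Assuming it refers to de Branges's theorem (formerly the Bieberbach conjecture), which gives |a_n| ≤ n for such functions, the sketch below computes the coefficients of the Koebe function z/(1 − z)², the standard example in which the bound is attained. The function name is hypothetical.

```python
def koebe_coefficients(n_terms):
    """Return [a_1, ..., a_n] for the Koebe function z / (1 - z)^2.

    1/(1 - z) = 1 + z + z^2 + ..., so 1/(1 - z)^2 is the Cauchy product
    of that series with itself; multiplying by z shifts each index by one.
    """
    geometric = [1] * n_terms
    squared = [
        sum(geometric[j] * geometric[k - j] for j in range(k + 1))
        for k in range(n_terms)
    ]
    # squared[k] is the coefficient of z^k in 1/(1 - z)^2,
    # hence the coefficient of z^(k + 1) in z/(1 - z)^2.
    return squared
```

    The returned list is 1, 2, 3, ..., so a_n = n and the bound holds with equality for every n.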
  • Open

    What Is AI Computing?
    The abacus, sextant, slide rule and computer. Mathematical instruments mark the history of human progress. They’ve enabled trade and helped navigate oceans, and advanced understanding and quality of life. The latest tool propelling science and industry is AI computing. AI Computing Defined AI computing is the math-intensive process of calculating machine learning algorithms, typically using […]  ( 8 min )
  • Open

    MIT researchers develop an AI model that can detect future lung cancer risk
    Deep-learning model takes a personalized approach to assessing each patient’s risk of lung cancer based on CT scans.  ( 10 min )
  • Open

    Fully Autonomous Real-World Reinforcement Learning with Applications to Mobile Manipulation
    Reinforcement learning provides a conceptual framework for autonomous agents to learn from experience, analogously to how one might train a pet with treats. But practical applications of reinforcement learning are often far from natural: instead of using RL to learn through trial and error by actually attempting the desired task, typical RL applications use a separate (usually simulated) training phase. For example, AlphaGo did not learn to play Go by competing against thousands of humans, but rather by playing against itself in simulation. While this kind of simulated training is appealing for games where the rules are perfectly known, applying this to real world domains such as robotics can require a range of complex approaches, such as the use of simulated data, or instrumenting real-wo…  ( 4 min )
  • Open

    Reverse Differentiation via Predictive Coding. (arXiv:2103.04689v3 [cs.LG] UPDATED)
    Deep learning has redefined the field of artificial intelligence (AI) thanks to the rise of artificial neural networks, which are architectures inspired by their neurological counterpart in the brain. Through the years, this dualism between AI and neuroscience has brought immense benefits to both fields, allowing neural networks to be used in dozens of applications. These networks use an efficient implementation of reverse differentiation, called backpropagation (BP). This algorithm, however, is often criticized for its biological implausibility (e.g., lack of local update rules for the parameters). Therefore, biologically plausible learning methods that rely on predictive coding (PC), a framework for describing information processing in the brain, are increasingly studied. Recent works prove that these methods can approximate BP up to a certain margin on multilayer perceptrons (MLPs), and asymptotically on any other complex model, and that zero-divergence inference learning (Z-IL), a variant of PC, is able to exactly implement BP on MLPs. However, the recent literature shows also that there is no biologically plausible method yet that can exactly replicate the weight update of BP on complex models. To fill this gap, in this paper, we generalize (PC and) Z-IL by directly defining them on computational graphs, and show that it can perform exact reverse differentiation. What results is the first biologically plausible algorithm that is equivalent to BP in the way of updating parameters on any neural network, providing a bridge between the interdisciplinary research of neuroscience and deep learning.  ( 2 min )
    Digital Twin-Based Multiple Access Optimization and Monitoring via Model-Driven Bayesian Learning. (arXiv:2210.05582v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control and monitor software-based, "open", communication systems, which play the role of the physical twin (PT). In the general framework presented in this work, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL) and monitoring of the PT for anomaly detection. We specifically investigate the application of the proposed framework to a simple case-study system encompassing multiple sensing devices that report to a common receiver. The Bayesian model trained at the DT has the key advantage of capturing epistemic uncertainty regarding the communication system, e.g., regarding current traffic conditions, which arise from limited PT-to-DT data transfer. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact. (arXiv:2106.02359v3 [cs.CL] UPDATED)
    Recent years have seen many breakthroughs in natural language processing (NLP), transitioning it from a mostly theoretical field to one with many real-world applications. Noting the rising number of applications of other machine learning and AI techniques with pervasive societal impact, we anticipate the rising importance of developing NLP technologies for social good. Inspired by theories in moral philosophy and global priorities research, we aim to promote a guideline for social good in the context of NLP. We lay the foundations via the moral philosophy definition of social good, propose a framework to evaluate the direct and indirect real-world impact of NLP tasks, and adopt the methodology of global priorities research to identify priority causes for NLP research. Finally, we use our theoretical framework to provide some practical guidelines for future NLP research for social good. Our data and code are available at this http URL. In addition, we curate a list of papers and resources on NLP for social good at https://github.com/zhijing-jin/NLP4SocialGood_Papers.  ( 2 min )
    Modelling Difference Between Censored and Uncensored Electric Vehicle Charging Demand. (arXiv:2301.06418v2 [cs.AI] UPDATED)
    Electric vehicle charging demand models, with charging records as input, will inherently be biased toward the supply of available chargers, as the data do not include demand lost from occupied stations and competitors. This lost demand implies that the records only observe a fraction of the total demand, i.e. the observations are censored, and actual demand is likely higher than what the data reflect. Machine learning models often neglect to account for this censored demand when forecasting the charging demand, which limits models' applications for future expansions and supply management. We address this gap by modelling the charging demand with probabilistic censorship-aware graph neural networks, which learn the latent demand distribution in both the spatial and temporal dimensions. We use GPS trajectories from cars in Copenhagen, Denmark, to study how censoring occurs and how much demand is lost due to occupied chargers and competing services. We find that censorship varies throughout the city and over time, encouraging spatial and temporal modelling. We find that in some regions of Copenhagen, censorship occurs 61% of the time. Our results show censorship-aware models provide better prediction and uncertainty estimation of actual future demand than censorship-unaware models. Our results suggest that future models based on charging records should account for the censoring to expand the application areas of machine learning models in supply management and infrastructure expansion.  ( 2 min )
    Scalable Deep Graph Clustering with Random-walk based Self-supervised Learning. (arXiv:2112.15530v2 [cs.LG] UPDATED)
    Web-based interactions can be frequently represented by an attributed graph, and node clustering in such graphs has received much attention lately. Multiple efforts have successfully applied Graph Convolutional Networks (GCN), though with some limits on accuracy as GCNs have been shown to suffer from over-smoothing issues. Though other methods (particularly those based on Laplacian Smoothing) have reported better accuracy, a fundamental limitation of all the work is a lack of scalability. This paper addresses this open problem by relating the Laplacian smoothing to the Generalized PageRank and applying a random-walk based algorithm as a scalable graph filter. This forms the basis for our scalable deep clustering algorithm, RwSL, where through a self-supervised mini-batch training mechanism, we simultaneously optimize a deep neural network for sample-cluster assignment distribution and an autoencoder for a clustering-oriented embedding. Using 6 real-world datasets and 6 clustering metrics, we show that RwSL achieved improved results over several recent baselines. Most notably, we show that RwSL, unlike all other deep clustering frameworks, can continue to scale beyond graphs with more than one million nodes, i.e., handle web-scale. We also demonstrate how RwSL could perform node clustering on a graph with 1.8 billion edges using only a single GPU.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
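    As a reminder of the procedure being analysed, here is a minimal leave-one-out cross validation sketch. It uses the sample mean as the estimator and squared error as the loss; both choices and the function name are assumptions made for brevity, not the settings studied in the paper.

```python
def loocv_squared_error(ys):
    """Leave-one-out CV risk estimate for the sample-mean predictor.

    For each observation, the mean of the remaining points is used as the
    prediction and the squared error on the held-out point is recorded.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(ys)
    errors = []
    for y in ys:
        held_out_mean = (total - y) / (n - 1)  # estimator fitted without y
        errors.append((y - held_out_mean) ** 2)
    return sum(errors) / n
```

    Stability here means that dropping a single point changes the fitted mean by only O(1/n), which is the kind of property the abstract says is sufficient for the estimate to concentrate.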
    Adversarial AI in Insurance: Pervasiveness and Resilience. (arXiv:2301.07520v1 [cs.LG])
    The rapid and dynamic pace of Artificial Intelligence (AI) and Machine Learning (ML) is revolutionizing the insurance sector. AI offers significant, very much welcome advantages to insurance companies, and is fundamental to their customer-centricity strategy. It also poses challenges, in the project and implementation phase. Among those, we study Adversarial Attacks, which consist of the creation of modified input data to deceive an AI system and produce false outputs. We provide examples of attacks on insurance AI applications, categorize them, and argue on defence methods and precautionary systems, considering that they can involve few-shot and zero-shot multilabelling. A related topic, with growing interest, is the validation and verification of systems incorporating AI and ML components. These topics are discussed in various sections of this paper.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Weight Matrix Dimensionality Reduction in Deep Learning via Kronecker Multi-layer Architectures. (arXiv:2204.04273v2 [cs.LG] UPDATED)
    Deep learning using neural networks is an effective technique for generating models of complex data. However, training such models can be expensive when networks have large model capacity resulting from a large number of layers and nodes. For training in such a computationally prohibitive regime, dimensionality reduction techniques ease the computational burden, and allow implementations of more robust networks. We propose a novel type of such dimensionality reduction via a new deep learning architecture based on fast matrix multiplication of a Kronecker product decomposition; in particular our network construction can be viewed as a Kronecker product-induced sparsification of an "extended" fully connected network. Analysis and practical examples show that this architecture allows a neural network to be trained and implemented with a significant reduction in computational time and resources, while achieving a similar error level compared to a traditional feedforward neural network.  ( 2 min )
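    The computational saving behind Kronecker-structured layers comes from a standard identity: with row-major flattening, (A ⊗ B) x equals the flattened A X Bᵀ, so the large matrix never has to be formed. The pure-Python sketch below checks that identity on a small case; it illustrates the identity only and is not the architecture proposed in the paper. All function names are hypothetical.

```python
def matmul(P, Q):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(M):
    return [list(col) for col in zip(*M)]


def kron(A, B):
    """Explicit Kronecker product: (m*p) x (n*q) for A m x n, B p x q."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without forming the Kronecker product.

    x is reshaped row-major into an n x q matrix X; the result is the
    row-major flattening of A @ X @ B^T.
    """
    n, q = len(A[0]), len(B[0])
    X = [x[j * q:(j + 1) * q] for j in range(n)]
    Y = matmul(matmul(A, X), transpose(B))
    return [value for row in Y for value in row]
```

    A dense layer of this shape stores m·p·n·q weights, while the two factors store only m·n + p·q.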
    Dirichlet-Neumann learning algorithm for solving elliptic interface problems. (arXiv:2301.07361v1 [math.NA])
    Non-overlapping domain decomposition methods are natural for solving interface problems arising from various disciplines. However, the numerical simulation requires technical analysis and is often available only with the use of high-quality grids, thereby impeding their use in more complicated situations. To remove the burden of mesh generation and to effectively tackle the interface jump conditions, a novel mesh-free scheme, the Dirichlet-Neumann learning algorithm, is proposed in this work to solve the benchmark elliptic interface problem with high-contrast coefficients as well as irregular interfaces. By resorting to the variational principle, we carry out a rigorous error analysis to evaluate the discrepancy caused by the boundary penalty treatment for each decomposed subproblem, which paves the way for realizing the Dirichlet-Neumann algorithm using neural network extension operators. The effectiveness and robustness of our proposed methods are demonstrated experimentally through a series of elliptic interface problems, achieving better performance than other alternatives, especially in the presence of erroneous flux prediction at the interface.  ( 2 min )
    CLIPTER: Looking at the Bigger Picture in Scene Text Recognition. (arXiv:2301.07464v1 [cs.CV])
    Understanding the scene is often essential for reading text in real-world scenarios. However, current scene text recognizers operate on cropped text images, unaware of the bigger picture. In this work, we harness the representative power of recent vision-language models, such as CLIP, to provide the crop-based recognizer with scene, image-level information. Specifically, we obtain a rich representation of the entire image and fuse it with the recognizer word-level features via cross-attention. Moreover, a gated mechanism is introduced that gradually shifts to the context-enriched representation, enabling simply fine-tuning a pretrained recognizer. We implement our model-agnostic framework, named CLIPTER - CLIP Text Recognition, on several leading text recognizers and demonstrate consistent performance gains, achieving state-of-the-art results over multiple benchmarks. Furthermore, an in-depth analysis reveals improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.  ( 2 min )
    Prompting Large Language Model for Machine Translation: A Case Study. (arXiv:2301.07069v2 [cs.CL] UPDATED)
    Research on prompting has shown excellent performance with little or even no supervised training across many tasks. However, prompting for machine translation is still under-explored in the literature. We fill this gap by offering a systematic study on prompting strategies for translation, examining various factors for prompt template and demonstration example selection. We further explore the use of monolingual data and the feasibility of cross-lingual, cross-domain, and sentence-to-document transfer learning in prompting. Extensive experiments with GLM-130B (Zeng et al., 2022) as the testbed show that 1) the number and the quality of prompt examples matter, where using suboptimal examples degenerates translation; 2) several features of prompt examples, such as semantic similarity, show significant Spearman correlation with their prompting performance; yet, none of the correlations are strong enough; 3) using pseudo parallel prompt examples constructed from monolingual data via zero-shot prompting could improve translation; and 4) improved performance is achievable by transferring knowledge from prompt examples selected in other settings. We finally provide an analysis on the model outputs and discuss several problems that prompting still suffers from.  ( 2 min )
    Neural DAEs: Constrained neural networks. (arXiv:2211.14302v2 [cs.LG] UPDATED)
    In this article we investigate the effect of explicitly adding auxiliary trajectory information to neural networks for dynamical systems. We draw inspiration from the field of differential-algebraic equations and differential equations on manifolds and implement similar methods in residual neural networks. We discuss constraints through stabilization as well as projection methods, and show when to use which method based on experiments involving simulations of multi-body pendulums and molecular dynamics scenarios. Several of our methods are easy to implement in existing code and have limited impact on training performance while giving significant boosts in terms of inference.  ( 2 min )
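To make the idea of a projection method concrete, here is a minimal sketch for a single planar pendulum. It assumes a learned model has produced a position and velocity that drift off the constraint manifold, and it projects them back so that the rod length is preserved and the velocity is tangent to the circle. The function name `project_to_pendulum` and the sample numbers are hypothetical; the paper's own constraint handling may differ.

```python
import math

def project_to_pendulum(pos, vel, length=1.0):
    """Project a predicted state onto the constraints of a rigid pendulum.

    Position constraint: |pos| == length.
    Velocity constraint: vel is orthogonal to pos (no radial motion).
    """
    x, y = pos
    r = math.hypot(x, y)
    ux, uy = x / r, y / r                     # unit vector along the rod
    projected_pos = (length * ux, length * uy)
    radial_speed = vel[0] * ux + vel[1] * uy  # component that would stretch the rod
    projected_vel = (vel[0] - radial_speed * ux, vel[1] - radial_speed * uy)
    return projected_pos, projected_vel

# A slightly inconsistent state, such as an unconstrained network step might produce.
pos, vel = project_to_pendulum((0.62, -0.83), (0.9, 0.5))
```

A stabilization approach would instead add a term to the dynamics that pulls the state back toward the manifold over time, rather than correcting it in one step after each update.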
    Nostradamus: Weathering Worth. (arXiv:2212.05933v2 [q-fin.ST] UPDATED)
    Nostradamus, inspired by the French astrologer and reputed seer, is a detailed study exploring relations between environmental factors and changes in the stock market. In this paper, we analyze associative correlation and causation between environmental elements (including natural disasters, climate and weather conditions) and stock prices, using historical stock market data, historical climate data, and various climate indicators such as carbon dioxide emissions. We have conducted our study based on the US financial market, global climate trends, and daily weather records to demonstrate a significant relationship between climate and stock price fluctuation. Our analysis covers both short-term and long-term rises and dips in company stock performances. Lastly, we take four natural disasters as a case study to observe the effect they have on people's emotional state and their influence on the stock market.  ( 2 min )
    Quantification of geogrid lateral restraint using transparent sand and deep learning-based image segmentation. (arXiv:2212.02939v2 [physics.geo-ph] UPDATED)
    An experimental technique is presented to quantify the lateral restraint provided by a geogrid embedded in granular soil at the particle level. Repeated load triaxial tests were done on transparent sand specimens with geosynthetic inclusions simulating geogrids. Particle outlines on laser illuminated planes through the specimens were segmented using a deep learning-based segmentation algorithm. The particle outlines were characterized in terms of Fourier shape descriptors and tracked across sequentially captured images. The accuracy of the particle displacement measurements was validated against Digital Image Correlation (DIC) measurements. In addition, the method's resolution and repeatability are presented. Based on the measured particle displacements and rotations, a state boundary line between probable and improbable particle motions was identified for each test. The size of the zone of probable motions could be used to quantify the lateral restraint provided by the inclusions. Overall, the test results revealed that the geosynthetic inclusions restricted both particle displacements and rotations. However, the particle displacements were found to be restrained more significantly than the rotations. Finally, a unique relationship was found between the magnitude of the permanent strains of the specimens and the size of the zone of probable motions.  ( 2 min )
    Generalized Many-Body Dispersion Correction through Random-phase Approximation for Chemically Accurate Density Functional Theory. (arXiv:2210.09784v4 [physics.chem-ph] UPDATED)
    We extend our recently proposed Deep Learning-aided many-body dispersion (DNN-MBD) model to quadrupole polarizability (Q) terms using a generalized Random Phase Approximation (RPA) formalism, thus enabling the inclusion of van der Waals contributions beyond dipole. The resulting DNN-MBDQ model relies only on ab initio-derived quantities, as the introduced quadrupole polarizabilities are recursively retrieved from dipole ones, which are in turn modelled via the Tkatchenko-Scheffler method. A transferable and efficient deep neural network (DNN) provides atom-in-molecule volumes, while a single range-separation parameter is used to couple the model to Density Functional Theory (DFT). Since it can be computed at a negligible cost, the DNN-MBDQ approach can be coupled with DFT functionals such as PBE, PBE0 and B86bPBE (dispersionless). The DNN-MBDQ-corrected functionals reach chemical accuracy while exhibiting lower errors compared to their dipole-only counterparts.  ( 2 min )
    Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models. (arXiv:2301.06267v2 [cs.CV] UPDATED)
    The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better ${\bf visual}$ dog classifier by ${\bf read}$ing about dogs and ${\bf listen}$ing to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification.  ( 2 min )
    Spiking Neural Network Decision Feedback Equalization. (arXiv:2211.04756v3 [eess.SP] UPDATED)
    In the past years, artificial neural networks (ANNs) have become the de-facto standard to solve tasks in communications engineering that are difficult to solve with traditional methods. In parallel, the artificial intelligence community drives its research to biology-inspired, brain-like spiking neural networks (SNNs), which promise extremely energy-efficient computing. In this paper, we investigate the use of SNNs in the context of channel equalization for ultra-low complexity receivers. We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE). For conversion of real-world data into spike signals we introduce a novel ternary encoding and compare it with traditional log-scale encoding. We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels. We highlight that mainly the conversion of the channel output to spikes introduces a small performance penalty. The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.  ( 2 min )
    Antenna Array Calibration Via Gaussian Process Models. (arXiv:2301.06582v2 [eess.SP] UPDATED)
    Antenna array calibration is necessary to maintain the high fidelity of beam patterns across a wide range of advanced antenna systems and to ensure channel reciprocity in time division duplexing schemes. Despite the continuous development in this area, most existing solutions are optimised for specific radio architectures, require standardised over-the-air data transmission, or serve as extensions of conventional methods. The diversity of communication protocols and hardware is problematic, since it requires designing or updating the calibration procedures for each new advanced antenna system. In this study, we formulate antenna calibration in an alternative way, namely as a task of functional approximation, and address it via Bayesian machine learning. Our contributions are three-fold. Firstly, we define a parameter space, based on near-field measurements, that captures the underlying hardware impairments corresponding to each radiating element, their positional offsets, as well as the mutual coupling effects between antenna elements. Secondly, Gaussian process regression is used to form models from a sparse set of the aforementioned near-field data. Once deployed, the learned non-parametric models effectively serve to continuously transform the beamforming weights of the system, resulting in corrected beam patterns. Lastly, we demonstrate the viability of the described methodology for both digital and analog beamforming antenna arrays of different scales and discuss its further extension to support real-time operation with dynamic hardware impairments.  ( 2 min )
    Teacher Forcing Recovers Reward Functions for Text Generation. (arXiv:2210.08708v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) has been widely used in text generation to alleviate the exposure bias issue or to utilize non-parallel datasets. The reward function plays an important role in making RL training successful. However, previous reward functions are typically task-specific and sparse, restricting the use of RL. In our work, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing. We additionally propose a simple modification to stabilize the RL training on non-parallel datasets with our induced reward function. Empirical results show that our method outperforms self-training and reward regression methods on several text generation tasks, confirming the effectiveness of our reward function.  ( 2 min )
    Planning and Learning with Adaptive Lookahead. (arXiv:2201.12403v2 [cs.LG] UPDATED)
    Some of the most powerful reinforcement learning frameworks use planning for action selection. Interestingly, their planning horizon is either fixed or determined arbitrarily by the state visitation history. Here, we expand beyond the naive fixed horizon and propose a theoretically justified strategy for adaptive selection of the planning horizon as a function of the state-dependent value estimate. We propose two variants for lookahead selection and analyze the trade-off between iteration count and computational complexity per iteration. We then devise a corresponding deep Q-network algorithm with an adaptive tree search horizon. We separate the value estimation per depth to compensate for the off-policy discrepancy between depths. Lastly, we demonstrate the efficacy of our adaptive lookahead method in a maze environment and Atari.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
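The stated link between exponential loss and squared standardized residuals can be checked numerically. The sketch below rests on assumptions not spelled out in the abstract: labels are coded as y in {-1, +1}, the score f is read as the log-odds of the positive class, and "standardized residual" is taken to mean the Pearson residual (y01 - p) / sqrt(p(1 - p)) with y01 in {0, 1}. Under those assumptions the squared residual equals exp(-y f) exactly; the paper may use a different scaling of the score.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def exponential_loss(y, f):
    """Exponential loss exp(-y * f) for a label y in {-1, +1} and score f."""
    return math.exp(-y * f)

def squared_standardized_residual(y, f):
    """Squared Pearson residual of a logistic model whose log-odds is f."""
    p = sigmoid(f)            # modelled probability of the positive class
    y01 = (y + 1) / 2         # recode {-1, +1} as {0, 1}
    return (y01 - p) ** 2 / (p * (1.0 - p))

# Compare the two quantities on a few (label, score) pairs.
pairs = [(y, f) for y in (-1, 1) for f in (-3.0, -0.5, 0.0, 0.7, 2.5)]
gaps = [abs(exponential_loss(y, f) - squared_standardized_residual(y, f))
        for y, f in pairs]
```

The algebra behind the check is short: for y = +1 the squared residual is (1 - p) / p = exp(-f), and for y = -1 it is p / (1 - p) = exp(f). This also suggests why exponential loss is sensitive to outliers: a confidently misclassified point has a very large standardized residual.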
    Operator Learning Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v2 [cs.LG] UPDATED)
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) are an emerging class of machine learning algorithms that serve as a "faster surrogate" for approximating solutions to partial differential equations (PDEs), owing to their ability to approximate mathematical operators, in contrast to the direct approximation performed by Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation: a 1-dimensional ordinary differential equation (ODE), a general diffusion system, and a convection-diffusion (Burgers) system. Solutions for the ODE and diffusion systems yield accurate and reliable results ($R^2 > 0.95$), while solutions for the Burgers system need further refinement in the ONet algorithm.  ( 2 min )
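The branch-trunk idea can be sketched as a forward pass. This is an untrained toy under stated assumptions: the branch network sees the input function sampled at a fixed number of sensor points, the trunk network sees a query location, each is a single tanh layer with random weights, and the prediction is their dot product. The names `dense`, `random_layer`, and `onet`, and the sizes `m` and `p`, are hypothetical and not taken from the paper.

```python
import math
import random

def dense(weights, biases, inputs):
    """One fully connected layer with tanh activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_out, n_in, rng):
    """Random (untrained) weights and biases, for illustration only."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
m, p = 4, 3                       # m sensor points, p latent features
branch = random_layer(p, m, rng)  # encodes the sampled input function
trunk = random_layer(p, 1, rng)   # encodes the query location

def onet(u_samples, y):
    """Approximate G(u)(y) as the dot product of branch and trunk features."""
    b = dense(*branch, u_samples)
    t = dense(*trunk, [y])
    return sum(bk * tk for bk, tk in zip(b, t))

u = [math.sin(x) for x in (0.0, 0.5, 1.0, 1.5)]  # input function at the sensors
prediction = onet(u, 0.25)                       # value of the output function at y = 0.25
```

Because the query location enters only through the trunk, the same encoded input function can be evaluated at any number of query points, which is what makes the construction attractive as a surrogate.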
    Joint Representation Learning for Text and 3D Point Cloud. (arXiv:2301.07584v1 [cs.CV])
    Recent advancements in vision-language pre-training (e.g. CLIP) have shown that vision models can benefit from language supervision. While many models using language modality have achieved great success on 2D vision tasks, the joint representation learning of 3D point cloud with text remains under-explored due to the difficulty of 3D-Text data pair acquisition and the irregularity of 3D data structure. In this paper, we propose a novel Text4Point framework to construct language-guided 3D point cloud models. The key idea is utilizing 2D images as a bridge to connect the point cloud and the language modalities. The proposed Text4Point follows the pre-training and fine-tuning paradigm. During the pre-training stage, we establish the correspondence of images and point clouds based on the readily available RGB-D data and use contrastive learning to align the image and point cloud representations. Together with the well-aligned image and text features achieved by CLIP, the point cloud features are implicitly aligned with the text embeddings. Further, we propose a Text Querying Module to integrate language information into 3D representation learning by querying text embeddings with point cloud features. For fine-tuning, the model learns task-specific 3D representations under informative language guidance from the label set without 2D images. Extensive experiments demonstrate that our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection. The code will be available here: https://github.com/LeapLabTHU/Text4Point  ( 2 min )
    A Bayesian Framework for Digital Twin-Based Control, Monitoring, and Data Collection in Wireless Systems. (arXiv:2212.01351v2 [eess.SP] UPDATED)
    Commonly adopted in the manufacturing and aerospace sectors, digital twin (DT) platforms are increasingly seen as a promising paradigm to control, monitor, and analyze software-based, "open", communication systems. Notably, DT platforms provide a sandbox in which to test artificial intelligence (AI) solutions for communication systems, potentially reducing the need to collect data and test algorithms in the field, i.e., on the physical twin (PT). A key challenge in the deployment of DT systems is to ensure that virtual control optimization, monitoring, and analysis at the DT are safe and reliable, avoiding incorrect decisions caused by "model exploitation". To address this challenge, this paper presents a general Bayesian framework with the aim of quantifying and accounting for model uncertainty at the DT that is caused by limitations in the amount and quality of data available at the DT from the PT. In the proposed framework, the DT builds a Bayesian model of the communication system, which is leveraged to enable core DT functionalities such as control via multi-agent reinforcement learning (MARL), monitoring of the PT for anomaly detection, prediction, data-collection optimization, and counterfactual analysis. To exemplify the application of the proposed framework, we specifically investigate a case-study system encompassing multiple sensing devices that report to a common receiver. Experimental results validate the effectiveness of the proposed Bayesian framework as compared to standard frequentist model-based solutions.  ( 2 min )
    Learning Task-Oriented Communication for Edge Inference: An Information Bottleneck Approach. (arXiv:2102.04170v3 [eess.SP] UPDATED)
    This paper investigates task-oriented communication for edge inference, where a low-end edge device transmits the extracted feature vector of a local data sample to a powerful edge server for processing. It is critical to encode the data into an informative and compact representation for low-latency inference given the limited bandwidth. We propose a learning-based communication scheme that jointly optimizes feature extraction, source coding, and channel coding in a task-oriented manner, i.e., targeting the downstream inference task rather than data reconstruction. Specifically, we leverage an information bottleneck (IB) framework to formalize a rate-distortion tradeoff between the informativeness of the encoded feature and the inference performance. As the IB optimization is computationally prohibitive for the high-dimensional data, we adopt a variational approximation, namely the variational information bottleneck (VIB), to build a tractable upper bound. To reduce the communication overhead, we leverage a sparsity-inducing distribution as the variational prior for the VIB framework to sparsify the encoded feature vector. Furthermore, considering dynamic channel conditions in practical communication systems, we propose a variable-length feature encoding scheme based on dynamic neural networks to adaptively adjust the activated dimensions of the encoded feature to different channel conditions. Extensive experiments evidence that the proposed task-oriented communication system achieves a better rate-distortion tradeoff than baseline methods and significantly reduces the feature transmission latency in dynamic channel conditions.  ( 2 min )
    Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition. (arXiv:2202.07901v2 [cs.LG] UPDATED)
    Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text data) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information of the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.  ( 2 min )
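The triplet loss described above can be written in a few lines. This is a minimal sketch assuming Euclidean distance and a fixed margin; the anchor is taken to be a time-series embedding, the positive an image embedding with the same label, and the negative an image embedding with a different label. The embedding values and the margin are made up for illustration and are not the paper's settings.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

time_series_embedding = [0.0, 0.0]   # anchor (main modality)
image_same_label = [3.0, 4.0]        # positive (auxiliary modality), distance 5
image_other_label = [6.0, 8.0]       # negative (auxiliary modality), distance 10

satisfied = triplet_loss(time_series_embedding, image_same_label, image_other_label)
violated = triplet_loss(time_series_embedding, image_other_label, image_same_label)
```

In the first call the negative is already far enough away, so the loss vanishes; swapping the roles of the two image embeddings produces a positive loss that would push the embeddings apart during training.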
    Theseus: A Library for Differentiable Nonlinear Optimization. (arXiv:2207.09442v3 [cs.RO] UPDATED)
    We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several example applications that are built using the same underlying differentiable components, such as second-order optimizers, standard cost functions, and Lie groups. For efficiency, Theseus incorporates support for sparse solvers, automatic vectorization, batching, GPU acceleration, and gradient computation with implicit differentiation and direct loss minimization. We perform an extensive performance evaluation on a set of applications, demonstrating significant efficiency gains and better scalability when these features are incorporated. Project page: https://sites.google.com/view/theseus-ai  ( 2 min )
    Non-parametric identifiability and sensitivity analysis of synthetic control models. (arXiv:2301.07656v1 [stat.ME])
    Quantifying cause and effect relationships is an important problem in many domains. The gold standard solution is to conduct a randomised controlled trial. However, in many situations such trials cannot be performed. In the absence of such trials, many methods have been devised to quantify the causal impact of an intervention from observational data given certain assumptions. One widely used method is the synthetic control model. While identifiability of the causal estimand in such models has been obtained from a range of assumptions, it is widely and implicitly assumed that the underlying assumptions are satisfied for all time periods both pre- and post-intervention. This is a strong assumption, as synthetic control models can only be learned in the pre-intervention period. In this paper we address this challenge, and prove identifiability can be obtained without the need for this assumption, by showing it follows from the principle of invariant causal mechanisms. Moreover, for the first time, we formulate and study synthetic control models in Pearl's structural causal model framework. Importantly, we provide a general framework for sensitivity analysis of synthetic control causal inference to violations of the assumptions underlying non-parametric identifiability. We end by providing an empirical demonstration of our sensitivity analysis framework on simulated and real data in the widely-used linear synthetic control framework.  ( 2 min )
    Comprehensive Literature Survey on Deep Learning used in Image Memorability Prediction and Modification. (arXiv:2301.06080v2 [cs.CV] UPDATED)
    As humans, we can remember certain visuals in great detail, and sometimes even after viewing them once. What is even more interesting is that humans tend to remember and forget the same things, suggesting that there might be some general internal characteristics of an image to encode and discard similar types of information. Research suggests that some pictures tend to be memorized more than others. The ability of an image to be remembered by different viewers is one of its intrinsic properties. In visualization and photography, creating memorable images is a difficult task. Hence, to solve the problem, various techniques predict visual memorability and manipulate images' memorability. We present a comprehensive literature survey to assess the deep learning techniques used to predict and modify memorability. In particular, we analyze the use of Convolutional Neural Networks, Recurrent Neural Networks, and Generative Adversarial Networks for image memorability prediction and modification.  ( 2 min )
    Factors other than climate change are currently more important in predicting how well fruit farms are doing financially. (arXiv:2301.07685v1 [cs.LG])
    Machine learning and statistical modeling methods were used to analyze the impact of climate change on the financial wellbeing of fruit farmers in Tunisia and Chile. The analysis was based on face-to-face interviews with 801 farmers. Three research questions were investigated. First, whether climate change impacts had an effect on how well the farm was doing financially. Second, if climate change was not influential, what factors were important for predicting the financial wellbeing of the farm. Third, whether observed effects on the financial wellbeing of the farm were a result of interactions between predictor variables. This is the first report directly comparing climate change with other factors potentially impacting the financial wellbeing of farms. Certain climate change factors, namely increases in temperature and reductions in precipitation, can regionally impact the self-perceived financial wellbeing of fruit farmers. Specifically, increases in temperature and reductions in precipitation can have a measurable negative impact on the financial wellbeing of farms in Chile. This effect is less pronounced in Tunisia. Climate impact differences were observed within Chile but not in Tunisia. However, climate change is only of minor importance for predicting farm financial wellbeing, especially for farms already doing well financially. Factors that are more important, mainly in Tunisia, included trust in information sources and prior farm ownership. Other important factors include farm size, water management systems used and diversity of fruit crops grown. Moreover, some of the important factors identified differed between farms doing and not doing well financially. Interactions between factors may improve or worsen farm financial wellbeing.  ( 2 min )
    An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence. (arXiv:2103.15990v7 [cs.HC] UPDATED)
    With the rapid development of the internet of things (IoT) and artificial intelligence (AI) technologies, human activity recognition (HAR) has been applied in a variety of domains such as security and surveillance, human-robot interaction, and entertainment. Even though a number of surveys and review papers have been published, there is a lack of HAR overview papers focusing on healthcare applications that use wearable sensors. Therefore, we fill in the gap by presenting this overview paper. In particular, we present our projects to illustrate the system design of HAR applications for healthcare. Our projects include early mobility identification of human activities for intensive care unit (ICU) patients and gait analysis of Duchenne muscular dystrophy (DMD) patients. We cover essential components of designing HAR systems including sensor factors (e.g., type, number, and placement location), AI model selection (e.g., classical machine learning models versus deep learning models), and feature engineering. In addition, we highlight the challenges of such healthcare-oriented HAR systems and propose several research opportunities for both the medical and the computer science community.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how the global model can be personalized for each such device individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
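The relationship between Federated Averaging and personalization can be illustrated on a toy problem. The sketch below assumes a one-parameter linear model y = w * x, two clients whose data follow different slopes, full-batch gradient steps on each client, and a server that averages client weights in proportion to their data size. The function names, learning rate, and data are hypothetical and far simpler than anything in the paper; personalization here is just a few extra local steps starting from the global model.

```python
def local_update(w, data, lr=0.1, steps=5):
    """Run a few full-batch gradient steps on one client's mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_averaging(clients, rounds=20, w=0.0):
    """Each round: clients train locally, the server averages weighted by data size."""
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in clients]
        w = sum(len(data) * wi for data, wi in zip(clients, local_weights)) / total
    return w

clients = [
    [(1.0, 1.0), (2.0, 2.0)],   # client 0: data follow slope 1
    [(1.0, 3.0), (2.0, 6.0)],   # client 1: data follow slope 3
]

w_global = federated_averaging(clients)              # compromise between the clients
w_personal = local_update(w_global, clients[1])      # personalize for client 1
```

The local steps inside each round play the role of the inner adaptation loop in meta learning, and the server average plays the role of the outer update, which is the sense in which the two settings resemble each other.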
    Creating awareness about security and safety on highways to mitigate wildlife-vehicle collisions by detecting and recognizing wildlife fences using deep learning and drone technology. (arXiv:2301.07174v1 [cs.CV])
    In South Africa, it is a common practice for people to leave their vehicles beside the road when traveling long distances for a short comfort break. This practice might increase human encounters with wildlife, threatening their security and safety. Here we intend to create awareness about wildlife fencing, using drone technology and computer vision algorithms to recognize and detect wildlife fences and associated features. We collected data at Amakhala and Lalibela private game reserves in the Eastern Cape, South Africa. We used wildlife electric fence data containing single and double fences for the classification task. Additionally, we used aerial and still annotated images extracted from the drone and still cameras for the segmentation and detection tasks. The model training results from the drone camera outperformed those from the still camera. Generally, poor model performance is attributed to (1) over-decompression of images and (2) the ability of drone cameras to capture more details on images for the machine learning model to learn as compared to still cameras that capture only the front view of the wildlife fence. We argue that our model can be deployed on client-edge devices to inform people about the presence and significance of wildlife fencing, which minimizes human encounters with wildlife, thereby mitigating wildlife-vehicle collisions.  ( 2 min )
    Enhancing Self-Training Methods. (arXiv:2301.07294v1 [cs.LG])
    Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias" that occurs when the student model repeatedly overfits to incorrect pseudo-labels given by the teacher model for the unlabeled data. This bias impedes improvements in pseudo-label accuracy across self-training iterations, leading to unwanted saturation in model performance after just a few iterations. In this work, we describe multiple enhancements to improve the self-training pipeline to mitigate the effect of confirmation bias. We evaluate our enhancements over multiple datasets showing performance gains over existing self-training design choices. Finally, we also study the extendability of our enhanced approach to Open Set unlabeled data (containing classes not seen in labeled data).  ( 2 min )
    Safety Verification of Neural Network Control Systems Using Guaranteed Neural Network Model Reduction. (arXiv:2301.07531v1 [cs.LG])
    This paper aims to enhance the computational efficiency of safety verification of neural network control systems by developing a guaranteed neural network model reduction method. First, a concept of model reduction precision is proposed to describe the guaranteed distance between the outputs of a neural network and its reduced-size version. A reachability-based algorithm is proposed to accurately compute the model reduction precision. Then, by substituting a reduced-size neural network controller into the closed-loop system, an algorithm to compute the reachable set of the original system is developed, which is able to support much more computationally efficient safety verification processes. Finally, the developed methods are applied to a case study of the Adaptive Cruise Control system with a neural network controller, which is shown to significantly reduce the computational time of safety verification and thus validate the effectiveness of the method.  ( 2 min )
    Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams. (arXiv:2202.08312v3 [cs.LG] UPDATED)
    Motivated by recent applications requiring differential privacy over adaptive streams, we investigate the question of optimal instantiations of the matrix mechanism in this setting. We prove fundamental theoretical results on the applicability of matrix factorizations to adaptive streams, and provide a parameter-free fixed-point algorithm for computing optimal factorizations. We instantiate this framework with respect to concrete matrices which arise naturally in machine learning, and train user-level differentially private models with the resulting optimal mechanisms, yielding significant improvements in a notable problem in federated learning with user-level differential privacy.  ( 2 min )
    Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v3 [cs.CV] UPDATED)
    We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provided comparable results to conventional methods specifically tailored for quantification of liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.  ( 2 min )
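The center-loss term mentioned above can be sketched in its commonly used form: half the squared distance between each embedding and the center of its class. The abstract does not say how the authors define centers or normalize the term, so the per-label centers, the batch averaging, and the function name `center_loss` are assumptions for illustration only.

```python
def center_loss(embeddings, labels, centers):
    """Mean over the batch of 0.5 * squared Euclidean distance between each
    embedding and the center assigned to its label (assumed formulation)."""
    total = 0.0
    for z, y in zip(embeddings, labels):
        c = centers[y]
        total += sum((zi - ci) ** 2 for zi, ci in zip(z, c))
    return 0.5 * total / len(embeddings)


# Hypothetical 2-D embeddings of healthy-tissue samples sharing one label.
centers = {0: [2.0, 0.0]}
spread_out = [[1.0, 0.0], [3.0, 0.0]]
compact = [[1.9, 0.0], [2.1, 0.0]]

loss_spread = center_loss(spread_out, [0, 0], centers)
loss_compact = center_loss(compact, [0, 0], centers)
```

Adding such a term to the training loss penalizes embeddings that sit far from their center, which is what makes the learned representations compact.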
    Universal Neural-Cracking-Machines: Self-Configurable Password Models from Auxiliary Data. (arXiv:2301.07628v1 [cs.CR])
    We develop the first universal password model -- a password model that, once pre-trained, can automatically adapt to any password distribution. To achieve this result, the model does not need to access any plaintext passwords from the target set. Instead, it exploits users' auxiliary information, such as email addresses, as a proxy signal to predict the underlying target password distribution. The model uses deep learning to capture the correlation between the auxiliary data of a group of users (e.g., users of a web application) and their passwords. It then exploits those patterns to create a tailored password model for the target community at inference time. No further training steps, targeted data collection, or prior knowledge of the community's password distribution is required. Besides defining a new state-of-the-art for password strength estimation, our model enables any end-user (e.g., system administrators) to autonomously generate tailored password models for their systems without the often unworkable requirement of collecting suitable training data and fitting the underlying password model. Ultimately, our framework enables the democratization of well-calibrated password models to the community, addressing a major challenge in the deployment of password security solutions on a large scale.  ( 2 min )
    Feature Alignment as a Generative Process. (arXiv:2106.12562v2 [cs.LG] UPDATED)
    Reversibility in artificial neural networks allows us to retrieve the input given an output. We present feature alignment, a method for approximating reversibility in arbitrary neural networks. We train a network by minimizing the distance between the output of a data point and the random output with respect to a random input. We applied the technique to the MNIST, CIFAR-10, CelebA and STL-10 image datasets. We demonstrate that this method can roughly recover images from just their latent representation without the need of a decoder. By utilizing the formulation of variational autoencoders, we demonstrate that it is possible to produce new images that are statistically comparable to the training data. Furthermore, we demonstrate that the quality of the images can be improved by coupling a generator and a discriminator together. In addition, we show how this method, with a few minor modifications, can be used to train networks locally, which has the potential to save computational memory resources.  ( 2 min )
    Cross-Domain Evaluation of a Deep Learning-Based Type Inference System. (arXiv:2208.09189v2 [cs.SE] UPDATED)
    Optional type annotations allow for enriching dynamic programming languages with static typing features like better Integrated Development Environment (IDE) support, more precise program analysis, and early detection and prevention of type-related runtime errors. Machine learning-based type inference promises interesting results for automating this task. However, the practical usage of such systems depends on their ability to generalize across different domains, as they are often applied outside their training domain. In this work, we investigate Type4Py as a representative of state-of-the-art deep learning-based type inference systems, by conducting extensive cross-domain experiments. Thereby, we address the following problems: class imbalances, out-of-vocabulary words, dataset shifts, and unknown classes. To perform such experiments, we use the datasets ManyTypes4Py and CrossDomainTypes4Py. The latter we introduce in this paper. Our dataset enables the evaluation of type inference systems in different domains of software projects and has over 1,000,000 type annotations mined on the platforms GitHub and Libraries. It consists of data from the two domains web development and scientific calculation. Through our experiments, we detect that the shifts in the dataset and the long-tailed distribution with many rare and unknown data types decrease the performance of the deep learning-based type inference system drastically. In this context, we test unsupervised domain adaptation methods and fine-tuning to overcome these issues. Moreover, we investigate the impact of out-of-vocabulary words.  ( 2 min )
    InstructPix2Pix: Learning to Follow Image Editing Instructions. (arXiv:2211.09800v2 [cs.CV] UPDATED)
    We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models -- a language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity tells an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution with expected robust loss $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms. (arXiv:2301.07665v1 [cs.SD])
    In audio processing applications, the generation of expressive sounds based on high-level representations is in high demand. These representations can be used to manipulate the timbre and influence the synthesis of creative instrumental notes. Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument timbre compression. Unsupervised deep learning methods can achieve audio compression by training the network to learn a mapping from waveforms or spectrograms to low-dimensional representations. This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch. Further exploration of hyper-parameters and regularization techniques is demonstrated to enhance the performance of the initial design. In an unsupervised manner, the network is able to reconstruct a monophonic and harmonic sound based on latent representations. In addition, we introduce an evaluation metric to measure the similarity between the original and reconstructed samples. Evaluating a deep generative model for the synthesis of sound is a challenging task. Our approach is based on the accuracy of the generated frequencies, as it is a significant metric for the perception of harmonic sounds. This work is expected to accelerate future experiments on audio compression using neural autoencoders.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a both convergent and efficient tool, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods, when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Concrete Score Matching: Generalized Score Matching for Discrete Data. (arXiv:2211.00802v2 [cs.LG] UPDATED)
    Representing probability distributions by the gradient of their density functions has proven effective in modeling a wide range of continuous data modalities. However, this representation is not applicable in discrete domains where the gradient is undefined. To this end, we propose an analogous score function called the "Concrete score", a generalization of the (Stein) score for discrete settings. Given a predefined neighborhood structure, the Concrete score of any input is defined by the rate of change of the probabilities with respect to local directional changes of the input. This formulation allows us to recover the (Stein) score in continuous domains when measuring such changes by the Euclidean distance, while using the Manhattan distance leads to our novel score function in discrete domains. Finally, we introduce a new framework to learn such scores from samples called Concrete Score Matching (CSM), and propose an efficient training objective to scale our approach to high dimensions. Empirically, we demonstrate the efficacy of CSM on density estimation tasks on a mixture of synthetic, tabular, and high-dimensional image datasets, and demonstrate that it performs favorably relative to existing baselines for modeling discrete data.  ( 2 min )
    Multimodal learning with graphs. (arXiv:2209.03299v5 [cs.LG] UPDATED)
    Artificial intelligence for graphs has achieved remarkable success in modeling complex systems, ranging from dynamic networks in biology to interacting particle systems in physics. However, the increasingly heterogeneous graph datasets call for multimodal methods that can combine different inductive biases: the set of assumptions that algorithms use to make predictions for inputs they have not encountered during training. Learning on multimodal datasets presents fundamental challenges because the inductive biases can vary by data modality and graphs might not be explicitly given in the input. To address these challenges, multimodal graph AI methods combine different modalities while leveraging cross-modal dependencies using graphs. Diverse datasets are combined using graphs and fed into sophisticated multimodal architectures, specified as image-intensive, knowledge-grounded and language-intensive models. Using this categorization, we introduce a blueprint for multimodal graph learning, use it to study existing methods and provide guidelines to design new models.  ( 2 min )
    Optimization-based Block Coordinate Gradient Coding for Mitigating Partial Stragglers in Distributed Learning. (arXiv:2206.02450v2 [cs.IT] UPDATED)
    Gradient coding schemes effectively mitigate full stragglers in distributed learning by introducing identical redundancy in coded local partial derivatives corresponding to all model parameters. However, they are no longer effective for partial stragglers as they cannot utilize incomplete computation results from partial stragglers. This paper aims to design a new gradient coding scheme for mitigating partial stragglers in distributed learning. Specifically, we consider a distributed system consisting of one master and N workers, characterized by a general partial straggler model, and focus on solving a general large-scale machine learning problem with L model parameters using gradient coding. First, we propose a coordinate gradient coding scheme with L coding parameters representing L possibly different diversities for the L coordinates, which generates most gradient coding schemes. Then, we consider the minimization of the expected overall runtime and the maximization of the completion probability with respect to the L coding parameters for coordinates, which are challenging discrete optimization problems. To reduce computational complexity, we first transform each to an equivalent but much simpler discrete problem with $N \ll L$ variables representing the partition of the L coordinates into N blocks, each with identical redundancy. This indicates an equivalent but more easily implemented block coordinate gradient coding scheme with N coding parameters for blocks. Then, we adopt continuous relaxation to further reduce computational complexity. For the resulting minimization of expected overall runtime, we develop an iterative algorithm of computational complexity O(N^2) to obtain an optimal solution and derive two closed-form approximate solutions both with computational complexity O(N). For the resultant maximization of the completion probability, we develop an iterative algorithm of...  ( 3 min )
    Hybrid quantum-classical convolutional neural networks to improve molecular protein binding affinity predictions. (arXiv:2301.06331v2 [quant-ph] UPDATED)
    One of the main challenges in drug discovery is to find molecules that bind specifically and strongly to their target protein while having minimal binding to other proteins. By predicting binding affinity, it is possible to identify the most promising candidates from a large pool of potential compounds, reducing the number of compounds that need to be tested experimentally. Recently, deep learning methods have outperformed traditional computational methods at making accurate predictions on large datasets. However, the complexity and time-consuming nature of these methods have limited their usage and development. Quantum machine learning is an emerging technology that has the potential to improve many classical machine learning algorithms. In this work we present a hybrid quantum-classical convolutional neural network, which is able to reduce the complexity of the classical network by 20% while maintaining optimal performance in the predictions. Additionally, it yields significant time savings of up to 40% in the training process, which means a meaningful speed-up of the drug discovery process.  ( 2 min )
    Consistent Non-Parametric Methods for Maximizing Robustness. (arXiv:2102.09086v3 [cs.LG] UPDATED)
    Learning classifiers that are robust to adversarial examples has received a great deal of recent attention. A major drawback of the standard robust learning framework is that there is an artificial robustness radius $r$ that applies to all inputs. This ignores the fact that data may be highly heterogeneous, in which case it is plausible that robustness regions should be larger in some regions of data, and smaller in others. In this paper, we address this limitation by proposing a new limit classifier, called the neighborhood optimal classifier, that extends the Bayes optimal classifier outside its support by using the label of the closest in-support point. We then argue that this classifier maximizes the size of its robustness regions subject to the constraint of having accuracy equal to the Bayes optimal. We then present sufficient conditions under which general non-parametric methods that can be represented as weight functions converge towards this limit, and show that both nearest neighbors and kernel classifiers satisfy them under certain conditions.  ( 2 min )
    ReFresh: Reducing Memory Access from Exploiting Stable Historical Embeddings for Graph Neural Network Training. (arXiv:2301.07482v1 [cs.LG])
    A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to facilitate the storage of these features on alternative devices with slower access (e.g. CPU memory). Moreover, the irregularity of graph structures contributes to poor data locality which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur a significant accuracy degradation because of the inevitable shortcuts involved. To address these limitations, we instead propose ReFresh, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them through fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings which are relatively stable and can be cached, from those that need to be re-computed to reduce estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, ReFresh is able to accelerate the training speed on large graph datasets such as ogbn-papers100M and MAG240M by 4.6x up to 23.6x and reduce the memory access by 64.5% (85.7% higher than a raw feature cache), with less than 1% influence on test accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    What relations are reliably embeddable in Euclidean space?. (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Neural Network Quantization for Efficient Inference: A Survey. (arXiv:2112.06126v2 [cs.LG] UPDATED)
    As neural networks have become more powerful, there has been a rising desire to deploy them in the real world; however, the power and accuracy of neural networks is largely due to their depth and complexity, making them difficult to deploy, especially in resource-constrained devices. Neural network quantization has recently arisen to meet this demand of reducing the size and complexity of neural networks by reducing the precision of a network. With smaller and simpler networks, it becomes possible to run neural networks within the constraints of their target hardware. This paper surveys the many neural network quantization techniques that have been developed in the last decade. Based on this survey and comparison of neural network quantization techniques, we propose future directions of research in the area.  ( 2 min )
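To make "reducing the precision of a network" concrete, here is a minimal sketch of uniform affine quantization, one widely used scheme in which floating-point values are mapped to unsigned integers through a scale and a zero point. It is not taken from the survey; the function names, the 8-bit default, and the sample weights are assumptions for this example.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers in [0, 2**num_bits - 1] using a scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range so that 0.0 is always representable.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [scale * (q - zero_point) for q in quantized]


# Hypothetical weights from a single layer.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
```

Each weight is stored in one byte instead of four (for 32-bit floats), at the cost of a reconstruction error on the order of the scale.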
    Prediction of Red Wine Quality Using One-dimensional Convolutional Neural Networks. (arXiv:2208.14008v2 [cs.LG] UPDATED)
    As an alcoholic beverage, wine has remained prevalent for thousands of years, and the quality assessment of wines has been significant in wine production and trade. Scholars have proposed various deep learning and machine learning algorithms for wine quality prediction, such as Support vector machine (SVM), Random Forest (RF), K-nearest neighbors (KNN), Deep neural network (DNN), and Logistic regression (LR). However, these methods ignore the inner relationship between the physical and chemical properties of the wine, for example, the correlations between pH values, fixed acidity, citric acid, and so on. To fill the gap, this paper conducts Pearson correlation analysis, PCA, and the Shapiro-Wilk test on those properties and incorporates a 1D-CNN architecture to capture the correlations among neighboring features. In addition, it implements dropout and batch normalization techniques to improve the robustness of the proposed model. Extensive experiments show that our method can outperform baseline approaches in wine quality prediction. Moreover, ablation experiments demonstrate the effectiveness of incorporating the 1D-CNN module, dropout, and normalization techniques.  ( 2 min )
    Performance-Preserving Event Log Sampling for Predictive Monitoring. (arXiv:2301.07624v1 [cs.LG])
    Predictive process monitoring is a subfield of process mining that aims to estimate case or event features for running process instances. Such predictions are of significant interest to the process stakeholders. However, most of the state-of-the-art methods for predictive monitoring require the training of complex machine learning models, which is often inefficient. Moreover, most of these methods require a hyper-parameter optimization that requires several repetitions of the training process which is not feasible in many real-life applications. In this paper, we propose an instance selection procedure that allows sampling training process instances for prediction models. We show that our instance selection procedure allows for a significant increase of training speed for next activity and remaining time prediction methods while maintaining reliable levels of prediction accuracy.  ( 2 min )
    Concentration of polynomial random matrices via Efron-Stein inequalities. (arXiv:2209.02655v2 [cs.CC] UPDATED)
    Analyzing concentration of large random matrices is a common task in a wide variety of fields. Given independent random variables, many tools are available to analyze random matrices whose entries are linear in the variables, e.g. the matrix-Bernstein inequality. However, in many applications, we need to analyze random matrices whose entries are polynomials in the variables. These arise naturally in the analysis of spectral algorithms, e.g., Hopkins et al. [STOC 2016], Moitra-Wein [STOC 2019]; and in lower bounds for semidefinite programs based on the Sum of Squares hierarchy, e.g. Barak et al. [FOCS 2016], Jones et al. [FOCS 2021]. In this work, we present a general framework to obtain such bounds, based on the matrix Efron-Stein inequalities developed by Paulin-Mackey-Tropp [Annals of Probability 2016]. The Efron-Stein inequality bounds the norm of a random matrix by the norm of another simpler (but still random) matrix, which we view as arising by "differentiating" the starting matrix. By recursively differentiating, our framework reduces the main task to analyzing far simpler matrices. For Rademacher variables, these simpler matrices are in fact deterministic and hence, analyzing them is far easier. For general non-Rademacher variables, the task reduces to scalar concentration, which is much easier. Moreover, in the setting of polynomial matrices, our results generalize the work of Paulin-Mackey-Tropp. Using our basic framework, we recover known bounds in the literature for simple "tensor networks" and "dense graph matrices". Using our general framework, we derive bounds for "sparse graph matrices", which were obtained only recently by Jones et al. [FOCS 2021] using a nontrivial application of the trace power method, and were a core component in their work. We expect our framework to be helpful for other applications involving concentration phenomena for nonlinear random matrices.  ( 3 min )
    Multimodal Side-Tuning for Document Classification. (arXiv:2301.07502v1 [cs.LG])
    In this paper, we propose to exploit the side-tuning framework for multimodal document classification. Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches. Thanks to this technique it is actually possible to overcome model rigidity and catastrophic forgetting of transfer learning by fine-tuning. The proposed solution uses off-the-shelf deep learning architectures leveraging the side-tuning framework to combine a base model with a tandem of two side networks. We show that side-tuning can be successfully employed also when different data sources are considered, e.g. text and images in document classification. The experimental results show that this approach pushes further the limit for document classification accuracy with respect to the state of the art.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    Targeted Image Reconstruction by Sampling Pre-trained Diffusion Model. (arXiv:2301.07557v1 [cs.LG])
    A trained neural network model contains information on the training data. Given such a model, malicious parties can leverage the "knowledge" in this model and design ways to print out any usable information (known as a model inversion attack). Therefore, it is valuable to explore ways to conduct such an attack and demonstrate its severity. In this work, we propose ways to generate a data point of the target class without prior knowledge of the exact target distribution by using a pre-trained diffusion model.  ( 2 min )
    A Novel, Scale-Invariant, Differentiable, Efficient, Scalable Regularizer. (arXiv:2301.07285v1 [cs.LG])
    $L_{p}$-norm regularization schemes such as $L_{0}$, $L_{1}$, and $L_{2}$-norm regularization and $L_{p}$-norm-based regularization techniques such as weight decay and group LASSO compute a quantity which depends on model weights considered in isolation from one another. This paper describes a novel regularizer which is not based on an $L_{p}$-norm. In contrast with $L_{p}$-norm-based regularization, this regularizer is concerned with the spatial arrangement of weights within a weight matrix. This regularizer is an additive term for the loss function and is differentiable, simple and fast to compute, scale-invariant, requires a trivial amount of additional memory, and can easily be parallelized. Empirically this method yields approximately a one order-of-magnitude improvement in the number of nonzero model parameters at a given level of accuracy.  ( 2 min )
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    Failure Tolerant Training with Persistent Memory Disaggregation over CXL. (arXiv:2301.07492v1 [cs.AR])
    This paper proposes TRAININGCXL that can efficiently process large-scale recommendation datasets in the pool of disaggregated memory while making training fault tolerant with low overhead. To this end, i) we integrate persistent memory (PMEM) and GPU into a cache-coherent domain as Type-2. Enabling CXL allows PMEM to be directly placed in GPU's memory hierarchy, such that GPU can access PMEM without software intervention. TRAININGCXL introduces computing and checkpointing logic near the CXL controller, thereby training data and managing persistency in an active manner. Considering PMEM's vulnerability, ii) we utilize the unique characteristics of recommendation models and take the checkpointing overhead off the critical path of their training. Lastly, iii) TRAININGCXL employs an advanced checkpointing technique that relaxes the updating sequence of model parameters and embeddings across training batches. The evaluation shows that TRAININGCXL achieves 5.2x training performance improvement and 76% energy savings, compared to the modern PMEM-based recommendation systems.  ( 2 min )
    Compression of GPS Trajectories using Autoencoders. (arXiv:2301.07420v1 [cs.LG])
    The ubiquitous availability of mobile devices capable of location tracking has led to a significant rise in the collection of GPS data. Several compression methods have been developed in order to reduce the amount of storage needed while keeping the important information. In this paper, we present an LSTM-autoencoder-based approach to compress and reconstruct GPS trajectories, which is evaluated on both a gaming and a real-world dataset. We consider various compression ratios and trajectory lengths. The performance is compared to other trajectory compression algorithms, i.e., Douglas-Peucker. Overall, the results indicate that our approach outperforms Douglas-Peucker significantly in terms of the discrete Fréchet distance and dynamic time warping. Furthermore, by lossily reconstructing every point, the proposed methodology offers multiple advantages over traditional methods.  ( 2 min )
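    For readers unfamiliar with the baseline named above, the following is a minimal sketch of the classic Douglas-Peucker line simplification in Python. It is not the paper's code; the function names, the 2D tuple representation of points, and the tolerance parameter `epsilon` are assumptions made for illustration.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (to a itself if a == b)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parallelogram area divided by the base length gives the height.
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def douglas_peucker(points, epsilon):
    """Return a subset of `points` that keeps the endpoints and every
    point farther than `epsilon` from the current simplifying chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist > epsilon:
        # Split at the farthest point and simplify both halves recursively.
        left = douglas_peucker(points[: index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right  # drop the duplicated split point
    return [first, last]


if __name__ == "__main__":
    track = [(0, 0), (1, 0.01), (2, 0), (3, 3), (4, 0)]
    print(douglas_peucker(track, 0.1))
```

    Unlike the autoencoder approach described in the abstract, this baseline only selects a subset of the original points, so the compression ratio is controlled indirectly through `epsilon`.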
    No-substitution k-means Clustering with Adversarial Order. (arXiv:2012.14512v2 [cs.DS] UPDATED)
    We investigate $k$-means clustering in the online no-substitution setting when the input arrives in arbitrary order. In this setting, points arrive one after another, and the algorithm is required to instantly decide whether to take the current point as a center before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the $k$-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive -- it does not include natural input generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new random algorithm and prove that if applied on data with complexity $d$, the algorithm takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also prove that if the data is sampled from a "natural" distribution, such as a mixture of $k$ Gaussians, then the new complexity measure is equal to $O(k^2\log(n))$. This implies that for data generated from those distributions, our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation. In terms of negative results, we prove that the number of centers needed to achieve an $\alpha$-approximation is at least $\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.  ( 2 min )
    Quantum-inspired tensor network for Earth science. (arXiv:2301.07528v1 [physics.geo-ph])
    Deep Learning (DL) is one of many successful methodologies to extract informative patterns and insights from ever-increasing noisy large-scale datasets (in our case, satellite images). However, DL models consist of a few thousand to millions of training parameters, and these training parameters require a tremendous amount of electrical power for extracting informative patterns from noisy large-scale datasets (i.e., they are computationally expensive). Hence, we employ a quantum-inspired tensor network for compressing trainable parameters of physics-informed neural networks (PINNs) in Earth science. PINNs are DL models penalized by enforcing the law of physics; in particular, the law of physics is embedded in DL models. In addition, we apply tensor decomposition to HyperSpectral Images (HSIs) to improve their spectral resolution. A quantum-inspired tensor network is also the native formulation to efficiently represent and train quantum machine learning models on big datasets on GPU tensor cores. Furthermore, the key contribution of this paper is twofold: (I) we reduced the number of trainable parameters of PINNs by using a quantum-inspired tensor network, and (II) we improved the spectral resolution of remotely-sensed images by employing tensor decomposition. As a benchmark PDE, we solved Burgers' equation. As practical satellite data, we employed HSIs of Indian Pine, USA and of Pavia University, Italy.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows the analysis of the critical points of high-dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at a fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows one to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    A Robust Classification Framework for Byzantine-Resilient Stochastic Gradient Descent. (arXiv:2301.07498v1 [cs.LG])
    This paper proposes a Robust Gradient Classification Framework (RGCF) for Byzantine fault tolerance in distributed stochastic gradient descent. The framework consists of a pattern recognition filter which we train to be able to classify individual gradients as Byzantine by using their direction alone. This filter is robust to an arbitrary number of Byzantine workers for convex as well as non-convex optimisation settings, which is a significant improvement on the prior work that is robust to Byzantine faults only when up to 50% of the workers are Byzantine. This solution does not require an estimate of the number of Byzantine workers; its running time is not dependent on the number of workers and can scale up to training instances with a large number of workers without a loss in performance. We validate our solution by training convolutional neural networks on the MNIST dataset in the presence of Byzantine workers.  ( 2 min )
    Autonomous Slalom Maneuver Based on Expert Drivers' Behavior Using Convolutional Neural Network. (arXiv:2301.07424v1 [cs.RO])
    Lane changing and obstacle avoidance are among the most important tasks in automated cars. To date, many algorithms have been suggested that are generally based on path trajectory or reinforcement learning approaches. Although these methods have been efficient, they are not able to accurately imitate a smooth path traveled by an expert driver. In this paper, a method is presented to mimic drivers' behavior using a convolutional neural network (CNN). First, seven features are extracted from a dataset gathered from four expert drivers in a driving simulator. Then, these features are converted from 1D arrays to 2D arrays and injected into a CNN. The CNN model computes the desired steering wheel angle and sends it to an adaptive PD controller. Finally, the control unit applies proper torque to the steering wheel. Results show that the CNN model can mimic the drivers' behavior with an R-squared of 0.83. Also, the performance of the presented method was evaluated in the driving simulator for 17 trials, in all of which it avoided the traffic cones successfully. In some trials, the presented method performed a smoother maneuver compared to the expert drivers.  ( 2 min )
    A Survey of Advanced Computer Vision Techniques for Sports. (arXiv:2301.07583v1 [cs.CV])
    Computer Vision developments are enabling significant advances in many fields, including sports. Many applications built on top of Computer Vision technologies, such as tracking data, are nowadays essential for every top-level analyst, coach, and even player. In this paper, we survey Computer Vision techniques that can help many sports-related studies gather vast amounts of data, such as Object Detection and Pose Estimation. We provide a use case for such data: building a model for shot speed estimation with pose data obtained using only Computer Vision models. Our model achieves a correlation of 67%. The possibility of estimating shot speeds enables much deeper studies and the creation of new metrics and recommendation systems that will help athletes improve their performance, in any sport. The proposed methodology is easily replicable for many technical movements and is only limited by the availability of video data.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Adaptively Integrated Knowledge Distillation and Prediction Uncertainty for Continual Learning. (arXiv:2301.07316v1 [cs.CV])
    Current deep learning models often suffer from catastrophic forgetting of old knowledge when continually learning new knowledge. Existing strategies to alleviate this issue often fix the trade-off between keeping old knowledge (stability) and learning new knowledge (plasticity). However, the stability-plasticity trade-off during continual learning may need to be dynamically changed for better model performance. In this paper, we propose two novel ways to adaptively balance model stability and plasticity. The first one is to adaptively integrate multiple levels of old knowledge and transfer it to each block level in the new model. The second one uses prediction uncertainty of old knowledge to naturally tune the importance of learning new knowledge during model training. To the best of our knowledge, this is the first work to connect model prediction uncertainty and knowledge distillation for continual learning. In addition, this paper applies a modified CutMix particularly to augment the data for old knowledge, further alleviating the catastrophic forgetting issue. Extensive evaluations on the CIFAR100 and the ImageNet datasets confirmed the effectiveness of the proposed method for continual learning.  ( 2 min )
    AutoFraudNet: A Multimodal Network to Detect Fraud in the Auto Insurance Industry. (arXiv:2301.07526v1 [cs.LG])
    In the insurance industry, detecting fraudulent claims is a critical task with a significant financial impact. A common strategy to identify fraudulent claims is looking for inconsistencies in the supporting evidence. However, this is a laborious and cognitively heavy task for human experts as insurance claims typically come with a plethora of data from different modalities (e.g. images, text and metadata). To overcome this challenge, the research community has focused on multimodal machine learning frameworks that can efficiently reason through multiple data sources. Despite recent advances in multimodal learning, these frameworks still suffer from (i) challenges of joint-training caused by the different characteristics of different modalities and (ii) overfitting tendencies due to high model complexity. In this work, we address these challenges by introducing a multimodal reasoning framework, AutoFraudNet (Automobile Insurance Fraud Detection Network), for detecting fraudulent auto-insurance claims. AutoFraudNet utilizes a cascaded slow fusion framework and state-of-the-art fusion block, BLOCK Tucker, to alleviate the challenges of joint-training. Furthermore, it incorporates a light-weight architectural design along with additional losses to prevent overfitting. Through extensive experiments conducted on a real-world dataset, we demonstrate: (i) the merits of multimodal approaches, when compared to unimodal and bimodal methods, and (ii) the effectiveness of AutoFraudNet in fusing various modalities to boost performance (over 3% in PR AUC).  ( 2 min )
    Training Semantic Segmentation on Heterogeneous Datasets. (arXiv:2301.07634v1 [cs.CV])
    We explore semantic segmentation beyond the conventional, single-dataset homogeneous training and bring forward the problem of Heterogeneous Training of Semantic Segmentation (HTSS). HTSS involves simultaneous training on multiple heterogeneous datasets, i.e. datasets with conflicting label spaces and different (weak) annotation types from the perspective of semantic segmentation. The HTSS formulation exposes deep networks to a larger and previously unexplored aggregation of information that can potentially enhance semantic segmentation in three directions: i) performance: increased segmentation metrics on seen datasets, ii) generalization: improved segmentation metrics on unseen datasets, and iii) knowledgeability: increased number of recognizable semantic concepts. To research these benefits of HTSS, we propose a unified framework that incorporates heterogeneous datasets in a single-network training pipeline following the established FCN standard. Our framework first curates heterogeneous datasets to bring them into a common format and then trains a single-backbone FCN on all of them simultaneously. To achieve this, it transforms weak annotations, which are incompatible with semantic segmentation, to per-pixel labels, and hierarchizes their label spaces into a universal taxonomy. The trained HTSS models demonstrate performance and generalization gains over a wide range of datasets and extend the inference label space to hundreds of semantic classes.  ( 2 min )
    Curvilinear object segmentation in medical images based on ODoS filter and deep learning network. (arXiv:2301.07475v1 [eess.IV])
    Automatic segmentation of curvilinear objects in medical images plays an important role in the diagnosis and evaluation of human diseases, yet it remains a challenging task due to issues such as varying image appearance, low contrast between curvilinear objects and their surrounding backgrounds, thin and uneven curvilinear structures, and improper background illumination. To overcome these challenges, we present a unique curvilinear structure segmentation framework based on the oriented derivative of stick (ODoS) filter and a deep learning network for curvilinear object segmentation in medical images. Currently, a large number of deep learning models emphasize developing deep architectures and ignore capturing the structural features of curvilinear objects, which may lead to unsatisfactory results. Consequently, a new approach that incorporates the ODoS filter as part of a deep learning network is presented to improve the spatial attention of curvilinear objects. In this approach, the original image is considered the principal part to describe various image appearance and complex background illumination, a multi-step strategy is used to enhance contrast between curvilinear objects and their surrounding backgrounds, and a vector field is applied to discriminate thin and uneven curvilinear structures. Subsequently, a deep learning framework is employed to extract various structural features for curvilinear object segmentation in medical images. The performance of the computational model was validated in experiments with the publicly available DRIVE, STARE and CHASEDB1 datasets. Experimental results indicate that the presented model has yielded surprising results compared with some state-of-the-art methods.  ( 2 min )
    Model-free machine learning of conservation laws from data. (arXiv:2301.07503v1 [cs.LG])
    We present a machine-learning-based method for learning first integrals of systems of ordinary differential equations from given trajectory data. The method is model-free in that it does not require explicit knowledge of the underlying system of differential equations that generated the trajectories. As a by-product, once the first integrals have been learned, the system of differential equations will also be known. We illustrate our method by considering several classical problems from the mathematical sciences.  ( 2 min )
    Machine learning techniques for the Schizophrenia diagnosis: A comprehensive review and future research directions. (arXiv:2301.07496v1 [cs.LG])
    Schizophrenia (SCZ) is a brain disorder where different people experience different symptoms, such as hallucination, delusion, flat-talk, disorganized thinking, etc. In the long term, this can cause severe effects and diminish life expectancy by more than ten years. Therefore, early and accurate diagnosis of SCZ is important, and modalities like structural magnetic resonance imaging (sMRI), functional MRI (fMRI), diffusion tensor imaging (DTI), and electroencephalogram (EEG) assist in observing the brain abnormalities of the patients. Moreover, for accurate diagnosis of SCZ, researchers have used machine learning (ML) algorithms for the past decade to distinguish the brain patterns of healthy and SCZ brains using MRI and fMRI images. This paper seeks to acquaint SCZ researchers with ML and to discuss its recent applications to the field of SCZ study. This paper comprehensively reviews state-of-the-art techniques such as ML classifiers, artificial neural networks (ANNs), deep learning (DL) models, methodological fundamentals, and applications with previous studies. The motivation of this paper is to find the research gaps that may lead to the development of a new model for accurate SCZ diagnosis. The paper concludes with the research findings, followed by the future scope that directly contributes to new research directions.  ( 2 min )
    Local Learning with Neuron Groups. (arXiv:2301.07635v1 [cs.LG])
    Traditional deep network training methods optimize a monolithic objective function jointly for all the components. This can lead to various inefficiencies in terms of potential parallelization. Local learning is an approach to model-parallelism that removes the standard end-to-end learning setup and utilizes local objective functions to permit parallel learning amongst model components in a deep network. Recent works have demonstrated that variants of local learning can lead to efficient training of modern deep networks. However, in terms of how much computation can be distributed, these approaches are typically limited by the number of layers in a network. In this work we propose to study how local learning can be applied at the level of splitting layers or modules into sub-components, adding a notion of width-wise modularity to the existing depth-wise modularity associated with local learning. We investigate local-learning penalties that permit such models to be trained efficiently. Our experiments on the CIFAR-10, CIFAR-100, and Imagenet32 datasets demonstrate that introducing width-level modularity can lead to computational advantages over existing methods based on local learning and opens new opportunities for improved model-parallel distributed training. Code is available at: https://github.com/adeetyapatel12/GN-DGL.  ( 2 min )
    Human-Timescale Adaptation in an Open-Ended Task Space. (arXiv:2301.07608v1 [cs.LG])
    Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.  ( 2 min )
    Synthcity: facilitating innovative use cases of synthetic data in different data modalities. (arXiv:2301.07573v1 [cs.LG])
    Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides practitioners with a single access point to cutting-edge research and tools in synthetic data. It also offers the community a playground for rapid experimentation and prototyping, a one-stop-shop for SOTA benchmarks, and an opportunity for extending research impact. The library can be accessed on GitHub (https://github.com/vanderschaarlab/synthcity) and pip (https://pypi.org/project/synthcity/). We warmly invite the community to join the development effort by providing feedback, reporting bugs, and contributing code.  ( 2 min )
    Reslicing Ultrasound Images for Data Augmentation and Vessel Reconstruction. (arXiv:2301.07286v1 [eess.IV])
    Robot-guided catheter insertion has the potential to deliver urgent medical care in situations where medical personnel are unavailable. However, this technique requires accurate and reliable segmentation of anatomical landmarks in the body. For the ultrasound imaging modality, obtaining large amounts of training data for a segmentation model is time-consuming and expensive. This paper introduces RESUS (RESlicing of UltraSound Images), a weak supervision data augmentation technique for ultrasound images based on slicing reconstructed 3D volumes from tracked 2D images. This technique allows us to generate views which cannot be easily obtained in vivo due to physical constraints of ultrasound imaging, and use these augmented ultrasound images to train a semantic segmentation model. We demonstrate that RESUS achieves statistically significant improvement over training with non-augmented images and highlight qualitative improvements through vessel reconstruction.  ( 2 min )
    Learning a Formality-Aware Japanese Sentence Representation. (arXiv:2301.07209v1 [cs.CL])
    While the way intermediate representations are generated in encoder-decoder sequence-to-sequence models typically allows them to preserve the semantics of the input sentence, input features such as formality might be left out. On the other hand, downstream tasks such as translation would benefit from working with a sentence representation that preserves formality in addition to semantics, so as to generate sentences with the appropriate level of social formality -- the difference between speaking to a friend versus speaking with a supervisor. We propose a sequence-to-sequence method for learning a formality-aware representation for Japanese sentences, where sentence generation is conditioned on both the original representation of the input sentence, and a side constraint which guides the sentence representation towards preserving formality information. Additionally, we propose augmenting the sentence representation with a learned representation of formality which facilitates the extraction of formality in downstream tasks. We address the lack of formality-annotated parallel data by adapting previous works on procedural formality classification of Japanese sentences. Experimental results suggest that our techniques not only help the decoder recover the formality of the input sentence, but also slightly improve the preservation of input sentence semantics.  ( 2 min )
    Efficient correlation-based discretization of continuous variables for annealing machines. (arXiv:2301.07244v1 [quant-ph])
    Annealing machines specialized for combinatorial optimization problems have been developed, and some companies offer services to use those machines. Such specialized machines can only handle binary variables, and their input format is the quadratic unconstrained binary optimization (QUBO) formulation. Therefore, discretization is necessary to solve problems with continuous variables. However, there is a severe constraint on the number of binary variables with such machines. Although the simple binary expansion in the previous research requires many binary variables, we need to reduce the number of such variables in the QUBO formulation due to the constraint. We propose a discretization method that involves using correlations of continuous variables. We numerically show that the proposed method reduces the number of necessary binary variables in the QUBO formulation without a significant loss in prediction accuracy.  ( 2 min )
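    To make the variable-count constraint concrete, here is a minimal sketch of the simple binary expansion that the abstract cites as the baseline, not the proposed correlation-based method. The function names, the uniform grid on an interval [lo, hi], and the least-significant-bit-first ordering are assumptions made for illustration.

```python
def encode(x, lo, hi, n_bits):
    """Quantize x in [lo, hi] onto a uniform grid with 2**n_bits points
    and return its binary digits, least significant bit first."""
    levels = 2 ** n_bits - 1
    k = round((x - lo) / (hi - lo) * levels)
    k = max(0, min(levels, k))  # clamp values outside [lo, hi]
    return [(k >> i) & 1 for i in range(n_bits)]


def decode(bits, lo, hi):
    """Map binary variables back to the continuous value they represent:
    x = lo + (hi - lo) * sum_i 2**i * b_i / (2**n_bits - 1)."""
    levels = 2 ** len(bits) - 1
    k = sum(b << i for i, b in enumerate(bits))
    return lo + (hi - lo) * k / levels


if __name__ == "__main__":
    bits = encode(0.37, 0.0, 1.0, 8)
    print(bits, decode(bits, 0.0, 1.0))
```

    Under this scheme, each continuous variable costs `n_bits` binary variables, so a problem with many continuous variables quickly exhausts the binary-variable budget of the machine. This is the cost the abstract aims to reduce.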
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    DDPEN: Trajectory Optimisation With Sub Goal Generation Model. (arXiv:2301.07433v1 [cs.RO])
    Differential dynamic programming (DDP) is a widely used and powerful trajectory optimization technique; however, due to its internal structure, it is not exempt from local minima. In this paper, we present Differential Dynamic Programming with Escape Network (DDPEN), a novel approach to avoid DDP local minima by adding a term to the optimization criterion that points in the direction the robot should move in order to escape local minima. To produce these directions, we propose to use a deep model that takes as input the map of the environment in the form of a costmap together with the desired goal position. The model produces possible future directions that lead to the goal while avoiding local minima, and it can run in real time. The model is trained on a synthetic dataset, and the overall system is evaluated in the Gazebo simulator. In this work we show that our proposed method avoids local minima of the trajectory optimization algorithm and successfully executes a trajectory 278 m long with various convex and nonconvex obstacles.  ( 2 min )
    DIRECT: Learning from Sparse and Shifting Rewards using Discriminative Reward Co-Training. (arXiv:2301.07421v1 [cs.LG])
    We propose discriminative reward co-training (DIRECT) as an extension to deep reinforcement learning algorithms. Building upon the concept of self-imitation learning (SIL), we introduce an imitation buffer to store beneficial trajectories generated by the policy determined by their return. A discriminator network is trained concurrently to the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies. The discriminator's verdict is used to construct a reward signal for optimizing the policy. By interpolating prior experience, DIRECT is able to act as a surrogate, steering policy optimization towards more valuable regions of the reward landscape thus learning an optimal policy. Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments being able to provide a surrogate reward to the policy and direct the optimization towards valuable areas.  ( 2 min )
    Image Embedding for Denoising Generative Models. (arXiv:2301.07485v1 [cs.CV])
    Denoising Diffusion models are gaining increasing popularity in the field of generative modeling for several reasons, including the simple and stable training, the excellent generative quality, and the solid probabilistic foundation. In this article, we address the problem of embedding an image into the latent space of Denoising Diffusion Models, that is, finding a suitable "noisy" image whose denoising results in the original image. We particularly focus on Denoising Diffusion Implicit Models due to the deterministic nature of their reverse diffusion process. As a side result of our investigation, we gain a deeper insight into the structure of the latent space of diffusion models, opening interesting perspectives on its exploration, the definition of semantic trajectories, and the manipulation/conditioning of encodings for editing purposes. A particularly interesting property highlighted by our research, which is also characteristic of this class of generative models, is the independence of the latent representation from the networks implementing the reverse diffusion process. In other words, a common seed passed to different networks (each trained on the same dataset) eventually results in identical images.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms were originally developed to guarantee good performance when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. In this work, we develop dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate such advice and provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    PIRLNav: Pretraining with Imitation and RL Finetuning for ObjectNav. (arXiv:2301.07302v1 [cs.LG])
    We study ObjectGoal Navigation - where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) on a dataset of human demonstrations achieves promising results. However, this has limitations $-$ 1) IL policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present a two-stage learning scheme for IL pretraining on human demonstrations followed by RL-finetuning. This leads to a PIRLNav policy that advances the state-of-the-art on ObjectNav from $60.0\%$ success rate to $65.0\%$ ($+5.0\%$ absolute). Using this IL$\rightarrow$RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with `free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that IL$\rightarrow$RL on human demonstrations outperforms IL$\rightarrow$RL on SP and FE trajectories, even when controlled for the same IL-pretraining success on TRAIN, and even on a subset of VAL episodes where IL-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the IL pretraining dataset. We find that as we increase the size of the IL-pretraining dataset and get to high IL accuracies, the improvements from RL-finetuning are smaller, and that $90\%$ of the performance of our best IL$\rightarrow$RL policy can be achieved with less than half the number of IL demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them.  ( 2 min )
    Beating the Best: Improving on AlphaFold2 at Protein Structure Prediction. (arXiv:2301.07568v1 [q-bio.BM])
    The goal of the Protein Structure Prediction (PSP) problem is to predict a protein's 3D structure (conformation) from its amino acid sequence. The problem has been a 'holy grail' of science since the Nobel Prize-winning work of Anfinsen demonstrated that protein conformation was determined by sequence. A recent and important step towards this goal was the development of AlphaFold2, currently the best PSP method. AlphaFold2 is probably the highest profile application of AI to science. Both AlphaFold2 and RoseTTAFold (another impressive PSP method) have been published and placed in the public domain (code & models). Stacking is a form of ensemble machine learning (ML) in which multiple baseline models are first learnt, then a meta-model is learnt using the outputs of the baseline models to form a model that outperforms the base models. Stacking has been successful in many applications. We developed the ARStack PSP method by stacking AlphaFold2 and RoseTTAFold. ARStack significantly outperforms AlphaFold2. We rigorously demonstrate this using two sets of non-homologous proteins, and a test set of protein structures published after that of AlphaFold2 and RoseTTAFold. As more high quality prediction methods are published, it is likely that ensemble methods will increasingly outperform any single method.  ( 2 min )
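    The stacking idea described in this abstract can be illustrated with a toy numeric sketch. It is not ARStack: the two "base models" are just lists of made-up predictions, and the meta-model is assumed to be a plain least-squares linear combination. In practice the meta-model would be fitted on held-out base predictions rather than on the same data, as done here for brevity.

```python
def fit_meta_weights(p1, p2, y):
    """Least-squares weights (w1, w2) minimising sum((w1*p1 + w2*p2 - y)**2).

    Solves the 2x2 normal equations with Cramer's rule.
    """
    a11 = sum(a * a for a in p1)
    a12 = sum(a * b for a, b in zip(p1, p2))
    a22 = sum(b * b for b in p2)
    b1 = sum(a * t for a, t in zip(p1, y))
    b2 = sum(b * t for b, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


def mse(pred, y):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)


# Toy targets and two hypothetical base models with opposite errors.
y = [1.0, 2.0, 3.0, 4.0]
base_a = [1.5, 1.5, 3.5, 3.5]
base_b = [0.5, 2.5, 2.5, 4.5]

w_a, w_b = fit_meta_weights(base_a, base_b, y)
stacked = [w_a * a + w_b * b for a, b in zip(base_a, base_b)]

if __name__ == "__main__":
    print("weights:", w_a, w_b)
    print("MSE base A, base B, stacked:", mse(base_a, y), mse(base_b, y), mse(stacked, y))
```

    Because the two toy base models err in opposite directions, the learned combination cancels their errors, which is the intuition behind a meta-model outperforming each of its inputs.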
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
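    The PDE mentioned in this abstract can be sketched from the continuity equation, assuming that "transporting samples along the family of densities" means that $p_t$ and $V_t$ satisfy $\partial_t p_t + \nabla\cdot(p_t V_t) = 0$. The normalizing constant $Z_t$ is notation introduced here for the derivation; the exact form and weighting used in the paper may differ.

```latex
\begin{align}
p_t(x) &= \frac{e^{-f_t(x)}}{Z_t}, \qquad Z_t = \int e^{-f_t(x)}\,dx, \\
0 &= \partial_t p_t + \nabla\cdot(p_t V_t)
   = p_t\left(-\partial_t f_t - \partial_t \log Z_t + \nabla\cdot V_t - V_t\cdot\nabla f_t\right), \\
\partial_t \log Z_t &= -\mathbb{E}_{p_t}\!\left[\partial_t f_t\right], \\
\Longrightarrow\quad \nabla\cdot V_t - V_t\cdot\nabla f_t
  &= \partial_t f_t - \mathbb{E}_{p_t}\!\left[\partial_t f_t\right].
\end{align}
```

    Under these assumptions, a natural training loss is the squared residual of the last line, which involves only $f_t$, $V_t$, and their derivatives.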
    Threats, Vulnerabilities, and Controls of Machine Learning Based Systems: A Survey and Taxonomy. (arXiv:2301.07474v1 [cs.CR])
    In this article, we propose the Artificial Intelligence Security Taxonomy to systematize the knowledge of threats, vulnerabilities, and security controls of ML-based systems. We first classify the damage caused by attacks against ML-based systems, define ML-specific security, and discuss its characteristics. Next, we enumerate all relevant assets and stakeholders and provide a general taxonomy for ML-specific threats. Then, we collect a wide range of security controls against ML-specific threats through an extensive review of recent literature. Finally, we classify the vulnerabilities and controls of an ML-based system in terms of each vulnerable asset in the system's entire lifecycle.  ( 2 min )
    PTA-Det: Point Transformer Associating Point cloud and Image for 3D Object Detection. (arXiv:2301.07301v1 [cs.CV])
    In autonomous driving, 3D object detection based on multi-modal data has become an indispensable approach when facing complex environments around the vehicle. During multi-modal detection, LiDAR and camera are applied simultaneously for capturing and modeling. However, due to the intrinsic discrepancies between LiDAR points and camera images, the fusion of the data for object detection encounters a series of problems. Most multi-modal detection methods perform even worse than LiDAR-only methods. In this investigation, we propose a method named PTA-Det to improve the performance of multi-modal detection. Accompanying PTA-Det, a Pseudo Point Cloud Generation Network is proposed, which can represent image information, including texture and semantic features, by pseudo points. Thereafter, through a transformer-based Point Fusion Transition (PFT) module, the features of LiDAR points and pseudo points from the image can be deeply fused under a unified point-based representation. The combination of these modules can overcome the major obstacle in feature fusion across modalities and realize a complementary and discriminative representation for proposal generation. Extensive experiments on the KITTI dataset show that PTA-Det achieves competitive results and support its effectiveness.  ( 2 min )
    Causal Falsification of Digital Twins. (arXiv:2301.07210v1 [stat.ME])
    Digital twins hold substantial promise in many applications, but rigorous procedures for assessing their accuracy are essential for their widespread deployment in safety-critical settings. By formulating this task within the framework of causal inference, we show it is not possible to certify that a twin is "correct" using real-world observational data unless potentially tenuous assumptions are made about the data-generating process. To avoid these assumptions, we propose an assessment strategy that instead aims to find cases where the twin is not correct, and present a general-purpose statistical procedure for doing so that may be used across a wide variety of applications and twin models. Our approach yields reliable and actionable information about the twin under only the assumption of an i.i.d. dataset of real-world observations, and in particular remains sound even in the presence of arbitrary unmeasured confounding. We demonstrate the effectiveness of our methodology via a large-scale case study involving sepsis modelling within the Pulse Physiology Engine, which we assess using the MIMIC-III dataset of ICU patients.  ( 2 min )
    Improve Noise Tolerance of Robust Loss via Noise-Awareness. (arXiv:2301.07306v1 [cs.LG])
    Robust loss minimization is an important strategy for robust learning with noisy labels. Current robust losses, however, inevitably involve hyperparameters that must be tuned for different datasets with noisy labels, manually or heuristically through cross validation, which makes them fairly hard to apply in practice. Existing robust loss methods usually assume that all training samples share common hyperparameters, which are independent of instances. This limits the ability of these methods to distinguish the individual noise properties of different samples, making it hard for them to adapt to different noise structures. To address these issues, we propose to equip robust losses with instance-dependent hyperparameters to improve their noise tolerance with theoretical guarantees. To set such instance-dependent hyperparameters, we propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster). Specifically, through mutual amelioration between the hyperparameter prediction function and the classifier parameters, both can be simultaneously refined and coordinated to attain solutions with good generalization capability. We integrate four kinds of SOTA robust losses with our algorithm, and experiments substantiate the general applicability and effectiveness of the proposed method in both noise tolerance and generalization performance. Meanwhile, the explicit parameterized structure makes the meta-learned prediction function readily transferable to unseen datasets with noisy labels in a plug-and-play manner. Specifically, we transfer our meta-learned NARL-Adjuster to unseen tasks, including several real noisy datasets, and achieve better performance compared with conventional hyperparameter tuning strategies.  ( 2 min )
    Tailor: Altering Skip Connections for Resource-Efficient Inference. (arXiv:2301.07247v1 [cs.CV])
    Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this paper, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network's skip connections are needed for the network to learn, they can later be removed or shortened to provide a more hardware efficient implementation with minimal to no accuracy loss. We introduce Tailor, a codesign tool whose hardware-aware training algorithm gradually removes or shortens a fully trained network's skip connections to lower their hardware cost. The optimized hardware designs improve resource utilization by up to 34% for BRAMs, 13% for FFs, and 16% for LUTs.  ( 2 min )
    Detecting and Ranking Causal Anomalies in End-to-End Complex System. (arXiv:2301.07281v1 [cs.LG])
    With the rapid development of technology, automated monitoring systems for large-scale factories are becoming more and more important. By collecting a large amount of machine sensor data, we have many ways to find anomalies. We believe that the real core value of an automated monitoring system is to identify and track the cause of a problem. The best-known method for finding causal anomalies is RCA, but it has several problems that cannot be ignored. It uses the AutoRegressive eXogenous (ARX) model to create a time-invariant correlation network as a machine profile, and then uses this profile to track causal anomalies by means of a method called fault propagation. There are two major problems in describing the behavior of a machine using the correlation network established by ARX: (1) it does not take into account the diversity of states; (2) it does not separately consider correlations with different time lags. Based on these problems, we propose a framework called Ranking Causal Anomalies in End-to-End System (RCAE2E), which completely solves the problems mentioned above. In the experimental part, we use synthetic data and real-world large-scale photoelectric factory data to verify the correctness and existence of our method hypothesis.  ( 2 min )
    Towards Models that Can See and Read. (arXiv:2301.07389v1 [cs.CV])
    Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite the obvious resemblance between them, the two are treated independently, yielding task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on VQA and CAP by up to 3.49% and 0.7 CIDEr, respectively.  ( 2 min )
    Relativistic Digital Twin: Bringing the IoT to the Future. (arXiv:2301.07390v1 [cs.NI])
    Complex IoT ecosystems often require the usage of Digital Twins (DTs) of their physical assets in order to perform predictive analytics and simulate what-if scenarios. DTs are able to replicate IoT devices and adapt over time to their behavioral changes. However, DTs in IoT are typically tailored to a specific use case, without the possibility to seamlessly adapt to different scenarios. Further, the fragmentation of IoT poses additional challenges on how to deploy DTs in heterogeneous scenarios characterized by the usage of multiple data formats and IoT network protocols. In this paper, we propose the Relativistic Digital Twin (RDT) framework, through which we automatically generate general purpose DTs of IoT entities and tune their behavioral models over time by constantly observing their real counterparts. The framework relies on the object representation via the Web of Things (WoT), to offer a standardized interface to each of the IoT devices as well as to their DTs. To this purpose, we extended the W3C WoT standard in order to encompass the concept of behavioral model and define it in the Thing Description (TD) through a new vocabulary. Finally, we evaluated the RDT framework over two disjoint use cases to assess its correctness and learning performance, i.e. the DT of a simulated smart home scenario with the capability of forecasting the indoor temperature, and the DT of a real-world drone with the capability of forecasting its trajectory in an outdoor scenario.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Label Inference Attack against Split Learning under Regression Setting. (arXiv:2301.07284v1 [cs.CR])
    As a crucial building block in vertical Federated Learning (vFL), Split Learning (SL) has demonstrated its practicality in two-party model training collaboration, where one party holds the features of data samples and another party holds the corresponding labels. Such a method is claimed to be private, considering that the shared information is only the embedding vectors and gradients instead of private raw data and labels. However, some recent works have shown that the private labels could be leaked by the gradients. These existing attacks only work under the classification setting, where the private labels are discrete. In this work, we go a step further to study the leakage in the scenario of the regression model, where the private labels are continuous numbers (instead of discrete labels in classification). This makes it harder for previous attacks to infer the continuous labels due to the unbounded output range. To address the limitation, we propose a novel learning-based attack that integrates gradient information and extra learning regularization objectives in aspects of model training properties, which can infer the labels under regression settings effectively. Comprehensive experiments on various datasets and models have demonstrated the effectiveness of our proposed attack. We hope our work can pave the way for future analyses that make the vFL framework more secure.  ( 2 min )
    Adapting Multilingual Speech Representation Model for a New, Underresourced Language through Multilingual Fine-tuning and Continued Pretraining. (arXiv:2301.07295v1 [cs.CL])
    In recent years, neural models learned through self-supervised pretraining on large scale multilingual text or speech data have exhibited promising results for underresourced languages, especially when a relatively large amount of data from related language(s) is available. While the technology has a potential for facilitating tasks carried out in language documentation projects, such as speech transcription, pretraining a multilingual model from scratch for every new language would be highly impractical. We investigate the possibility for adapting an existing multilingual wav2vec 2.0 model for a new language, focusing on actual fieldwork data from a critically endangered tongue: Ainu. Specifically, we (i) examine the feasibility of leveraging data from similar languages also in fine-tuning; (ii) verify whether the model's performance can be improved by further pretraining on target language data. Our results show that continued pretraining is the most effective method to adapt a wav2vec 2.0 model for a new language and leads to considerable reduction in error rates. Furthermore, we find that if a model pretrained on a related speech variety or an unrelated language with similar phonological characteristics is available, multilingual fine-tuning using additional data from that language can have positive impact on speech recognition performance when there is very little labeled data in the target language.  ( 2 min )
    A variational autoencoder-based nonnegative matrix factorisation model for deep dictionary learning. (arXiv:2301.07272v1 [cs.LG])
    Construction of dictionaries using nonnegative matrix factorisation (NMF) has extensive applications in signal processing and machine learning. With the advances in deep learning, training compact and robust dictionaries using deep neural networks, i.e., dictionaries of deep features, has been proposed. In this study, we propose a probabilistic generative model which employs a variational autoencoder (VAE) to perform nonnegative dictionary learning. In contrast to the existing VAE models, we cast the model under a statistical framework with latent variables obeying a Gamma distribution and design a new loss function to guarantee the nonnegative dictionaries. We adopt an acceptance-rejection sampling reparameterization trick to update the latent variables iteratively. We apply the dictionaries learned from VAE-NMF to two signal processing tasks, i.e., enhancement of speech and extraction of muscle synergies. Experimental results demonstrate that VAE-NMF performs better in learning the latent nonnegative dictionaries in comparison with state-of-the-art methods.  ( 2 min )
    Tracking Brand-Associated Polarity-Bearing Topics in User Reviews. (arXiv:2301.07183v1 [cs.IR])
    Monitoring online customer reviews is important for business organisations to measure customer satisfaction and better manage their reputations. In this paper, we propose a novel dynamic Brand-Topic Model (dBTM) which is able to automatically detect and track brand-associated sentiment scores and polarity-bearing topics from product reviews organised in temporally-ordered time intervals. dBTM models the evolution of the latent brand polarity scores and the topic-word distributions over time by Gaussian state space models. It also incorporates a meta learning strategy to control the update of the topic-word distribution in each time interval in order to ensure smooth topic transitions and better brand score predictions. It has been evaluated on a dataset constructed from MakeupAlley reviews and a hotel review dataset. Experimental results show that dBTM outperforms a number of competitive baselines in brand ranking, achieving a good balance of topic coherence and uniqueness, and extracting well-separated polarity-bearing topics across time intervals.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS ) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: \url{https://CRAN.R-project.org/package=dual.spls}.  ( 2 min )
    Scaffold-Based Multi-Objective Drug Candidate Optimization. (arXiv:2301.07175v1 [q-bio.BM])
    Multiparameter optimization (MPO) provides a means to assess and balance several variables based on their importance to the overall objective. However, using MPO methods in therapeutic discovery is challenging due to the number of cheminformatics properties required to find an optimal solution. High throughput virtual screening to identify hit candidates produces a large amount of data with conflicting properties. For instance, toxicity and binding affinity can contradict each other and cause improbable levels of toxicity that can lead to adverse effects. Instead of using the exhaustive method of treating each property, multiple properties can be combined into a single MPO score, with weights assigned for each property. This desirability score also lends itself well to ML applications that can use the score in the loss function. In this work, we discuss a scaffold-focused, graph-based Markov chain Monte Carlo framework built to generate molecules with optimal properties. This framework trains itself on-the-fly with the MPO score of each iteration of molecules, and is able to work on a greater number of properties and sample the chemical space around a starting scaffold. Results are compared to the chemical Transformer model molGCT to judge performance between graph and natural language processing approaches.  ( 2 min )
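    Combining several properties into a single weighted MPO score, as this abstract describes, might look like the sketch below. The weighted arithmetic mean, the property names, and the numbers are all assumptions made for illustration; the abstract does not state which aggregation or properties the authors use.

```python
def mpo_score(desirability, weights):
    """Weighted mean of per-property desirability values.

    desirability: dict mapping property name -> value in [0, 1],
                  where 1 is most desirable (assumed convention).
    weights:      dict mapping property name -> non-negative importance.
    """
    for name, value in desirability.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"desirability for {name!r} must lie in [0, 1]")
    total_weight = sum(weights[name] for name in desirability)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * value for name, value in desirability.items())
    return weighted / total_weight


if __name__ == "__main__":
    # Hypothetical molecule: strong binding but poor toxicity profile.
    molecule = {"binding_affinity": 0.8, "low_toxicity": 0.2}
    importance = {"binding_affinity": 3.0, "low_toxicity": 1.0}
    print(mpo_score(molecule, importance))
```

    A single scalar of this kind is what makes the score usable inside a loss function or as an acceptance criterion in a sampling loop.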
    Artificial Neuronal Ensembles with Learned Context Dependent Gating. (arXiv:2301.07187v1 [cs.LG])
    Biological neural networks are capable of recruiting different sets of neurons to encode different memories. However, when training artificial neural networks on a set of tasks, typically, no mechanism is employed for selectively producing anything analogous to these neuronal ensembles. Further, artificial neural networks suffer from catastrophic forgetting, where the network's performance rapidly deteriorates as tasks are learned sequentially. By contrast, sequential learning is possible for a range of biological organisms. We introduce Learned Context Dependent Gating (LXDG), a method to flexibly allocate and recall `artificial neuronal ensembles', using a particular network structure and a new set of regularization terms. Activities in the hidden layers of the network are modulated by gates, which are dynamically produced during training. The gates are outputs of networks themselves, trained with a sigmoid output activation. The regularization terms we have introduced correspond to properties exhibited by biological neuronal ensembles. The first term penalizes low gate sparsity, ensuring that only a specified fraction of the network is used. The second term ensures that previously learned gates are recalled when the network is presented with input from previously learned tasks. Finally, there is a regularization term responsible for ensuring that new tasks are encoded in gates that are as orthogonal as possible from previously used ones. We demonstrate the ability of this method to alleviate catastrophic forgetting on continual learning benchmarks. When the new regularization terms are included in the model along with Elastic Weight Consolidation (EWC) it achieves better performance on the benchmark `permuted MNIST' than with EWC alone. The benchmark `rotated MNIST' demonstrates how similar tasks recruit similar neurons to the artificial neuronal ensemble.  ( 2 min )
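    The gating mechanism and two of the regularization terms described in this abstract can be sketched in a few lines. The specific penalty forms below (squared deviation of mean gate activity from a target fraction, and cosine overlap between gate vectors) are assumptions chosen for illustration, not the formulas from the paper, and the gate logits stand in for the output of a gating network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gates_from_logits(logits):
    """Gate values in (0, 1), one per hidden unit (sigmoid output activation)."""
    return [sigmoid(z) for z in logits]


def apply_gates(hidden, gates):
    """Modulate hidden activities elementwise by the gates."""
    return [h * g for h, g in zip(hidden, gates)]


def sparsity_penalty(gates, target_fraction):
    """Assumed form: squared gap between mean gate activity and a target fraction."""
    mean_gate = sum(gates) / len(gates)
    return (mean_gate - target_fraction) ** 2


def overlap_penalty(new_gates, old_gates):
    """Assumed form: cosine similarity; 0 when the gate vectors are orthogonal."""
    dot = sum(a * b for a, b in zip(new_gates, old_gates))
    norm_new = math.sqrt(sum(a * a for a in new_gates))
    norm_old = math.sqrt(sum(b * b for b in old_gates))
    if norm_new == 0.0 or norm_old == 0.0:
        return 0.0
    return dot / (norm_new * norm_old)


if __name__ == "__main__":
    gates = gates_from_logits([4.0, -4.0, -4.0, -4.0])
    print(apply_gates([1.0, 2.0, 3.0, 4.0], gates))
    print(sparsity_penalty(gates, 0.25))
    print(overlap_penalty(gates, gates_from_logits([-4.0, 4.0, -4.0, -4.0])))
```

    In a training loop these penalties would be added to the task loss with tunable coefficients, so that each task uses a limited set of units that overlaps little with the units used by earlier tasks.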
    Revisiting mass-radius relationships for exoplanet populations: a machine learning insight. (arXiv:2301.07143v1 [astro-ph.EP])
    The growing number of exoplanet discoveries and advances in machine learning techniques allow us to find, explore, and understand characteristics of these new worlds beyond our Solar System. We analyze the dataset of 762 confirmed exoplanets and eight Solar System planets using efficient machine-learning approaches to characterize their fundamental quantities. By adopting different unsupervised clustering algorithms, the data are divided into two main classes: planets with $\log R_{p}\leq0.91R_{\oplus}$ and $\log M_{p}\leq1.72M_{\oplus}$ as class 1 and those with $\log R_{p}>0.91R_{\oplus}$ and $\log M_{p}>1.72M_{\oplus}$ as class 2. Various regression models are used to reveal correlations between physical parameters and evaluate their performance. We find that planetary mass, orbital period, and stellar mass play preponderant roles in predicting exoplanet radius. The validation metrics (RMSE, MAE, and $R^{2}$) suggest that the Support Vector Regression has, by and large, better performance than other models and is a promising model for obtaining planetary radius. Not only do we improve the prediction accuracy in logarithmic space, but also we derive parametric equations using the M5P and Markov Chain Monte Carlo methods. Planets of class 1 are shown to be consistent with a positive linear mass-radius relation, while for planets of class 2, the planetary radius represents a strong correlation with their host stars' masses.  ( 2 min )
    Heterogeneous Multi-Robot Reinforcement Learning. (arXiv:2301.07137v1 [cs.RO])
    Cooperative multi-robot tasks can benefit from heterogeneity in the robots' physical and behavioral traits. In spite of this, traditional Multi-Agent Reinforcement Learning (MARL) frameworks lack the ability to explicitly accommodate policy heterogeneity, and typically constrain agents to share neural network parameters. This enforced homogeneity limits application in cases where the tasks benefit from heterogeneous behaviors. In this paper, we crystallize the role of heterogeneity in MARL policies. Towards this end, we introduce Heterogeneous Graph Neural Network Proximal Policy Optimization (HetGPPO), a paradigm for training heterogeneous MARL policies that leverages a Graph Neural Network for differentiable inter-agent communication. HetGPPO allows communicating agents to learn heterogeneous behaviors while enabling fully decentralized training in partially observable environments. We complement this with a taxonomical overview that exposes more heterogeneity classes than previously identified. To motivate the need for our model, we present a characterization of techniques that homogeneous models can leverage to emulate heterogeneous behavior, and show how this "apparent heterogeneity" is brittle in real-world conditions. Through simulations and real-world experiments, we show that: (i) when homogeneous methods fail due to strong heterogeneous requirements, HetGPPO succeeds, and, (ii) when homogeneous methods are able to learn apparently heterogeneous behaviors, HetGPPO achieves higher resilience to both training and deployment noise.  ( 2 min )
    Mortality Prediction with Adaptive Feature Importance Recalibration for Peritoneal Dialysis Patients: a deep-learning-based study on a real-world longitudinal follow-up dataset. (arXiv:2301.07107v1 [cs.LG])
    Objective: Peritoneal Dialysis (PD) is one of the most widely used life-supporting therapies for patients with End-Stage Renal Disease (ESRD). Predicting mortality risk and identifying modifiable risk factors based on the Electronic Medical Records (EMR) collected along with the follow-up visits are of great importance for personalized medicine and early intervention. Here, our objective is to develop AICare, a deep learning model for real-time, individualized, and interpretable mortality prediction. Method and Materials: Our proposed model consists of a multi-channel feature extraction module and an adaptive feature importance recalibration module. AICare explicitly identifies the key features that strongly indicate the outcome prediction for each patient to build the health status embedding individually. This study has collected 13,091 clinical follow-up visits and demographic data of 656 PD patients. To verify the application universality, this study has also collected 4,789 visits of 1,363 hemodialysis (HD) patients as an additional experiment dataset to test the prediction performance, which will be discussed in the Appendix. Results: 1) Experiment results show that AICare achieves 81.6%/74.3% AUROC and 47.2%/32.5% AUPRC for the 1-year mortality prediction task on the PD/HD datasets respectively, which outperforms the state-of-the-art comparative deep learning models. 2) This study is the first to provide a comprehensive elucidation of the relationship between the causes of mortality in patients with PD and clinical features based on an end-to-end deep learning model. 3) This study is the first to reveal the pattern of variation in the importance of each feature in the mortality prediction based on built-in interpretability. 4) We develop a practical AI-Doctor interaction system to visualize the trajectory of patients' health status and risk indicators.  ( 3 min )
    Genetic Imitation Learning by Reward Extrapolation. (arXiv:2301.07182v1 [cs.NE])
    Imitation learning demonstrates remarkable performance in various domains. However, imitation learning is also constrained by many prerequisites. The research community has done intensive research to alleviate these constraints, such as adding a stochastic policy to avoid unseen states, eliminating the need for action labels, and learning from suboptimal demonstrations. Inspired by the natural reproduction process, we propose a method called GenIL that integrates the Genetic Algorithm with imitation learning. The involvement of the Genetic Algorithm improves the data efficiency by reproducing trajectories with various returns and assists the model in estimating more accurate and compact reward function parameters. We tested GenIL in both Atari and Mujoco domains, and the results show that it outperforms the previous extrapolation methods in extrapolation accuracy, robustness, and overall policy performance when input data is limited.  ( 2 min )
    A Combinatorial Semi-Bandit Approach to Charging Station Selection for Electric Vehicles. (arXiv:2301.07156v1 [cs.LG])
    In this work, we address the problem of long-distance navigation for battery electric vehicles (BEVs), where one or more charging sessions are required to reach the intended destination. We consider the availability and performance of the charging stations to be unknown and stochastic, and develop a combinatorial semi-bandit framework for exploring the road network to learn the parameters of the queue time and charging power distributions. Within this framework, we first outline a pre-processing for the road network graph to handle the constrained combinatorial optimization problem in an efficient way. Then, for the pre-processed graph, we use a Bayesian approach to model the stochastic edge weights, utilizing conjugate priors for the one-parameter exponential and two-parameter gamma distributions, the latter of which is novel to multi-armed bandit literature. Finally, we apply combinatorial versions of Thompson Sampling, BayesUCB and Epsilon-greedy to the problem. We demonstrate the performance of our framework on long-distance navigation problem instances in country-sized road networks, with simulation experiments in Norway, Sweden and Finland.  ( 2 min )
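    The Bayesian step described above can be illustrated with the standard exponential-Gamma conjugate pair: with a Gamma(alpha, beta) prior on the rate of an exponential waiting time, each observed time t updates the posterior to Gamma(alpha + 1, beta + t). The sketch below is a minimal, non-combinatorial Thompson Sampling loop over three hypothetical stations; the class name, the prior values, and the true mean queue times are assumptions made for illustration and are not taken from the paper, which works on paths in a road network rather than on independent arms.

```python
import numpy as np


class ExponentialArm:
    """Gamma(alpha, beta) posterior over the rate of an exponential queue time."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # shape: prior pseudo-count of observations
        self.beta = beta    # rate: prior pseudo-sum of observed times

    def update(self, observed_time):
        # Conjugate update for one exponential observation.
        self.alpha += 1.0
        self.beta += observed_time

    def sample_mean_time(self, rng):
        # Draw a plausible rate from the posterior, convert to a mean time.
        rate = rng.gamma(shape=self.alpha, scale=1.0 / self.beta)
        return 1.0 / rate


def thompson_select(arms, rng):
    """Pick the arm whose sampled mean queue time is smallest."""
    draws = [arm.sample_mean_time(rng) for arm in arms]
    return int(np.argmin(draws))


rng = np.random.default_rng(0)
true_mean_times = [10.0, 4.0, 7.0]  # hypothetical stations (minutes)
arms = [ExponentialArm() for _ in true_mean_times]
pulls = [0] * len(arms)

for _ in range(2000):
    k = thompson_select(arms, rng)
    arms[k].update(rng.exponential(true_mean_times[k]))
    pulls[k] += 1
```

    In this toy setting the pull counts should concentrate on the station with the shortest mean queue time as the posteriors sharpen.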
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )
    Continuous Trajectory Generation Based on Two-Stage GAN. (arXiv:2301.07103v1 [cs.LG])
    Simulating human mobility and generating large-scale trajectories are of great use in many real-world applications, such as urban planning, epidemic spreading analysis, and geographic privacy protection. Although many previous works have studied the problem of trajectory generation, the continuity of the generated trajectories has been neglected, which makes these methods useless for practical urban simulation scenarios. To solve this problem, we propose a novel two-stage generative adversarial framework to generate continuous trajectories on the road network, namely TS-TrajGen, which efficiently integrates prior domain knowledge of human mobility with a model-free learning paradigm. Specifically, we build the generator under the human mobility hypothesis of the A* algorithm to learn the human mobility behavior. For the discriminator, we combine the sequential reward with the mobility yaw reward to enhance the effectiveness of the generator. Finally, we propose a novel two-stage generation process to overcome the weak point of the existing stochastic generation process. Extensive experiments on two real-world datasets and two case studies demonstrate that our framework yields significant improvements over the state-of-the-art methods.  ( 2 min )
    On Using Deep Learning Proxies as Forward Models in Deep Learning Problems. (arXiv:2301.07102v1 [cs.LG])
    Physics-based optimization problems are generally very time-consuming, especially due to the computational complexity associated with the forward model. Recent works have demonstrated that physics-modelling can be approximated with neural networks. However, there is always a certain degree of error associated with this learning, and we study this aspect in this paper. We demonstrate through experiments on popular mathematical benchmarks that neural network approximations (NN-proxies) of such functions, when plugged into the optimization framework, can lead to erroneous results. In particular, we study the behavior of particle swarm optimization and genetic algorithm methods and analyze their stability when coupled with NN-proxies. The correctness of the approximate model depends on the extent of sampling conducted in the parameter space, and through numerical experiments, we demonstrate that caution needs to be taken when constructing this landscape with neural networks. Further, the NN-proxies are hard to train for higher dimensional functions, and we present our insights for 4D and 10D problems. The error is higher for such cases, and we demonstrate that it is sensitive to the choice of the sampling scheme used to build the NN-proxy. The code is available at https://github.com/Fa-ti-ma/NN-proxy-in-optimization.  ( 2 min )
    The moral authority of ChatGPT. (arXiv:2301.07098v1 [cs.CY])
    ChatGPT is not only fun to chat with, but it also searches information, answers questions, and gives advice. With consistent moral advice, it might improve the moral judgment and decisions of users, who often hold contradictory moral beliefs. Unfortunately, ChatGPT turns out to be highly inconsistent as a moral advisor. Nonetheless, we find in an experiment that it influences users' moral judgment even if they know they are advised by a chatting bot, and that they underestimate how much they are influenced. Thus, ChatGPT threatens to corrupt rather than improve users' judgment. These findings raise the question of how to ensure the responsible use of ChatGPT and similar AI. Transparency is often touted but seems ineffective. We propose training to improve digital literacy.  ( 2 min )
    Distributed LSTM-Learning from Differentially Private Label Proportions. (arXiv:2301.07101v1 [cs.LG])
    Data privacy and decentralised data collection have become more and more popular in recent years. In order to solve issues with privacy, communication bandwidth and learning from spatio-temporal data, we propose two efficient models which use Differential Privacy and decentralized LSTM-Learning. In the first, a Long Short Term Memory (LSTM) model is learned for extracting local temporal node constraints and feeding them into a Dense-Layer (LabelProportionToLocal). The second approach extends the first one by fetching histogram data from the neighbors and joining the information with the LSTM output (LabelProportionToDense). For evaluation, two popular datasets are used: Pems-Bay and METR-LA. Additionally, we provide our own dataset, which is based on LuST. The evaluation shows the tradeoff between performance and data privacy.  ( 2 min )
    EENet: Learning to Early Exit for Adaptive Inference. (arXiv:2301.07099v1 [cs.LG])
    Budgeted adaptive inference with early exits is an emerging technique to improve the computational efficiency of deep neural networks (DNNs) for edge AI applications with limited resources at test time. This method leverages the fact that different test data samples may not require the same amount of computation for a correct prediction. By allowing early exiting from full layers of DNN inference for some test examples, we can reduce latency and improve throughput of edge inference while preserving performance. Although there have been numerous studies on designing specialized DNN architectures for training early-exit enabled DNN models, most of the existing work employs hand-tuned or manual rule-based early exit policies. In this study, we introduce a novel multi-exit DNN inference framework, called EENet, which leverages multi-objective learning to optimize the early exit policy for a trained multi-exit DNN under a given inference budget. This paper makes two novel contributions. First, we introduce the concept of early exit utility scores by combining diverse confidence measures with class-wise prediction scores to better estimate the correctness of test-time predictions at a given exit. Second, we train a lightweight, budget-driven, multi-objective neural network over validation predictions to learn the exit assignment scheduling for query examples at test time. The EENet early exit scheduler optimizes both the distribution of test samples to different exits and the selection of the exit utility thresholds such that the given inference budget is satisfied while the performance metric is maximized. Extensive experiments are conducted on five benchmarks, including three image datasets (CIFAR-10, CIFAR-100, ImageNet) and two NLP datasets (SST-2, AgNews). The results demonstrate the performance improvements of EENet compared to existing representative early exit techniques.  ( 2 min )
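    For orientation, the kind of hand-tuned, rule-based early exit policy that the abstract contrasts with can be sketched as a fixed confidence threshold per exit. This is not EENet's learned scheduler: the function names, the use of maximum softmax probability as the confidence score, and the threshold values are assumptions chosen only to show the control flow.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def early_exit_predict(exit_logits, thresholds):
    """Return (predicted_class, exit_index) for one example.

    exit_logits: list of per-exit logit vectors, ordered shallow to deep.
    thresholds:  hand-picked confidence threshold for each exit.
    The example leaves at the first exit whose top softmax probability
    reaches its threshold; otherwise the last exit's prediction is used.
    """
    for k, (logits, tau) in enumerate(zip(exit_logits, thresholds)):
        probs = softmax(np.asarray(logits, dtype=float))
        if probs.max() >= tau:
            return int(probs.argmax()), k
    return int(probs.argmax()), len(exit_logits) - 1


# A confident example leaves at exit 0; an ambiguous one continues to exit 1.
easy = early_exit_predict([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]], [0.9, 0.0])
hard = early_exit_predict([[0.2, 0.1, 0.0], [0.0, 4.0, 0.0]], [0.9, 0.0])
```

    A learned policy would replace both the confidence score and the fixed thresholds with quantities fitted on validation predictions under a compute budget.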
  • Open

    Data thinning for convolution-closed distributions. (arXiv:2301.07276v1 [stat.ME])
    We propose data thinning, a new approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to any observation drawn from a "convolution closed" distribution, a class that includes the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. It is similar in spirit to -- but distinct from, and more easily applicable than -- a recent proposal known as data fission. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis.  ( 2 min )
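    The Poisson case gives a concrete picture of the splitting described above, using the classical thinning property: if X ~ Poisson(lambda) and X1 | X ~ Binomial(X, eps), then X1 ~ Poisson(eps * lambda) and X2 = X - X1 ~ Poisson((1 - eps) * lambda), with X1 and X2 independent. The sketch below is a minimal simulation of that fact; the function name and the values of lambda and eps are assumptions for illustration, and other convolution-closed families require their own splitting distributions.

```python
import numpy as np


def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum back to x.

    Each unit of the count is assigned to the first part independently
    with probability eps (binomial thinning).
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)
    x2 = x - x1
    return x1, x2


rng = np.random.default_rng(0)
lam, eps = 5.0, 0.3
x = rng.poisson(lam, size=100_000)
x1, x2 = thin_poisson(x, eps, rng)

# Empirical means should sit near eps * lam and (1 - eps) * lam,
# and the sample correlation between the two parts should be near zero.
mean_x1, mean_x2 = x1.mean(), x2.mean()
corr = np.corrcoef(x1, x2)[0, 1]
```

    One part can then be used to fit a model and the other to evaluate it, which is the cross-validation use the abstract mentions.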
    Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection. (arXiv:2210.15060v2 [stat.ME] UPDATED)
    We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.  ( 2 min )
    Learning Deformation Trajectories of Boltzmann Densities. (arXiv:2301.07388v1 [stat.ML])
    We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = (|x|/\sigma)^p$. This then induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along this family of densities. Concretely, this condition can be translated to a PDE between $V_t$ and $f_t$ and we minimize the amount by which this PDE fails to hold. We compare this objective to the reverse KL-divergence on Gaussian mixtures and on the $\phi^4$ lattice field theory on a circle.  ( 2 min )
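    One way to make the PDE condition concrete is to start from the continuity equation for the density path and substitute the Boltzmann form. The derivation below is a sketch under the assumption that $p_t = e^{-f_t}/Z_t$ with $Z_t = \int e^{-f_t(x)}\,dx$ finite and differentiable in $t$; the paper may state the condition in a different but equivalent form.

```latex
% Assumed notation: Z_t = \int e^{-f_t(x)}\,dx, so that p_t = e^{-f_t}/Z_t.
\begin{align*}
\partial_t p_t + \nabla\cdot(p_t V_t) &= 0
  && \text{(continuity equation)} \\
\partial_t \log p_t &= -\partial_t f_t + \mathbb{E}_{p_t}[\partial_t f_t]
  && \text{(since } \partial_t \log Z_t = -\mathbb{E}_{p_t}[\partial_t f_t]\text{)} \\
\nabla\cdot V_t - \nabla f_t\cdot V_t - \partial_t f_t + \mathbb{E}_{p_t}[\partial_t f_t] &= 0
  && \text{(divide the first line by } p_t \text{ and use } \nabla\log p_t = -\nabla f_t\text{)}
\end{align*}
```

    A training loss can then penalize the squared left-hand side of the last line at sampled points $(x, t)$, which requires only the energy functions and not samples from the target.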
    Reliable amortized variational inference with physics-based latent distribution correction. (arXiv:2207.11640v3 [stat.ML] UPDATED)
    Bayesian inference for high-dimensional inverse problems is computationally costly and requires selecting a suitable prior distribution. Amortized variational inference addresses these challenges via a neural network that approximates the posterior distribution not only for one instance of data, but a distribution of data pertaining to a specific inverse problem. During inference, the neural network -- in our case a conditional normalizing flow -- provides posterior samples at virtually no cost. However, the accuracy of amortized variational inference relies on the availability of high-fidelity training data, which seldom exists in geophysical inverse problems due to the Earth's heterogeneity. In addition, the network is prone to errors if evaluated over out-of-distribution data. As such, we propose to increase the resilience of amortized variational inference in the presence of moderate data distribution shifts. We achieve this via a correction to the latent distribution that improves the posterior distribution approximation for the data at hand. The correction involves relaxing the standard Gaussian assumption on the latent distribution and parameterizing it via a Gaussian distribution with an unknown mean and (diagonal) covariance. These unknowns are then estimated by minimizing the Kullback-Leibler divergence between the corrected and the (physics-based) true posterior distributions. While generic and applicable to other inverse problems, by means of a linearized seismic imaging example, we show that our correction step improves the robustness of amortized variational inference with respect to changes in the number of seismic sources, noise variance, and shifts in the prior distribution. This approach provides a seismic image with limited artifacts and an assessment of its uncertainty at approximately the same cost as five reverse-time migrations.  ( 2 min )
    A Nonsmooth Dynamical Systems Perspective on Accelerated Extensions of ADMM. (arXiv:1808.04048v7 [math.OC] UPDATED)
    Recently, there has been great interest in connections between continuous-time dynamical systems and optimization methods, notably in the context of accelerated methods for smooth and unconstrained problems. In this paper we extend this perspective to nonsmooth and constrained problems by obtaining differential inclusions associated to novel accelerated variants of the alternating direction method of multipliers (ADMM). Through a Lyapunov analysis, we derive rates of convergence for these dynamical systems in different settings that illustrate an interesting tradeoff between decaying versus constant damping strategies. We also obtain modified equations capturing fine-grained details of these methods, which have improved stability and preserve the leading order convergence rates. An extension to general nonlinear equality and inequality constraints in connection with singular perturbation theory is provided.  ( 2 min )
    Optimistic Dynamic Regret Bounds. (arXiv:2301.07530v1 [cs.LG])
    Online Learning (OL) algorithms were originally developed to guarantee good performance when comparing their output to the best fixed strategy. The question of performance with respect to dynamic strategies remains an active research topic. In this work, we develop dynamic adaptations of classical OL algorithms based on the use of experts' advice and the notion of optimism. We also propose a constructivist method to generate such advice and provide both theoretical and experimental guarantees for our procedures.  ( 2 min )
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG] CROSS LISTED)
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.  ( 2 min )
    Complexity Analysis of a Countable-armed Bandit Problem. (arXiv:2301.07243v1 [cs.LG])
    We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.  ( 2 min )
    Improving Federated Learning Personalization via Model Agnostic Meta Learning. (arXiv:1909.12488v2 [cs.LG] UPDATED)
    Federated Learning (FL) refers to learning a high quality global model based on decentralized data storage, without ever copying the raw data. A natural scenario arises with data created on mobile phones by the activity of their users. Given the typical data heterogeneity in such situations, it is natural to ask how the global model can be personalized for every such device individually. In this work, we point out that the setting of Model Agnostic Meta Learning (MAML), where one optimizes for a fast, gradient-based, few-shot adaptation to a heterogeneous distribution of tasks, has a number of similarities with the objective of personalization for FL. We present FL as a natural source of practical applications for MAML algorithms, and make the following observations. 1) The popular FL algorithm, Federated Averaging, can be interpreted as a meta learning algorithm. 2) Careful fine-tuning can yield a global model with higher accuracy, which is at the same time easier to personalize. However, solely optimizing for the global model accuracy yields a weaker personalization result. 3) A model trained using a standard datacenter optimization method is much harder to personalize, compared to one trained using Federated Averaging, supporting the first claim. These results raise new questions for FL, MAML, and broader ML research.  ( 2 min )
    What relations are reliably embeddable in Euclidean space? (arXiv:1903.05347v3 [cs.LG] UPDATED)
    We consider the problem of embedding a relation, represented as a directed graph, into Euclidean space. For three types of embeddings motivated by the recent literature on knowledge graphs, we obtain characterizations of which relations they are able to capture, as well as bounds on the minimal dimensionality and precision needed.  ( 2 min )
    Global Contrastive Batch Sampling via Optimization on Sample Permutations. (arXiv:2210.12874v3 [cs.LG] UPDATED)
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training but these approaches are inefficient as they increase epoch length proportional to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find GCBS improves state-of-the-art performance in sentence embedding and code-search tasks. Additionally, GCBS is easy to implement as it requires only a few additional lines of code, does not maintain external data structures such as nearest neighbor indices, is more computationally efficient than the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v2 [math.ST] UPDATED)
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Adversarial Robust Deep Reinforcement Learning Requires Redefining Robustness. (arXiv:2301.07487v1 [cs.LG])
    Learning from raw high dimensional data via interaction with a given environment has been effectively achieved through the utilization of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible worst-case policy dependent translations along high sensitivity directions (i.e. adversarial perturbations) raises concerns on the robustness of deep reinforcement learning policies. In our paper, we show that these high sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learnt via the state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold and our results can help to build robust and generalizable deep reinforcement learning policies.  ( 2 min )
    PENDANTSS: PEnalized Norm-ratios Disentangling Additive Noise, Trend and Sparse Spikes. (arXiv:2301.01514v1 [eess.SP] CROSS LISTED)
    Denoising, detrending, deconvolution: usual restoration tasks, traditionally decoupled. Coupled formulations entail complex ill-posed inverse problems. We propose PENDANTSS for joint trend removal and blind deconvolution of sparse peak-like signals. It blends a parsimonious prior with the hypothesis that smooth trend and noise can somewhat be separated by low-pass filtering. We combine the generalized quasi-norm ratio SOOT/SPOQ sparse penalties $\ell_p/\ell_q$ with the BEADS ternary assisted source separation algorithm. This results in a tool that is both convergent and efficient, with a novel Trust-Region block alternating variable metric forward-backward approach. It outperforms comparable methods when applied to typically peaked analytical chemistry signals. Reproducible code is provided.  ( 2 min )
    Using Topological Data Analysis to classify Encrypted Bits. (arXiv:2301.07393v1 [cs.CR])
    We present a way to apply topological data analysis for classifying encrypted bits into distinct classes. Persistent homology is applied to generate topological features of a point cloud obtained from sets of encryptions. We see that this machine learning pipeline is able to classify our data successfully where classical models of machine learning fail to perform the task. We also see that this pipeline works as a dimensionality reduction method, making this approach a realistic method for classifying the given encrypted bits.  ( 2 min )
    Physics-informed Information Field Theory for Modeling Physical Systems with Uncertainty Quantification. (arXiv:2301.07609v1 [stat.ML])
    Data-driven approaches coupled with physical knowledge are powerful techniques to model systems. The goal of such models is to efficiently solve for the underlying field by combining measurements with known physical laws. As many systems contain unknown elements, such as missing parameters, noisy data, or incomplete physical laws, this is widely approached as an uncertainty quantification problem. The common techniques to handle all the variables typically depend on the numerical scheme used to approximate the posterior, and it is desirable to have a method which is independent of any such discretization. Information field theory (IFT) provides the tools necessary to perform statistics over fields that are not necessarily Gaussian. We extend IFT to physics-informed IFT (PIFT) by encoding the functional priors with information about the physical laws which describe the field. The posteriors derived from this PIFT remain independent of any numerical scheme and can capture multiple modes, allowing for the solution of problems which are ill-posed. We demonstrate our approach through an analytical example involving the Klein-Gordon equation. We then develop a variant of stochastic gradient Langevin dynamics to draw samples from the joint posterior over the field and model parameters. We apply our method to numerical examples with various degrees of model-form error and to inverse problems involving nonlinear differential equations. As an addendum, the method is equipped with a metric which allows the posterior to automatically quantify model-form uncertainty. Because of this, our numerical experiments show that the method remains robust to even an incorrect representation of the physics given sufficient data. We numerically demonstrate that the method correctly identifies when the physics cannot be trusted, in which case it automatically treats learning the field as a regression problem.  ( 2 min )
    Dual-sPLS: a family of Dual Sparse Partial Least Squares regressions for feature selection and prediction with tunable sparsity; evaluation on simulated and near-infrared (NIR) data. (arXiv:2301.07206v1 [stat.ML])
    Relating a set of variables X to a response y is crucial in chemometrics. A quantitative prediction objective can be enriched by qualitative data interpretation, for instance by locating the most influential features. When high-dimensional problems arise, dimension reduction techniques can be used. Most notable are projections (e.g. Partial Least Squares or PLS) or variable selections (e.g. lasso). Sparse partial least squares combine both strategies, by blending variable selection into PLS. The variant presented in this paper, Dual-sPLS, generalizes the classical PLS1 algorithm. It provides balance between accurate prediction and efficient interpretation. It is based on penalizations inspired by classical regression methods (lasso, group lasso, least squares, ridge) and uses the dual norm notion. The resulting sparsity is enforced by an intuitive shrinking ratio parameter. Dual-sPLS favorably compares to similar regression methods, on simulated and real chemical data. Code is provided as an open-source package in R: https://CRAN.R-project.org/package=dual.spls.  ( 2 min )
    Discrete Latent Structure in Neural Networks. (arXiv:2301.07473v1 [cs.LG])
    Many types of data from fields including natural language processing, computer vision, and bioinformatics, are well represented by discrete, compositional structures such as trees, sequences, or matchings. Latent structure models are a powerful tool for learning to extract such representations, offering a way to incorporate structural bias, discover insight about the data, and interpret decisions. However, effective training is challenging, as neural networks are typically designed for continuous computation. This text explores three broad strategies for learning with discrete latent structure: continuous relaxation, surrogate gradients, and probabilistic estimation. Our presentation relies on consistent notations for a wide range of models. As such, we reveal many new connections between latent structure learning strategies, showing how most consist of the same small set of fundamental building blocks, but use them differently, leading to substantially different applicability and properties.  ( 2 min )
    Sample Complexity of Adversarially Robust Linear Classification on Separated Data. (arXiv:2012.10794v3 [cs.LG] UPDATED)
    We consider the sample complexity of learning with adversarial robustness. Most prior theoretical results for this problem have considered a setting where different classes in the data are close together or overlapping. Motivated by some real applications, we consider, in contrast, the well-separated case where there exists a classifier with perfect accuracy and robustness, and show that the sample complexity narrates an entirely different story. Specifically, for linear classifiers, we show a large class of well-separated distributions where the expected robust loss of any algorithm is at least $\Omega(\frac{d}{n})$, whereas the max margin algorithm has expected standard loss $O(\frac{1}{n})$. This shows a gap in the standard and robust losses that cannot be obtained via prior techniques. Additionally, we present an algorithm that, given an instance where the robustness radius is much smaller than the gap between the classes, gives a solution whose expected robust loss is $O(\frac{1}{n})$. This shows that for very well-separated data, convergence rates of $O(\frac{1}{n})$ are achievable, which is not the case otherwise. Our results apply to robustness measured in any $\ell_p$ norm with $p > 1$ (including $p = \infty$).  ( 2 min )
    Electronic excited states in deep variational Monte Carlo. (arXiv:2203.09472v3 [physics.chem-ph] UPDATED)
    Obtaining accurate ground and low-lying excited states of electronic systems is crucial in a multitude of important applications. One ab initio method for solving the Schrödinger equation that scales favorably for large systems is variational quantum Monte Carlo (QMC). The recently introduced deep QMC approach uses ansatzes represented by deep neural networks and generates nearly exact ground-state solutions for molecules containing up to a few dozen electrons, with the potential to scale to much larger systems where other highly accurate methods are not feasible. In this paper, we extend one such ansatz (PauliNet) to compute electronic excited states. We demonstrate our method on various small atoms and molecules and consistently achieve high accuracy for low-lying states. To highlight the method's potential, we compute the first excited state of the much larger benzene molecule, as well as the conical intersection of ethylene, with PauliNet matching results of more expensive high-level methods.  ( 2 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v5 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 2 min )
    Landscape Complexity for the Empirical Risk of Generalized Linear Models. (arXiv:1912.02143v5 [stat.ML] UPDATED)
    We present a method to obtain the average and the typical value of the number of critical points of the empirical risk landscape for generalized linear estimation problems and variants. This represents a substantial extension of previous applications of the Kac-Rice method since it allows the analysis of critical points of high dimensional non-Gaussian random functions. Under a technical hypothesis, we obtain a rigorous explicit variational formula for the annealed complexity, which is the logarithm of the average number of critical points at fixed value of the empirical risk. This result is simplified, and extended, using the non-rigorous Kac-Rice replicated method from theoretical physics. In this way we find an explicit variational formula for the quenched complexity, which is generally different from its annealed counterpart, and allows one to obtain the number of critical points for typical instances up to exponential accuracy.  ( 2 min )
    Strong inductive biases provably prevent harmless interpolation. (arXiv:2301.07605v1 [stat.ML])
    Classical wisdom suggests that estimators should avoid fitting noise to achieve good generalization. In contrast, modern overparameterized models can yield small test error despite interpolating noise -- a phenomenon often called "benign overfitting" or "harmless interpolation". This paper argues that the degree to which interpolation is harmless hinges upon the strength of an estimator's inductive bias, i.e., how heavily the estimator favors solutions with a certain structure: while strong inductive biases prevent harmless interpolation, weak inductive biases can even require fitting noise to generalize well. Our main theoretical result establishes tight non-asymptotic bounds for high-dimensional kernel regression that reflect this phenomenon for convolutional kernels, where the filter size regulates the strength of the inductive bias. We further provide empirical evidence of the same behavior for deep neural networks with varying filter sizes and rotational invariance.  ( 2 min )
    Non-IID Quantum Federated Learning with One-shot Communication Complexity. (arXiv:2209.00768v2 [quant-ph] UPDATED)
    Federated learning refers to the task of machine learning based on decentralized data from multiple clients with secured data privacy. Recent studies show that quantum algorithms can be exploited to boost its performance. However, when the clients' data are not independent and identically distributed (IID), the performance of conventional federated algorithms is known to deteriorate. In this work, we explore the non-IID issue in quantum federated learning with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into local channels trained by each client with the help of local density estimators. This observation leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. Numerical simulations show that the proposed algorithm outperforms the conventional ones significantly under non-IID settings.  ( 2 min )
    An Analysis of Loss Functions for Binary Classification and Regression. (arXiv:2301.07638v1 [stat.ML])
    This paper explores connections between margin-based loss functions and consistency in binary classification and regression applications. It is shown that a large class of margin-based loss functions for binary classification/regression result in estimating scores equivalent to log-likelihood scores weighted by an even function. A simple characterization for conformable (consistent) loss functions is given, which allows for straightforward comparison of different losses, including exponential loss, logistic loss, and others. The characterization is used to construct a new Huber-type loss function for the logistic model. A simple relation between the margin and standardized logistic regression residuals is derived, demonstrating that all margin-based loss can be viewed as loss functions of squared standardized logistic regression residuals. The relation provides new, straightforward interpretations for exponential and logistic loss, and aids in understanding why exponential loss is sensitive to outliers. In particular, it is shown that minimizing empirical exponential loss is equivalent to minimizing the sum of squared standardized logistic regression residuals. The relation also provides new insight into the AdaBoost algorithm.  ( 2 min )
    Large Deviations for Classification Performance Analysis of Machine Learning Systems. (arXiv:2301.07104v1 [cs.LG])
    We study the performance of machine learning binary classification techniques in terms of error probabilities. The statistical test is based on the Data-Driven Decision Function (D3F), learned in the training phase, i.e., what is thresholded before the final binary decision is made. Based on large deviations theory, we show that under appropriate conditions the classification error probabilities vanish exponentially, as $\sim \exp\left(-n\,I + o(n) \right)$, where $I$ is the error rate and $n$ is the number of observations available for testing. We also propose two different approximations for the error probability curves, one based on a refined asymptotic formula (often referred to as exact asymptotics), and another one based on the central limit theorem. The theoretical findings are finally tested using the popular MNIST dataset.  ( 2 min )

  • Open

    [D] ICLR 2023 results.
    Hi, making a post for anything to be discussed related to ICLR 2023 results. One question I had: Is the exact time of result announcement fixed? submitted by /u/East-Beginning9987 [link] [comments]  ( 41 min )
    [Discussion] I'm Getting 50FPS With 4 Billion Parameters, Is That Good? - Compute Shader Implementation
    So I like to DIY a lot. I coded a neural network to run in parallel in a compute shader with TanH activation, and the performance was much better than I expected. I tested with a 3090 with many layers of 20,000 neurons until I reached 4 billion total parameters, which ran at around 50FPS when looped every frame. Is this above-average performance for a GPU implementation? I haven't really tested out any other GPU implementations, so I was wondering if anyone here knows. submitted by /u/TheRPGGamerMan [link] [comments]  ( 42 min )
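    A rough way to judge the 50FPS figure is to turn it into arithmetic and memory throughput. The sketch below assumes fully connected layers, FP32 weights, and one input vector per frame (batch size 1); none of these are stated in the post, and `dense_network_cost` is a hypothetical helper written for this estimate.

```python
# Back-of-envelope throughput estimate for the setup described in the post.
# Assumptions (NOT from the post): dense layers, FP32 weights, batch size 1.

def dense_network_cost(total_params: float, fps: float, bytes_per_weight: int = 4):
    """Return (GFLOP/s, GB/s of weight traffic) for a dense network."""
    flops_per_frame = 2.0 * total_params               # one multiply and one add per weight
    bytes_per_frame = bytes_per_weight * total_params  # every weight is read once per frame
    return flops_per_frame * fps / 1e9, bytes_per_frame * fps / 1e9

# A 20,000 -> 20,000 dense layer holds 20,000 * 20,000 weights.
layers = 4e9 / (20_000 * 20_000)
gflops, gbytes = dense_network_cost(total_params=4e9, fps=50)

print(f"layers ~ {layers:.0f}, compute ~ {gflops:.0f} GFLOP/s, weight reads ~ {gbytes:.0f} GB/s")
```

    Under these assumptions the run reads about 800 GB/s of weights, which is close to the roughly 936 GB/s memory bandwidth NVIDIA publishes for the RTX 3090, while the 400 GFLOP/s of arithmetic is a small fraction of the card's peak. If the assumptions hold, the implementation is probably limited by memory bandwidth rather than compute, and larger batches would be the usual way to raise arithmetic throughput.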
    I'm working on a project and need a open source chatbot that I can run locally and train to talk like a specific character, does anyone know one? [P]
    Title, I am trying to get a chatbot to act like Megumin, similar to Character AI, but open source and can be run on a local machine. Thank you! submitted by /u/otakuhacker123 [link] [comments]  ( 42 min )
    [D][P] Best Speech-to-Text Model for domain-specific data - Open Source vs. Paid Services
    I originally posted this here on r/learnmachinelearning but am reposting here as it may be a more appropriate subreddit and/or may have a different perspective. I want a tool to programmatically generate transcripts from sermons. I have access to hundreds of sermon transcripts (and 100x more very similar in-domain data) but fewer than 40 transcripts with audio (~30 hours). I want the lowest WER (Word Error Rate) possible and can budget 100 hours for this project in 2023. (1) Train my own acoustic model with the best open source offering. OpenAI's Whisper seems to be the best available today? How much supervised data (e.g. hours of sermons with perfect transcripts) would I need to develop a model that would be more accurate than Google/AWS for my specific domain? Can I take a model already trained and "tune" it by augmenting the data I have? (2) Use the best cloud speech-to-text API that I can tune with in-domain data. AWS Transcribe and Google Speech to Text seem to be big players. I've gone with AWS Transcribe since it can be tuned more easily with custom domain data (just upload text files) than Google's (which requires building phrase dictionaries with weights). Is there anything out there that's better for my use case? submitted by /u/Knecht_Christi [link] [comments]  ( 43 min )
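    Since the post compares options by WER, a small reference implementation helps make the comparison concrete. WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal version; the lowercasing and whitespace tokenization are simplifying assumptions, and real evaluations usually also normalize punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()   # assumption: whitespace tokenization
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of five reference words.
print(word_error_rate("the lord is my shepherd", "the lord was my shepherd"))
```

    Running the same function over the ~30 hours of paired audio and transcripts, once per candidate system, would give a like-for-like comparison between a tuned cloud API and a fine-tuned open source model.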
    [D] Pre-trained Models for Domain-Specific (i.e. Stylistic) Feature Extraction
    Most or all of the style transfer models are used to extract the domain-independent (robust) feature from an artpiece so as to apply it to different styles. But I need the opposite: I need a pretrained model that can extract the domain-specific (i.e. stylistic) feature from an artpiece. Are there any publicly available ones I can use? It doesn't matter whether it's a Github repository, a Huggingface API, or something else. Thank you! submitted by /u/No_Zookeepergame8794 [link] [comments]  ( 42 min )
    [P] Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 42 min )
    [D] is it time to investigate retrieval language models?
    With ChatGPT going mainstream and the general push to make products out of LMs, a problem remains: the cost of running such models. To me, it seems counterproductive to put both language modelling and knowledge inside the model weights. Is it time to shift to retrieval LMs like Retro to keep the cost down while offering the same products? It would possibly allow Google or others to offer a free assistant service, using embedding similarity search to retrieve results from the Internet, so the model itself could possibly even run on edge devices? What are your thoughts on that subject? submitted by /u/hapliniste [link] [comments]  ( 46 min )
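    The embedding similarity search mentioned in the post can be sketched in a few lines. Note that this shows the simpler retrieve-then-prompt pattern, not Retro's chunked cross-attention mechanism, and that the three-dimensional vectors are hand-written stand-ins for the output of a real embedding model (an assumption made only to keep the example self-contained).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, k=2):
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy index of (passage, embedding) pairs; the embeddings are made up.
index = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("Mitochondria produce most of a cell's ATP.", [0.0, 0.2, 0.9]),
    ("France shares a border with Spain.", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in embedding for "What is the capital of France?"

top = retrieve(query_vec, index, k=2)
# The retrieved text is placed in the prompt, so the knowledge lives outside the weights.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: What is the capital of France?"
print(prompt)
```

    The design point matches the post's argument: the index can be updated or scaled independently of the model, so a smaller language model only has to read and compose the retrieved passages.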
    [P] Code super clean multi-modal PyTorch models and easily serve them through FastAPI, using DocArray
    Hi all! I'd like to share an open source project that I am currently working on together with a few colleagues: DocArray! If you've ever trained models that deal with different data types (images, text, video, audio, ...) then you know how much of a hassle it can be to keep track of all of your tensors, what shapes they have, and what data they are meant to represent. That's what we're trying to change with DocArray, a Python library for representing, sending, and storing multi-modal data! The core idea of DocArray is that you define Documents that represent your data. For example, one Document could hold the file path to an image, its image tensor, and an image embedding that your model creates. A different Document could do the same thing for some Text, and a third Document might co…  ( 44 min )
    [P] Tired of generating synthetic corgis❓🐶 Check out Synthcity, a framework for synthetic tabular data
    🌟 Synthcity is a library for generating and benchmarking synthetic tabular data. https://github.com/vanderschaarlab/synthcity 🚀 Synthcity includes a wide range of algorithms for various use cases, such as: - tabular data (CTGAN, TVAE, Bayesian Networks, etc.) - survival analysis (SurvivalGAN, etc.) - time series (Fourier Flows, TimeGAN, etc.) - privacy-focused (DP-GAN, PATEGAN, AdsGAN, DECAF) - domain adaptation (RadialGAN). 🔍 Synthcity supports benchmarking multiple algorithms, testing data quality, downstream performance, statistical fidelity, and privacy metrics. 🌀 Give it a try: - Library: https://github.com/vanderschaarlab/synthcity - Tutorial: https://colab.research.google.com/drive/1Vr2PJswgfFYBkJCm3hhVkuH-9dXnHeYV?usp=sharing - Docs: https://synthcity.readthedocs.io/ submitted by /u/ManagementBig2995 [link] [comments]  ( 42 min )
    [P] Need some recs on an NLP project
    Hello, for my job, I have to extract job responsibilities from job ads. I'm thinking of approaching it as a span extraction problem. Where I'm gonna label the job responsibility span manually for around 1000 samples. And use supervised learning. Is there any better way to approach this problem? Is there any pretrained model I can use to fine tune? Any suggestion will be appreciated. Thanks! submitted by /u/Salekeen01 [link] [comments]  ( 43 min )
    [R] Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the Video the explanations are better than me giving you 3-5 Pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/is3pyl1p70da1.jpg?width=1424&format=pjpg&auto=webp&s=8d102af4202711be7e01619577f109b6598d6ed5 submitted by /u/Singularian2501 [link] [comments]  ( 44 min )
    [D] Question about using diffusion to denoise images
    Hi all, I am trying to see if I can use DDPM (Denoising Diffusion Probabilistic Model) to denoise images using a supervised learning approach. However, I've learned that DDPM is only for unconditional image generation. Has anyone had experience using conditional DDPM and could help me out with some conceptual questions? Here's what I'm trying to understand: Say I have a pair of noisy and clean ground truth images. Should I take my clean image and gradually corrupt it by adding gaussian noise in the forward diffusion (FD) process? Could I get the network to learn the reverse diffusion process by giving it the noisy input, the FD noisy image, and positional embeddings? I was planning on concatenating the noisy input with the FD noisy image. During training, the network learns to predict noise at t-1 given the image at t conditioned on the input noisy source image. Here is an image showing you what I mean. Any thoughts or suggestions would be greatly appreciated. DDPM for image denoising submitted by /u/CurrentlyJoblessFML [link] [comments]  ( 44 min )
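    The training setup asked about above can be sketched as follows. In standard DDPM training, the clean image is corrupted in closed form to a random timestep t, and the network is trained to predict the Gaussian noise that was added; concatenating the noisy source image to the network input is one common way to condition it. The sketch uses flat Python lists in place of image tensors, a linear beta schedule with commonly used default values, and hypothetical helper names; all of these are assumptions, and the network itself is omitted.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, increasing linearly."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in one step; return (x_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def make_training_example(clean, noisy_source, T, alpha_bars, rng):
    """Build (network input, timestep, regression target) for one training step."""
    t = rng.randrange(T)
    xt, eps = forward_diffuse(clean, t, alpha_bars, rng)
    model_input = xt + noisy_source  # stands in for channel-wise concatenation
    return model_input, t, eps       # the network would be trained to output eps

T = 1000
alpha_bars = cumulative_alphas(linear_beta_schedule(T))
rng = random.Random(0)
clean = [0.5, -0.25, 1.0]   # toy "clean ground truth image"
noisy = [0.6, -0.10, 0.9]   # toy "noisy source image" used as the condition
model_input, t, target = make_training_example(clean, noisy, T, alpha_bars, rng)
print(len(model_input), t)
```

    At inference time the same conditioning is kept: sampling starts from pure Gaussian noise, and the noisy source image is concatenated to the current sample at every reverse step. The timestep t is usually passed to the network through a learned or sinusoidal embedding.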
    [D] What is the name of this NLP technique?
    Let's say I have a dataset of real estate listings. I have a column of text that describes the listing, and another column that shows the number of rooms, for example. In most cases, the number of rooms is shown in both columns, in the description text and also in the dedicated column. But for some observations, the number of rooms is in the description text but not in the column "number of rooms". So I have missing data. I could try to fill the missing data by applying regex to the description text, but the number of possibilities seems too big. Is there a machine learning technique in NLP that allows me to do that, since in most of the observations the data is present in both columns, so it is "naturally labelled"? If there is, what is the name of these techniques? I would like to search for it but I don't know the proper keywords to google. submitted by /u/Kebet-Mendez [link] [comments]  ( 43 min )
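    The setting described here is usually searched for under "information extraction" (or "slot filling"), and using rows where the answer is already known as free training labels is often called weak or distant supervision. The sketch below shows how the naturally labelled rows can at least measure how far a regex baseline gets before a learned extractor is needed. The listings, the pattern, and the helper names are all made up for illustration.

```python
import re

# Hypothetical baseline pattern: a number followed by "room(s)" or "bedroom(s)".
ROOM_PATTERN = re.compile(r"(\d+)\s*(?:-\s*)?(?:bed)?rooms?\b", re.IGNORECASE)

def extract_rooms(description):
    """Return the first room count found in the text, or None."""
    match = ROOM_PATTERN.search(description)
    return int(match.group(1)) if match else None

listings = [  # made-up rows; "rooms" is None where the column is missing
    {"description": "Bright 3 room flat near the park", "rooms": 3},
    {"description": "Spacious 4-bedroom house with garden", "rooms": 4},
    {"description": "Two rooms, recently renovated", "rooms": 2},
    {"description": "Quiet 2 bedroom apartment", "rooms": None},
]

# 1) Rows with both columns filled act as free labels for evaluating the extractor.
labelled = [row for row in listings if row["rooms"] is not None]
correct = sum(extract_rooms(row["description"]) == row["rooms"] for row in labelled)
accuracy = correct / len(labelled)

# 2) Fill the missing values with the extractor's output.
filled = [
    row["rooms"] if row["rooms"] is not None else extract_rooms(row["description"])
    for row in listings
]
print(accuracy, filled)
```

    The "Two rooms" row is the kind of case a regex misses. The same labelled rows could instead be used to fine-tune a token classification or extractive question answering model, which is the machine learning route the question is asking about.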
    [D] Inner workings of the chatgpt memory
    All the examples from langchain and on huggingface create memory by pasting the entire history in every prompt. This seems to violate the max input prompt length pretty quickly. And it’s expensive. Does chatgpt use something revolutionary? It forgets everything when you create a new session so it ‘feels’ it’s using the convo as memory as well. But then the question; how do they get past prompt limits? Chunking doesn’t help as it still doesn’t get context in that case between prompts. Maybe they ask the same question with different chunks many times and then ask for a final result? Apologies if this was answered somewhere, I cannot find it at all and all examples use the same kind of history memory. submitted by /u/terserterseness [link] [comments]  ( 45 min )
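    One common way to keep a pasted conversation history within the prompt limit is a sliding window: keep only the most recent turns that fit a token budget and drop the oldest ones. The sketch below illustrates that general pattern and makes no claim about how ChatGPT works internally. Counting tokens by splitting on whitespace is a simplifying assumption (a real system would use the model's tokenizer), and `build_prompt` is a hypothetical helper.

```python
def count_tokens(text):
    """Crude token count; assumption: one token per whitespace-separated word."""
    return len(text.split())

def build_prompt(history, new_message, max_tokens):
    """Keep the newest turns that fit in the budget, oldest dropped first."""
    budget = max_tokens - count_tokens(new_message)
    kept = []
    for turn in reversed(history):   # walk from newest to oldest
        cost = count_tokens(turn)
        if cost > budget:
            break                    # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()                   # restore chronological order
    return "\n".join(kept + [new_message])

history = [
    "User: hi there",
    "Bot: hello how can I help",
    "User: tell me about diffusion models",
]
prompt = build_prompt(history, "User: and GANs?", max_tokens=12)
print(prompt)
```

    A frequent refinement is to replace the dropped turns with a short model-written summary placed at the top of the prompt, which keeps some long-range context at a fixed token cost.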
    [D] very short video generation
    Hi, my indie game devs asked me if I could build a model that generates cool movements for their characters. 1) I wanted to start by generating characters and scene. Should I go for stable diffusion or for a GAN ? I do need a prompt 2) do you know any model that can generate short video clips (2 seconds) and that could potentially generate character movement ? Thank you so much ! submitted by /u/Frizzoux [link] [comments]  ( 42 min )
    [D] ML Researchers/Engineers in Industry: Why don't companies use open source models more often?
    In my experience at big tech, I've never seen any company use open-source ML models in production. Why is this the case? Curious, because there seems to be some insanely cool research going on these days. On the other hand, if you have seen this used, what kind of repos have you guys seen? submitted by /u/tennismlandguitar [link] [comments]  ( 46 min )
    [Discussion] Storing hundreds of ML models - what do you use?
    I am currently using the Google Cloud Model Registry and I want to learn what you use for archiving your machine learning models. What are the other options for developers who have to store hundreds of models? https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-model-registry submitted by /u/May-is-spring [link] [comments]  ( 42 min )
    [D] Best LLM for Question/Answering with personality?
    Hi, I am looking at fine-tuning an open source LLM to answer questions as a specific character from chat on Discord. I'm trying to decide which one to test between KoboldAI, GPTJ, Neo, Flan-T5, etc. Has anyone tried these LLMs and knows which would be best for this use case? The use case is to have a character that answers questions from Discord chat in the form of a specific character, with a personality, and that can be mean, for example, very similar to https://beta.character.ai/ Does anyone want to guess what they use based on their experience, or the closest LLM to replicate it? If this helps, sometimes the model will go off on tangential stories, which makes me think maybe it was trained on a novel or story dataset originally. submitted by /u/TernaryJimbo [link] [comments]  ( 42 min )
  • Open

    YouTube Video scripted entirely by ChatGPT???
    submitted by /u/McFIyyy [link] [comments]  ( 40 min )
    Inventory Management?
    I read a study that showed AI is being deployed into inventory management at a whopping 44% of total usage, though I can't quantify this, I am interested to know how to combine AI with my ERP, static data, a database etc. Does anyone here have experience with AI inventory management for MoQ, forecasting etc that they recommend? I want to do some homework submitted by /u/smudgepost [link] [comments]  ( 40 min )
    WhatsApp ChatGPT
    Hello, I want to ask you guys about this AI WhatsApp tool: https://chattycat.ju.mp/ Is it safe? I saw it in a tweet and was wondering whether it's the same as ChatGPT and whether it's safe to give it your name, number, and email. Original tweet: https://twitter.com/TansuYegen/status/1616138232894492672 submitted by /u/Jnxe [link] [comments]  ( 40 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 41 min )
    The Next Step (I believe)
    We have already seen the capabilities of AI sourcing from the Internet to create, learn, and exceed our expectations. I want to be quick with my description so I don't lose anybody. The next step, I believe, will be AI catered to us individually. What I mean is an app that only cares for one person, that being you. It will comb over (with permission, hopefully) everything you have ever put online. It will also track your writing, grammar, and interests (internet of things information). Then, if you allow it, it can go over your medical or financial history, even psychology. It will feel like a chatbot but be a middleman towards seeking healthcare, mental care, and many other things, including niche interests and possible career routes. I don't propose this to invite any fear or anxiety of a sci-fi narrative, but to objectively observe the trends with AI and my belief about where it is going. submitted by /u/DropDeddBlue [link] [comments]  ( 41 min )
    2 months to make ai video last summer: “The technology was evolving so fast that I worried my video would feel outdated by the time it would be ready.”
    submitted by /u/defensiveFruit [link] [comments]  ( 40 min )
    Announcing a major update to Perplexity Ask: the world’s first conversational search engine! Now, you can read answers with up-to-date sources and ask follow-up questions to dig deeper. In other words, you can chat with your search engine!
    submitted by /u/rafs2006 [link] [comments]  ( 40 min )
    is the data we have today enough to create AGI?
    let's say hypothetically we were only able to work with digital data we have collected up until today to try and create AGI, would it be possible? submitted by /u/Science_is_Greatness [link] [comments]  ( 42 min )
    Labeling tools are great, but what about quality checks?
    Modern datasets contain hundreds of thousands to millions of labels that must be kept accurate. In practice, some errors in the dataset average out and can be ignored – systematic biases transfer to the model. After quick initial wins in areas where abundant data is readily available, deep learning needs to become more data efficient to help solve difficult business problems. MLfix is a new open-source tool that combines novel unsupervised machine-learning pipelines with a new user interface concept that, together, help annotators and machine-learning engineers identify and filter out label errors. https://www.collabora.com/news-and-blog/blog/2023/01/17/labeling-tools-are-great-but-what-about-quality-checks/ submitted by /u/mfilion [link] [comments]  ( 40 min )
    Google Research And DeepMind Create AI Medical Chatbot That Can Generate Safe And Helpful Answers!
    submitted by /u/liquidocelotYT [link] [comments]  ( 40 min )
    Software to read "Once upon a time in Hollywood" book in character voices from film.
    I want an audiobook version of the book "Once upon a time in Hollywood" by Tarantino with the actors voices from the movie, using the voices from the movie as sample. How do I do that? submitted by /u/Art3mis_eros [link] [comments]  ( 40 min )
    Professor Initiates Integration of ChatGPT in Classroom
    submitted by /u/lambolifeofficial [link] [comments]  ( 40 min )
    Neural Network 'Hallucinating' While Training On Dog Images
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 41 min )
    Wi-Fi Could Help Identify When You’re Struggling to Breathe
    submitted by /u/goronmask [link] [comments]  ( 40 min )
    Summarizing Text using In-database NLP through the Integration of Hugging Face with MindsDB
    submitted by /u/Klutzy_Accountant113 [link] [comments]  ( 40 min )
    Farmers Spend $5Billion a Year on Antibiotics For Their Animals using a Blanket Approach. Medicate all to Prevent Infection. AI Models are Being Used to Identify & Medicate Only Animals That are Actually Sick
    submitted by /u/HODLTID [link] [comments]  ( 40 min )
    I got frustrated with the time and effort required to code and maintain custom web scrapers, so I built an LLM-powered tool that can comprehend any website structure and extract the desired data in the preferred format.
    submitted by /u/madredditscientist [link] [comments]  ( 41 min )
    Generative AI Technology Is Discovering Completely New Drugs
    submitted by /u/bukowski3000 [link] [comments]  ( 40 min )
    Join us this Friday 6 pm EST for a fascinating discussion about the societal impact of large language models (LLMs) like ChatGPT
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 40 min )
    Exclusive: The $2 Per Hour Workers Who Made ChatGPT Safer
    submitted by /u/ohmsalad [link] [comments]  ( 40 min )
    Hey guys! I made some of the cartoon characters to look like villains. Can you guess which cartoon they are from?
    submitted by /u/_aimnftri [link] [comments]  ( 40 min )
    Made with AI
    submitted by /u/NorthTs [link] [comments]  ( 40 min )
  • Open

    Probability problem with Pratt prime proofs
    In the process of creating a Pratt certificate to prove that a number n is prime, you have to find a number a that seems kinda arbitrary. As we discussed here, a number n is prime if there exists a number a such that a^(n-1) = 1 mod n and a^((n-1)/p) ≠ 1 mod n […] Probability problem with Pratt prime proofs first appeared on John D. Cook.  ( 5 min )
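    The condition quoted above can be checked directly for small n. The sketch below is only an illustration, not code from the post: the helper names (`prime_factors`, `is_pratt_witness`, `witness_fraction`) are hypothetical, and it assumes the elided part of the excerpt quantifies over every prime p dividing n - 1, which is the usual form of this test. For a prime n the fraction of working values of a equals φ(n-1)/(n-1), since the witnesses are exactly the primitive roots mod n.

```python
def prime_factors(m):
    """Return the set of prime factors of m by trial division (fine for small m)."""
    factors = set()
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def is_pratt_witness(a, n):
    """True if a satisfies a^(n-1) = 1 mod n and a^((n-1)/p) != 1 mod n
    for every prime p dividing n - 1 (assumed form of the full condition)."""
    if pow(a, n - 1, n) != 1:
        return False
    return all(pow(a, (n - 1) // p, n) != 1 for p in prime_factors(n - 1))


def witness_fraction(n):
    """Fraction of candidates a in 1..n-1 that would serve as a witness for n."""
    hits = sum(is_pratt_witness(a, n) for a in range(1, n))
    return hits / (n - 1)


if __name__ == "__main__":
    for n in (7, 9, 101):
        print(n, witness_fraction(n))
```

    A composite n such as 9 has no witness at all, while for a prime the chance that a randomly chosen a works depends on the factorization of n - 1.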
    Factoring b^n + 1
    The previous post illustrated a technique for finding factors of numbers of the form b^n – 1. This post will look at an analogous, though slightly less general, technique for numbers of the form b^n + 1. There is a theorem that says that if m divides n then b^m + 1 divides b^n + […] Factoring b^n + 1 first appeared on John D. Cook.  ( 5 min )
    Factoring b^n – 1
    Suppose you want to factor a number of the form b^n – 1. There is a theorem that says that if m divides n then b^m – 1 divides b^n – 1. Let’s use this theorem to try to factor J = 2^2023 – 1, a 609-digit number. Factoring such a large number would be more difficult if it didn’t have […] Factoring b^n – 1 first appeared on John D. Cook.  ( 6 min )
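    The divisibility theorem in this excerpt is easy to verify with Python's arbitrary-precision integers. This is a minimal sketch rather than the post's own code; the helper `divisors` is a hypothetical name introduced here. It confirms the digit count of J and that b^m – 1 divides J for every divisor m of 2023 = 7 × 17 × 17.

```python
def divisors(n):
    """All positive divisors of n, in increasing order (brute force)."""
    return [d for d in range(1, n + 1) if n % d == 0]


b, n = 2, 2023
J = b**n - 1

# Number of decimal digits of J.
digit_count = len(str(J))

# For each m dividing n, b^m - 1 should divide b^n - 1.
algebraic_factors = [b**m - 1 for m in divisors(n)]
all_divide = all(J % f == 0 for f in algebraic_factors)

if __name__ == "__main__":
    print(digit_count)
    print(divisors(n))
    print(all_divide)
```

    The factors found this way are not necessarily prime (2^119 – 1, for example, is itself divisible by 2^7 – 1), so they are a starting point for a full factorization rather than the end of it.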
  • Open

    AI’s Leg Up: Startup Accelerates Robotics Simulation for $8 Trillion Food Market
    Robots are finally getting a grip. Developers have been striving to close the gap on robotic gripping for the past several years, pursuing applications for multibillion-dollar industries. Securely gripping and transferring fast-moving items on conveyor belts holds vast promise for businesses. Soft Robotics, a Bedford, Mass., startup, is harnessing NVIDIA Isaac Sim to help close Read article >  ( 6 min )
    The Ultimate Upgrade: GeForce RTX 4080 SuperPOD Rollout Begins Today
    The Ultimate upgrade begins today: GeForce NOW RTX 4080 SuperPODs are now rolling out, bringing a new level of high-performance gaming to the cloud. Ultimate members will start to see RTX 4080 performance in their region soon, and experience titles like  Warhammer 40,000: Darktide, Cyberpunk 2077, The Witcher 3: Wild Hunt and more at ultimate Read article >  ( 5 min )
  • Open

    Human-Timescale Adaptation in an Open-Ended Task Space - (AdA) - DeepMind 2023 - Can adapt to open-ended novel embodied 3D problems as quickly as humans!
    Paper: https://arxiv.org/abs/2301.07608 Youtube: https://www.youtube.com/watch?v=U93bUQ1roiw Please watch the video; its explanations are better than me giving you 3-5 pictures! Abstract: Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. https://preview.redd.it/rcta0vvt80da1.jpg?width=1424&format=pjpg&auto=webp&s=da7cc5745a21969b1687b7cf2a8c590dcac72ae0 submitted by /u/Singularian2501 [link] [comments]  ( 41 min )
    On the legal status of downloading and using ATARI 2600 ROMs
    Hi there, I have searched online quite broadly, and could not find an answer anywhere to the following questions: Is it legal to download ATARI ROMs, e.g., the ones in https://github.com/Farama-Foundation/AutoROM? Is it legal to use those ROMs? (Imagine I acquired them in a different way than downloading, like physical shipping on a USB drive.) If both are illegal, what's the legal situation around using them for reinforcement learning research? What I know from reading: Downloading ROMs is a gray area. It is supposedly illegal, but if the copyright holder knows about a use of their ROMs and doesn't do anything, there exists an "implied license" allowing their use. Source: here. Providing a means to download these ROMs is not illegal; it is allowed. Source: here. Does anyone have more info or legal experience with this? Thanks submitted by /u/Conscious_Heron_9133 [link] [comments]  ( 42 min )
    trying to make a single actuated link stay upright with dqn. The maximum score you can get from the game is 20,000, and the highest score I am getting is 200.
    submitted by /u/blackgentrifier [link] [comments]  ( 40 min )
  • Open

    Equivariant Networks for Crystal Structures. (arXiv:2211.15420v2 [cond-mat.mtrl-sci] UPDATED)
    Supervised learning with deep models has tremendous potential for applications in materials science. Recently, graph neural networks have been used in this context, drawing direct inspiration from models for molecules. However, materials are typically much more structured than molecules, which is a feature that these models do not leverage. In this work, we introduce a class of models that are equivariant with respect to crystalline symmetry groups. We do this by defining a generalization of the message passing operations that can be used with more general permutation groups, or that can alternatively be seen as defining an expressive convolution operation on the crystal graph. Empirically, these models achieve competitive results with state-of-the-art on property prediction tasks.
    Multi-Task Imitation Learning for Linear Dynamical Systems. (arXiv:2212.00186v2 [cs.LG] UPDATED)
    We study representation learning for efficient imitation learning over linear systems. In particular, we consider a setting where learning is split into two phases: (a) a pre-training step where a shared $k$-dimensional representation is learned from $H$ source policies, and (b) a target policy fine-tuning step where the learned representation is used to parameterize the policy class. We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left( \frac{k n_x}{HN_{\mathrm{shared}}} + \frac{k n_u}{N_{\mathrm{target}}}\right)$, where $n_x > k$ is the state dimension, $n_u$ is the input dimension, $N_{\mathrm{shared}}$ denotes the total amount of data collected for each policy during representation learning, and $N_{\mathrm{target}}$ is the amount of target task data. This result formalizes the intuition that aggregating data across related tasks to learn a representation can significantly improve the sample efficiency of learning a target task. The trends suggested by this bound are corroborated in simulation.
    FedALA: Adaptive Local Aggregation for Personalized Federated Learning. (arXiv:2212.01197v2 [cs.LG] UPDATED)
    A key challenge in federated learning (FL) is the statistical heterogeneity that impairs the generalization of the global model on each client. To address this, we propose a method Federated learning with Adaptive Local Aggregation (FedALA) by capturing the desired information in the global model for client models in personalized FL. The key component of FedALA is an Adaptive Local Aggregation (ALA) module, which can adaptively aggregate the downloaded global model and local model towards the local objective on each client to initialize the local model before training in each iteration. To evaluate the effectiveness of FedALA, we conduct extensive experiments with five benchmark datasets in computer vision and natural language processing domains. FedALA outperforms eleven state-of-the-art baselines by up to 3.27% in test accuracy. Furthermore, we also apply ALA module to other federated learning methods and achieve up to 24.19% improvement in test accuracy.
    FeSAC: Federated Learning-Based Soft Actor-Critic Traffic Offloading in Space-Air-Ground Integrated Network. (arXiv:2212.02075v2 [cs.NI] UPDATED)
    With the increase of intelligent devices leading to increasing demand for traffic, traffic offloading has become a challenging problem. The space-air-ground integrated network (SAGIN) is a superior network architecture to solve this problem. The existing research on SAGIN traffic offloading only considers the single-layer satellite network in the space network. To further expand the resource pool of traffic offloading in SAGIN, we extend the single-layer satellite network into a double-layer satellite network composed of low-orbit satellites (LEO) and high-orbit satellites (GEO), and re-model a four-layer SAGIN architecture consisting of the ground network, the air network, LEO and GEO. Furthermore, we propose a novel Federated Soft Actor-Critic (FeSAC) traffic offloading method with positive environmental exploration to accommodate this dynamic and complex four-layer SAGIN architecture. The FeSAC method uses federated learning to train traffic offloading nodes and then aggregates the training results to obtain the best traffic offloading strategy. The simulation results show that under the four-layer SAGIN, our proposed method can better adapt to network environment changes caused by node mobility and is better than the existing traffic offloading methods in throughput, packet loss, and transmission delay.
    Beyond ADMM: A Unified Client-variance-reduced Adaptive Federated Learning Framework. (arXiv:2212.01519v2 [cs.LG] UPDATED)
    As a novel distributed learning paradigm, federated learning (FL) faces serious challenges in dealing with massive clients with heterogeneous data distribution and computation and communication resources. Various client-variance-reduction schemes and client sampling strategies have been respectively introduced to improve the robustness of FL. Among others, primal-dual algorithms such as the alternating direction method of multipliers (ADMM) have been found to be resilient to data distribution and to outperform most of the primal-only FL algorithms. However, the reason behind this remains a mystery. In this paper, we first reveal the fact that the federated ADMM is essentially a client-variance-reduced algorithm. While this explains the inherent robustness of federated ADMM, the vanilla version of it lacks the ability to be adaptive to the degree of client heterogeneity. Besides, the global model at the server under client sampling is biased, which slows down the practical convergence. To go beyond ADMM, we propose a novel primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and biasness of the global model. In addition, FedVRA unifies several representative FL algorithms in the sense that they are either special instances of FedVRA or are close to it. Extensions of FedVRA to semi/un-supervised learning are also presented. Experiments based on (semi-)supervised image classification tasks demonstrate superiority of FedVRA over the existing schemes in learning scenarios with massive heterogeneous clients and client sampling.
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare in a geometrically faithful way point clouds and more generally probability distributions. The wide adoption of OT into existing data analysis and machine learning pipelines is however plagued by several shortcomings. This includes its lack of robustness to outliers, its high computational costs, the need for a large number of samples in high dimension and the difficulty to handle data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We insist in particular on unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.
    Algorithmic progress in computer vision. (arXiv:2212.05153v3 [cs.CV] UPDATED)
    We investigate algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision. We estimate a model, informed by work on neural scaling laws, and infer a decomposition of progress into the scaling of compute, data, and algorithms. Using Shapley values to attribute performance improvements, we find that algorithmic improvements have been roughly as important as the scaling of compute for progress in computer vision. Our estimates indicate that algorithmic innovations mostly take the form of compute-augmenting algorithmic advances (which enable researchers to get better performance from less compute), not data-augmenting algorithmic advances. We find that compute-augmenting algorithmic advances are made at a pace more than twice as fast as the rate usually associated with Moore's law. In particular, we estimate that compute-augmenting innovations halve compute requirements every nine months (95% confidence interval: 4 to 25 months).
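    The "halve every nine months" estimate can be turned into simple arithmetic. The sketch below is an illustration only: the function name `compute_reduction` is hypothetical, and the 24-month doubling period used for Moore's law is an assumption (the abstract only says "the rate usually associated with Moore's law").

```python
def compute_reduction(months, halving_time_months=9.0):
    """Factor by which compute requirements shrink after `months`,
    if they halve every `halving_time_months`."""
    return 2.0 ** (months / halving_time_months)


# Assumption, not from the abstract: Moore's law taken as a doubling every 24 months.
MOORE_DOUBLING_MONTHS = 24.0

# How many times faster a 9-month halving is than a 24-month doubling.
pace_ratio = MOORE_DOUBLING_MONTHS / 9.0

if __name__ == "__main__":
    for months in (9, 18, 36):
        print(months, compute_reduction(months))
    print(pace_ratio)
    # The same three-year horizon at the two ends of the reported 95% interval.
    print(compute_reduction(36, 4.0), compute_reduction(36, 25.0))
```

    Under that assumed 24-month figure the ratio is about 2.7, which is consistent with the abstract's "more than twice as fast". The wide confidence interval (4 to 25 months) changes the three-year reduction by orders of magnitude, so the point estimate should be read loosely.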
    Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties. (arXiv:2211.14645v2 [math.OC] UPDATED)
    We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov's accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works resort to assuming that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in [KY22] about obtaining global general acceleration without iterates assumptively staying in the feasible set. In our solution, we design an accelerated Riemannian inexact proximal point algorithm, which is a result that was unknown even with exact access to the proximal operator, and is of independent interest. For smooth functions, we show we can implement the prox step inexactly with first-order methods in Riemannian balls of certain diameter that is enough for global accelerated optimization.
    The Benefits of Model-Based Generalization in Reinforcement Learning. (arXiv:2211.02222v2 [cs.LG] UPDATED)
    Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
    Geometric Knowledge Distillation: Topology Compression for Graph Neural Networks. (arXiv:2210.13014v2 [cs.LG] UPDATED)
    We study a new paradigm of knowledge transfer that aims at encoding graph topological information into graph neural networks (GNNs) by distilling knowledge from a teacher GNN model trained on a complete graph to a student GNN model operating on a smaller or sparser graph. To this end, we revisit the connection between thermodynamics and the behavior of GNNs, based on which we propose the Neural Heat Kernel (NHK) to encapsulate the geometric property of the underlying manifold concerning the architecture of GNNs. A fundamental and principled solution, dubbed Geometric Knowledge Distillation, is derived by aligning NHKs on teacher and student models. We develop non-parametric and parametric instantiations and demonstrate their efficacy in various experimental settings for knowledge distillation regarding different types of privileged topological information and teacher-student schemes.
    SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning. (arXiv:2211.12509v2 [cs.LG] UPDATED)
    Recent years have witnessed remarkable advances in spatiotemporal predictive learning, incorporating auxiliary inputs, elaborate neural architectures, and sophisticated training strategies. Although impressive, the system complexity of mainstream methods is increasing as well, which may hinder the convenient applications. This paper proposes SimVP, a simple spatiotemporal predictive baseline model that is completely built upon convolutional networks without recurrent architectures and trained by common mean squared error loss in an end-to-end fashion. Without introducing any extra tricks and strategies, SimVP can achieve superior performance on various benchmark datasets. To further improve the performance, we derive variants with the gated spatiotemporal attention translator from SimVP that can achieve better performance. We demonstrate that SimVP has strong generalization and extensibility on real-world datasets through extensive experiments. The significant reduction in training cost makes it easier to scale to complex scenarios. We believe SimVP can serve as a solid baseline to benefit the spatiotemporal predictive learning community.
    torchode: A Parallel ODE Solver for PyTorch. (arXiv:2210.12375v2 [cs.LG] UPDATED)
    We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE's progress separately and is carefully optimized for GPUs and compatibility with PyTorch's JIT compiler. Its design lets researchers easily augment any aspect of the solver and collect and analyze internal solver statistics. In our experiments, our implementation is up to 4.3 times faster per step than other ODE solvers and it is robust against within-batch interactions that lead other solvers to take up to 4 times as many steps. Code available at https://github.com/martenlienen/torchode
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.
    Synthetic Dataset Generation for Privacy-Preserving Machine Learning. (arXiv:2210.03205v4 [cs.CR] UPDATED)
    Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical records, and data privacy becomes a major concern. Encryption methods could be a possible solution, however their deployment on ML applications seriously impacts classification accuracy and results in substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze visual privacy provided by our method, we use Image Quality Metrics and show high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data-privacy under various privacy-leakage attacks including Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
    Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. (arXiv:2211.14238v2 [cs.LG] UPDATED)
    Distribution shift occurs when the test distribution differs from the training distribution, and it can considerably degrade performance of machine learning models deployed in the real world. Temporal shifts -- distribution shifts arising from the passage of time -- often occur gradually and have the additional structure of timestamp metadata. By leveraging timestamp metadata, models can potentially learn from trends in past distribution shifts and extrapolate into the future. While recent works have studied distribution shifts, temporal shifts remain underexplored. To address this gap, we curate Wild-Time, a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, we systematically benchmark 13 prior approaches, including methods in domain generalization, continual learning, self-supervised learning, and ensemble learning. We use two evaluation strategies: evaluation with a fixed time split (Eval-Fix) and evaluation with a data stream (Eval-Stream). Eval-Fix, our primary evaluation strategy, aims to provide a simple evaluation protocol, while Eval-Stream is more realistic for certain real-world applications. Under both evaluation strategies, we observe an average performance drop of 20% from in-distribution to out-of-distribution data. Existing methods are unable to close this gap. Code is available at https://wild-time.github.io/.
    Local Bayesian optimization via maximizing probability of descent. (arXiv:2210.11662v2 [cs.LG] UPDATED)
    Local optimization presents a promising approach to expensive, high-dimensional black-box optimization by sidestepping the need to globally explore the search space. For objective functions whose gradient cannot be evaluated directly, Bayesian optimization offers one solution -- we construct a probabilistic model of the objective, design a policy to learn about the gradient at the current location, and use the resulting information to navigate the objective landscape. Previous work has realized this scheme by minimizing the variance in the estimate of the gradient, then moving in the direction of the expected gradient. In this paper, we re-examine and refine this approach. We demonstrate that, surprisingly, the expected value of the gradient is not always the direction maximizing the probability of descent, and in fact, these directions may be nearly orthogonal. This observation then inspires an elegant optimization scheme seeking to maximize the probability of descent while moving in the direction of most-probable descent. Experiments on both synthetic and real-world objectives show that our method outperforms previous realizations of this optimization scheme and is competitive against other, significantly more complicated baselines.
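    The gap between the expected gradient and the most-probable descent direction can be seen in a two-dimensional toy case. The sketch below is not the paper's algorithm: it assumes the belief about the gradient g is Gaussian with mean `mu` and a diagonal covariance `sigma_diag`, the numbers are made up, and all function names are hypothetical. Under that assumption the probability that a step along v decreases the objective to first order is P(g·v < 0) = Φ(-mu·v / sqrt(vᵀΣv)), which by the Cauchy-Schwarz inequality is maximized by v proportional to -Σ⁻¹mu.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_descent(v, mu, sigma_diag):
    """P(g . v < 0) for g ~ N(mu, diag(sigma_diag))."""
    mean = sum(m * x for m, x in zip(mu, v))
    var = sum(s * x * x for s, x in zip(sigma_diag, v))
    return normal_cdf(-mean / math.sqrt(var))


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Made-up belief: large mean but very uncertain in the first coordinate,
# small mean but nearly certain in the second.
mu = (1.0, 0.1)
sigma_diag = (100.0, 0.01)

neg_expected_gradient = tuple(-m for m in mu)
most_probable_descent = tuple(-m / s for m, s in zip(mu, sigma_diag))

p_expected = prob_descent(neg_expected_gradient, mu, sigma_diag)
p_most_probable = prob_descent(most_probable_descent, mu, sigma_diag)
angle_cosine = cosine(neg_expected_gradient, most_probable_descent)

if __name__ == "__main__":
    print(p_expected, p_most_probable, angle_cosine)
```

    In this toy setting, stepping against the expected gradient descends with probability of roughly one half, while the covariance-weighted direction does much better, and the cosine between the two directions is close to zero, i.e. they are nearly orthogonal.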
    Betting the system: Using lineups to predict football scores. (arXiv:2210.06327v3 [cs.LG] UPDATED)
    This paper aims to reduce randomness in football by analysing the role of lineups in final scores using machine learning prediction models we have developed. Football clubs invest millions of dollars in lineups, and knowing how individual statistics translate to better outcomes can optimise investments. Moreover, sports betting is growing exponentially and being able to predict the future is profitable and desirable. We use machine learning models and historical player data from the English Premier League (2020-2022) to predict scores and to understand how individual performance can improve the outcome of a match. We compared different prediction techniques to maximise the possibility of finding useful models. We created heuristic and machine learning models predicting football scores to compare different techniques. We used different sets of features and showed that goalkeepers' stats are more important than attackers' stats for predicting goals scored. We applied a broad evaluation process to assess the efficacy of the models in real world applications. We managed to correctly predict all relegated teams after forecasting 100 consecutive matches. We show that Support Vector Regression outperformed other techniques in predicting final scores and that lineups do not improve predictions. Finally, our model was profitable (42% return) when emulating a betting system using real world odds data.
    Towards Out-of-Distribution Sequential Event Prediction: A Causal Treatment. (arXiv:2210.13005v2 [cs.LG] UPDATED)
    The goal of sequential event prediction is to estimate the next event based on a sequence of historical events, with applications to sequential recommendation, user behavior analysis and clinical treatment. In practice, the next-event prediction models are trained with sequential data collected at one time and need to generalize to newly arrived sequences in remote future, which requires models to handle temporal distribution shift from training to testing. In this paper, we first take a data-generating perspective to reveal a negative result that existing approaches with maximum likelihood estimation would fail for distribution shift due to the latent context confounder, i.e., the common cause for the historical events and the next event. Then we devise a new learning objective based on backdoor adjustment and further harness variational inference to make it tractable for sequence learning problems. On top of that, we propose a framework with hierarchical branching structures for learning context-specific representations. Comprehensive experiments on diverse tasks (e.g., sequential recommendation) demonstrate the effectiveness, applicability and scalability of our method with various off-the-shelf models as backbones.
    An Exponentially Converging Particle Method for the Mixed Nash Equilibrium of Continuous Games. (arXiv:2211.01280v2 [math.OC] UPDATED)
    We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretisation cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
    Keypoint-GraspNet: Keypoint-based 6-DoF Grasp Generation from the Monocular RGB-D input. (arXiv:2209.08752v2 [cs.RO] UPDATED)
    Great success has been achieved in 6-DoF grasp learning from point cloud input, yet the computational cost due to the orderlessness of point sets remains a concern. As an alternative, we explore grasp generation from RGB-D input in this paper. The proposed solution, Keypoint-GraspNet, detects the projection of the gripper keypoints in the image space and then recovers the SE(3) poses with a PnP algorithm. A synthetic dataset based on primitive shapes and grasp families is constructed to examine our idea. Metric-based evaluation reveals that our method outperforms the baselines in terms of grasp proposal accuracy, diversity, and time cost. Finally, robot experiments show a high success rate, demonstrating the potential of the idea in real-world applications.
    Deep Counterfactual Estimation with Categorical Background Variables. (arXiv:2210.05811v4 [cs.LG] UPDATED)
    Referred to as the third rung of the causal inference ladder, counterfactual queries typically ask the "What if?" question retrospectively. The standard approach to estimating counterfactuals is to use a structural equation model that accurately reflects the underlying data generating process. However, such models are seldom available in practice and one usually wishes to infer them from observational data alone. Unfortunately, the correct structural equation model is in general not identifiable from the observed factual distribution. Nevertheless, in this work, we show that under the assumption that the main latent contributors to the treatment responses are categorical, the counterfactuals can still be reliably predicted. Building upon this assumption, we introduce CounterFactual Query Prediction (CFQP), a novel method to infer counterfactuals from continuous observations when the background variables are categorical. We show that our method significantly outperforms previously available deep-learning-based counterfactual methods, both theoretically and empirically on time series and image data. Our code is available at https://github.com/edebrouwer/cfqp.
    Automatic Generation of Product Concepts from Positive Examples, with an Application to Music Streaming. (arXiv:2210.01515v3 [cs.LG] UPDATED)
    Internet based businesses and products (e.g. e-commerce, music streaming) are becoming more and more sophisticated every day with a lot of focus on improving customer satisfaction. A core way they achieve this is by providing customers with an easy access to their products by structuring them in catalogues using navigation bars and providing recommendations. We refer to these catalogues as product concepts, e.g. product categories on e-commerce websites, public playlists on music streaming platforms. These product concepts typically contain products that are linked with each other through some common features (e.g. a playlist of songs by the same artist). How they are defined in the backend of the system can be different for different products. In this work, we represent product concepts using database queries and tackle two learning problems. First, given sets of products that all belong to the same unknown product concept, we learn a database query that is a representation of this product concept. Second, we learn product concepts and their corresponding queries when the given sets of products are associated with multiple product concepts. To achieve these goals, we propose two approaches that combine the concepts of PU learning with Decision Trees and Clustering. Our experiments demonstrate, via a simulated setup for a music streaming service, that our approach is effective in solving these problems.
    Improved Bounds on Neural Complexity for Representing Piecewise Linear Functions. (arXiv:2210.07236v3 [cs.LG] UPDATED)
    A deep neural network using rectified linear units represents a continuous piecewise linear (CPWL) function and vice versa. Recent results in the literature estimated that the number of neurons needed to exactly represent any CPWL function grows exponentially with the number of pieces or exponentially in terms of the factorial of the number of distinct linear components. Moreover, such growth is amplified linearly with the input dimension. These existing results seem to indicate that the cost of representing a CPWL function is expensive. In this paper, we propose much tighter bounds and establish a polynomial time algorithm to find a network satisfying these bounds for any given CPWL function. We prove that the number of hidden neurons required to exactly represent any CPWL function is at most a quadratic function of the number of pieces. In contrast to all previous results, this upper bound is invariant to the input dimension. Besides the number of pieces, we also study the number of distinct linear components in CPWL functions. When such a number is also given, we prove that the quadratic complexity turns into bilinear, which implies a lower neural complexity because the number of distinct linear components is always not greater than the minimum number of pieces in a CPWL function. When the number of pieces is unknown, we prove that, in terms of the number of distinct linear components, the neural complexities of any CPWL function are at most polynomial growth for low-dimensional inputs and factorial growth for the worst-case scenario, which are significantly better than existing results in the literature.
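    The correspondence between ReLU networks and CPWL functions can be checked on tiny cases. The sketch below is illustrative only and is not the construction from the paper: it writes three familiar CPWL functions as one-hidden-layer ReLU expressions. The function names are hypothetical.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def abs_via_relu(x):
    """|x| with two hidden neurons: relu(x) + relu(-x)."""
    return relu(x) + relu(-x)

def max_via_relu(a, b):
    """max(a, b) with one hidden neuron plus a linear skip term."""
    return b + relu(a - b)

def hat_via_relu(x):
    """Hat function (0, rising to 1 at x=1, back to 0 at x=2).

    It has four linear pieces and is represented exactly with three
    hidden neurons: relu(x) - 2*relu(x - 1) + relu(x - 2).
    """
    return relu(x) - 2.0 * relu(x - 1.0) + relu(x - 2.0)
```

    For example, `hat_via_relu(1.5)` evaluates to `0.5` and `hat_via_relu(3.0)` to `0.0`, matching the piecewise definition.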
    Semi-Supervised Junction Tree Variational Autoencoder for Molecular Property Prediction. (arXiv:2208.05119v5 [cs.LG] UPDATED)
    Molecular Representation Learning is essential to solving many drug discovery and computational chemistry problems. It is a challenging problem due to the complex structure of molecules and the vast chemical space. Graph representations of molecules are more expressive than traditional representations, such as molecular fingerprints. Therefore, they can improve the performance of machine learning models. We propose SeMole, a method that augments the Junction Tree Variational Autoencoders, a state-of-the-art generative model for molecular graphs, with semi-supervised learning. SeMole aims to improve the accuracy of molecular property prediction when having limited labeled data by exploiting unlabeled data. We enforce that the model generates molecular graphs conditioned on target properties by incorporating the property into the latent representation. We propose an additional pre-training phase to improve the training process for our semi-supervised generative model. We perform an experimental evaluation on the ZINC dataset using three different molecular properties and demonstrate the benefits of semi-supervision.
    Neural Observer with Lyapunov Stability Guarantee for Uncertain Nonlinear Systems. (arXiv:2208.13006v2 [math.OC] UPDATED)
    In this paper, we propose a novel nonlinear observer based on neural networks, called neural observer, for observation tasks of linear time-invariant (LTI) systems and uncertain nonlinear systems. In particular, the neural observer designed for uncertain systems is inspired by active disturbance rejection control, which can measure the uncertainty in real time. The stability analysis (e.g., exponential convergence rate) of LTI and uncertain nonlinear systems (involving neural observers) is presented and guaranteed, where it is shown that the observation problems can be solved using only linear matrix inequalities (LMIs). Also, it is revealed that the observability and controllability of the system matrices are required to demonstrate the existence of solutions of the LMIs. Finally, the effectiveness of neural observers is verified on three simulation cases, including the X-29A aircraft model, the nonlinear pendulum, and the four-wheel steering vehicle.
    Brain Tumor Segmentation using Enhanced U-Net Model with Empirical Analysis. (arXiv:2210.13336v2 [eess.IV] UPDATED)
    Cancer of the brain is deadly and requires careful surgical segmentation. The brain tumors were segmented with U-Net, a Convolutional Neural Network (CNN) architecture. When looking for overlaps of necrotic, edematous, growing, and healthy tissue, it might be hard to get relevant information from the images. The 2D U-Net network was improved and trained with the BraTS datasets to find these four areas. U-Net can set up many encoder and decoder routes that can be used to extract information from images for different purposes. To reduce computational time, we use image segmentation to exclude insignificant background details. Experiments on the BraTS datasets show that our proposed model for segmenting brain tumors from magnetic resonance imaging (MRI) works well. In this study, we demonstrate that the dice scores attained on the BraTS 2017, 2018, 2019, and 2020 datasets do not differ significantly from those attained on the BraTS 2019 dataset: 0.8717 (necrotic), 0.9506 (edema), and 0.9427 (enhancing).
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.
    Competition, Alignment, and Equilibria in Digital Marketplaces. (arXiv:2208.14423v2 [cs.GT] UPDATED)
    Competition between traditional platforms is known to improve user utility by aligning the platform's actions with user preferences. But to what extent is alignment exhibited in data-driven marketplaces? To study this question from a theoretical perspective, we introduce a duopoly market where platform actions are bandit algorithms and the two platforms compete for user participation. A salient feature of this market is that the quality of recommendations depends on both the bandit algorithm and the amount of data provided by interactions from users. This interdependency between the algorithm performance and the actions of users complicates the structure of market equilibria and their quality in terms of user utility. Our main finding is that competition in this market does not perfectly align market outcomes with user utility. Interestingly, market outcomes exhibit misalignment not only when the platforms have separate data repositories, but also when the platforms have a shared data repository. Nonetheless, the data sharing assumptions impact what mechanism drives misalignment and also affect the specific form of misalignment (e.g. the quality of the best-case and worst-case market outcomes). More broadly, our work illustrates that competition in digital marketplaces has subtle consequences for user utility that merit further investigation.
    Pattern Attention Transformer with Doughnut Kernel. (arXiv:2211.16961v2 [cs.CV] UPDATED)
    We present in this paper a new architecture, the Pattern Attention Transformer (PAT), which is built on the new doughnut kernel. Compared with tokens in the NLP field, Transformers in computer vision face the problem of handling the high resolution of pixels in images. In ViT, an image is cut into square-shaped patches. As the follow-up to ViT, Swin Transformer proposes an additional shifting step to reduce the presence of fixed boundaries, which also makes 'two connected Swin Transformer blocks' the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel enhances the design of patches further. It replaces the line-cut boundaries with two types of areas, sensor and updating, based on an understanding of self-attention (named the QKVA grid). The doughnut kernel also raises a new topic: kernel shapes beyond squares. To verify its performance on image classification, PAT is designed with Transformer blocks of regular-octagon-shaped doughnut kernels. Its architecture is lighter: the minimum number of pattern attention layers is only one per stage. Under similar computational complexity, its performance on ImageNet 1K reaches higher throughput (+10%) and surpasses Swin Transformer (+0.1 acc1).
    Breast Cancer Classification using Deep Learned Features Boosted with Handcrafted Features. (arXiv:2206.12815v2 [eess.IV] UPDATED)
    Breast cancer is one of the leading causes of death among women across the globe. It is difficult to treat if detected at advanced stages; however, early detection can significantly increase the chances of survival and improve the lives of millions of women. Given the widespread prevalence of breast cancer, it is of utmost importance for the research community to come up with a framework for early detection, classification and diagnosis. The artificial intelligence research community, in coordination with medical practitioners, is developing such frameworks to automate the task of detection. With the surge in research activity, coupled with the availability of large datasets and enhanced computational power, it is expected that AI framework results will help even more clinicians in making correct predictions. In this article, a novel framework for classification of breast cancer using mammograms is proposed. The proposed framework combines robust features extracted by a novel Convolutional Neural Network (CNN) with handcrafted features including HOG (Histogram of Oriented Gradients) and LBP (Local Binary Pattern). The results obtained on the CBIS-DDSM dataset exceed the state of the art.
    Quantized Training of Gradient Boosting Decision Trees. (arXiv:2207.09682v2 [cs.LG] UPDATED)
    Recent years have witnessed significant success in Gradient Boosting Decision Trees (GBDT) for a wide range of machine learning applications. The general consensus about GBDT training algorithms is that gradients and statistics are computed with high-precision floating-point numbers. In this paper, we investigate an important question which has been largely ignored by the previous literature: how many bits are needed for representing gradients in training GBDT? To solve this mystery, we propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT training algorithm. Surprisingly, both our theoretical analysis and empirical studies show that the gradient precision needed to avoid hurting performance can be quite low, e.g., 2 or 3 bits. With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits. Promisingly, these findings may pave the way for much more efficient training of GBDT in several respects: (1) speeding up the computation of gradient statistics in histograms; (2) compressing the communication cost of high-precision statistical information during distributed training; (3) inspiring the use and development of hardware architectures that support low-precision computation well for GBDT training. Benchmarked on CPUs, GPUs, and distributed clusters, we observe up to 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on extensive datasets, demonstrating the effectiveness and potential of the low-precision training of GBDT. The code will be released to the official repository of LightGBM.
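    A minimal sketch of what low-bit gradient quantization can look like follows. The abstract does not spell out the scheme, so the stochastic rounding used here, the function name `quantize_gradients`, and the scaling rule are all assumptions chosen for illustration, not the LightGBM implementation. Stochastic rounding keeps the quantized values unbiased, so integer histogram sums approximate the floating-point sums after rescaling.

```python
import math
import random

def quantize_gradients(grads, bits=3, rng=None):
    """Map float gradients to signed integers in [-levels, levels].

    Assumed scheme: scale by max |g|, then round stochastically so that
    the expected value of q * scale equals the original gradient.
    """
    rng = rng or random.Random(0)
    levels = 2 ** (bits - 1) - 1          # e.g. 3 bits -> integers in [-3, 3]
    max_abs = max(abs(g) for g in grads)
    scale = max_abs / levels if max_abs > 0 else 1.0
    # floor(x + u) with u ~ U[0, 1) rounds x up with probability frac(x).
    quantized = [math.floor(g / scale + rng.random()) for g in grads]
    return quantized, scale

# Hypothetical gradients for 10,000 training samples.
data_rng = random.Random(42)
grads = [data_rng.gauss(0.0, 1.0) for _ in range(10000)]

q, scale = quantize_gradients(grads, bits=3)

# A histogram bin would accumulate the integers and rescale once at the end.
float_mean = sum(grads) / len(grads)
quantized_mean = sum(q) * scale / len(q)
```

    Accumulating `q` needs only integer additions; the single multiplication by `scale` happens after the sum, which is where the savings in histogram construction would come from under these assumptions.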
    Progressive Domain Adaptation with Contrastive Learning for Object Detection in the Satellite Imagery. (arXiv:2209.02564v2 [cs.CV] UPDATED)
    Images in aerial datasets are very large in resolution, and each frame contains many dense and small objects. State-of-the-art detection methods fail to capture small objects, local features, and region proposals for densely overlapped objects in aerial imagery due to the high variation of object sizes in satellite imagery with respect to the image size and the high variation of content. Aerial imagery content varies greatly within a dataset due to large changes in lighting conditions and in the type of ground imagery captured from high altitudes. The variation is even higher between different datasets, as object sizes, class distributions, image acquisition, and weather conditions can vary even more drastically. Thus, Domain Adaptation (DA) has been introduced as a band-aid to alleviate the degradation of object identification in previously unseen datasets. In this paper, we propose a small object detection pipeline that improves the feature extraction process through spatial pyramid pooling, cross-stage partial networks, and a heat-map-based region proposal network, and that improves object localization and identification through a novel image difficulty score that adapts the overall focal loss measure based on the image difficulty. Next, we propose novel contrastive learning with progressive domain adaptation to produce domain-invariant features across aerial datasets using local and global features. Analysis and illustration of different performance metrics and challenges show that our proposed method is comparable to the current state-of-the-art models and creates a first-ever Domain Adaptation benchmark for the object detection task in highly imbalanced satellite datasets with large domain gaps and dominant small objects.
    Sequence Model Imitation Learning with Unobserved Contexts. (arXiv:2208.02225v3 [cs.LG] UPDATED)
    We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions of whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.
    ViNL: Visual Navigation and Locomotion Over Obstacles. (arXiv:2210.14791v2 [cs.RO] UPDATED)
    We present Visual Navigation and Locomotion over obstacles (ViNL), which enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path (e.g., shoes, toys, cables), similar to how humans and pets lift their feet over objects as they walk. ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guide the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands. Both policies are entirely "model-free", i.e. sensors-to-actions neural networks trained end-to-end. The two are trained independently in two entirely different simulators and then seamlessly co-deployed by feeding the velocity commands from the navigator to the locomotor, entirely "zero-shot" (without any co-training). While prior works have developed learning methods for visual navigation or visual locomotion, to the best of our knowledge, this is the first fully learned approach that leverages vision to accomplish both (1) intelligent navigation in new environments, and (2) intelligent visual locomotion that aims to traverse cluttered environments without disrupting obstacles. On the task of navigation to distant goals in unknown environments, ViNL using just egocentric vision significantly outperforms prior work on robust locomotion using privileged terrain maps (+32.8% success and -4.42 collisions per meter). Additionally, we ablate our locomotion policy to show that each aspect of our approach helps reduce obstacle collisions. Videos and code at this http URL
    Efficient Signed Graph Sampling via Balancing & Gershgorin Disc Perfect Alignment. (arXiv:2208.08726v2 [eess.SP] UPDATED)
    A basic premise in graph signal processing (GSP) is that a graph encoding pairwise (anti-)correlations of the targeted signal as edge weights is exploited for graph filtering. However, existing fast graph sampling schemes are designed and tested only for positive graphs describing positive correlations. In this paper, we show that for datasets with strong inherent anti-correlations, a suitable graph contains both positive and negative edge weights. In response, we propose a linear-time signed graph sampling method centered on the concept of balanced signed graphs. Specifically, given an empirical covariance data matrix $\bar{\bf{C}}$, we first learn a sparse inverse matrix (graph Laplacian) $\mathcal{L}$ corresponding to a signed graph $\mathcal{G}$. We define the eigenvectors of Laplacian $\mathcal{L}_B$ for a balanced signed graph $\mathcal{G}_B$ -- approximating $\mathcal{G}$ via edge weight augmentation -- as graph frequency components. Next, we choose samples to minimize the low-pass filter reconstruction error in two steps. We first align all Gershgorin disc left-ends of Laplacian $\mathcal{L}_B$ at smallest eigenvalue $\lambda_{\min}(\mathcal{L}_B)$ via similarity transform $\mathcal{L}_p = \mathbf{S} \mathcal{L}_B \mathbf{S}^{-1}$, leveraging a recent linear algebra theorem called Gershgorin disc perfect alignment (GDPA). We then perform sampling on $\mathcal{L}_p$ using a previous fast Gershgorin disc alignment sampling (GDAS) scheme. Experimental results show that our signed graph sampling method outperformed existing fast sampling schemes noticeably on various datasets.
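    The disc-alignment step can be illustrated on a 2x2 case. The sketch below is an assumption-laden toy and not the paper's code: the matrix `L` is a small Laplacian-like matrix invented for illustration, its smallest eigenvalue is computed in closed form, and the diagonal scaling uses the reciprocal entries of the first eigenvector, which is how the alignment is commonly described for positive graphs.

```python
import math

def disc_left_ends(M):
    """Gershgorin disc left-ends: M[i][i] - sum_{j != i} |M[i][j]|."""
    n = len(M)
    return [M[i][i] - sum(abs(M[i][j]) for j in range(n) if j != i)
            for i in range(n)]

def similarity_transform(M, s):
    """Compute S M S^{-1} for S = diag(s): entry (i, j) is s[i] * M[i][j] / s[j]."""
    n = len(M)
    return [[s[i] * M[i][j] / s[j] for j in range(n)] for i in range(n)]

# Hypothetical generalized Laplacian (one positive edge plus self-loops).
L = [[2.0, -1.0],
     [-1.0, 3.0]]

# Smallest root of the characteristic polynomial t^2 - 5t + 5.
lam_min = (5.0 - math.sqrt(5.0)) / 2.0
# Matching eigenvector from the first row: (2 - lam) * v0 - v1 = 0.
v = [1.0, 2.0 - lam_min]

before = disc_left_ends(L)                      # unaligned left-ends
s = [1.0 / vi for vi in v]                      # assumed scaling: s_i = 1 / v_i
aligned = disc_left_ends(similarity_transform(L, s))
```

    Before scaling, the left-ends are 1.0 and 2.0; after scaling, both coincide with `lam_min`, so the Gershgorin lower bound on the spectrum becomes tight.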
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. It is shown in the numerical experiment to deblur an image that the convergence behavior in the logarithmic-scale ordinate tends to be linear instead of logarithmic, approximating to be flat. On closer inspection, we find that the previous assumption that the smooth part is merely convex is too weak for the least-squares model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-squares model, even though the image matrix is probably ill-conditioned. Furthermore, we tighten the pivotal inequality for composite optimization, first found in [Li et al., 2022], by taking the smooth part to be strongly convex instead of generally convex. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we replace the original blur matrix with a simple ill-conditioned matrix whose singular values are easy to compute. The new numerical experiment shows that the proximal generalization of Nesterov's accelerated gradient descent (NAG) for strongly convex functions has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.
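    For readers unfamiliar with ISTA, a minimal pure-Python sketch for the lasso objective 0.5 * ||Ax - b||^2 + lam * ||x||_1 is given below. It is a textbook version written for illustration, not the authors' experimental code; the matrix `A`, vector `b`, and parameter values are made up, and the step size is assumed to be at most 1 / L, where L is the largest eigenvalue of A^T A.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(A, b, lam, step, iters=200):
    """Iterative shrinkage-thresholding for 0.5*||Ax - b||^2 + lam*||x||_1."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - b.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        # Gradient of the smooth part: A^T r.
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the shrinkage (proximal) step.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x

# Hypothetical diagonal problem; L = 4, so step = 0.25 satisfies step <= 1/L.
A = [[2.0, 0.0],
     [0.0, 1.0]]
b = [4.0, 0.5]
x_hat = ista(A, b, lam=1.0, step=0.25)
```

    On this separable example the minimizer can be checked by hand: the first coordinate solves 2(2x - 4) + 1 = 0, giving 1.75, and the second is shrunk to exactly zero.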
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
    Self-Supervised Learning for Anomalous Channel Detection in EEG Graphs: Application to Seizure Analysis. (arXiv:2208.07448v4 [cs.LG] UPDATED)
    Electroencephalogram (EEG) signals are effective tools for seizure analysis, where one of the most important challenges is the accurate detection of seizure events and of the brain regions in which a seizure happens or initiates. However, all existing machine learning-based algorithms for seizure analysis require access to labeled seizure data, while acquiring labeled data is very labor-intensive, expensive, and clinician-dependent given the subjective nature of the visual qualitative interpretation of EEG signals. In this paper, we propose to detect seizure channels and clips in a self-supervised manner where no access to the seizure data is needed. The proposed method considers local structural and contextual information embedded in EEG graphs by employing positive and negative sub-graphs. We train our method by minimizing contrastive and generative losses. The use of local EEG sub-graphs makes the algorithm an appropriate choice when access to all EEG channels is impossible due to complications such as skull fractures. We conduct an extensive set of experiments on the largest seizure dataset and demonstrate that our proposed framework outperforms the state-of-the-art methods in EEG-based seizure study. The proposed method is the only one that requires no access to the seizure data in its training phase, yet it establishes a new state of the art in the field and outperforms all related supervised methods.
    Dilated Neighborhood Attention Transformer. (arXiv:2209.15001v3 [cs.CV] UPDATED)
    Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.6% box AP in COCO object detection, 1.4% mask AP in COCO instance segmentation, and 1.4% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.5 PQ) and ADE20K (49.4 PQ), and instance segmentation model on Cityscapes (45.1 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.1 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
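    The idea of dilating a local attention neighborhood can be sketched in one dimension. The helper below is hypothetical and simplified: it only lists which key positions a query would attend to, and it drops out-of-range positions at the borders, which is an assumption made for brevity and may differ from how NA and DiNA handle edges.

```python
def dilated_neighbors(i, length, kernel_size=3, dilation=1):
    """Key indices attended to by query i in a 1D sequence.

    Neighbors are spaced `dilation` apart and centered on i; positions
    outside [0, length) are dropped (a simplifying assumption).
    """
    half = kernel_size // 2
    candidates = [i + dilation * offset for offset in range(-half, half + 1)]
    return [j for j in candidates if 0 <= j < length]

def span(kernel_size, dilation):
    """Number of positions covered between the first and last neighbor."""
    return (kernel_size - 1) * dilation + 1

local = dilated_neighbors(8, 16, kernel_size=3, dilation=1)    # [7, 8, 9]
sparse = dilated_neighbors(8, 16, kernel_size=3, dilation=4)   # [4, 8, 12]
```

    Both calls attend to three keys, so the per-query cost is the same, but the dilated version covers a span of 9 positions instead of 3. Alternating the two patterns across layers is one way to combine local detail with wider context.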
    Tightening Discretization-based MILP Models for the Pooling Problem using Upper Bounds on Bilinear Terms. (arXiv:2207.03699v2 [math.OC] UPDATED)
    Discretization-based methods have been proposed for solving nonconvex optimization problems with bilinear terms such as the pooling problem. These methods convert the original nonconvex optimization problems into mixed-integer linear programs (MILPs). In this paper we study tightening methods for these MILP models for the pooling problem, and derive valid constraints using upper bounds on bilinear terms. Computational results demonstrate the effectiveness of our methods in terms of reducing solution time.
    Learning with Muscles: Benefits for Data-Efficiency and Robustness in Anthropomorphic Tasks. (arXiv:2207.03952v2 [cs.RO] UPDATED)
    Humans are able to outperform robots in terms of robustness, versatility, and learning of new tasks in a wide variety of movements. We hypothesize that highly nonlinear muscle dynamics play a large role in providing inherent stability, which is favorable to learning. While recent advances have been made in applying modern learning techniques to muscle-actuated systems both in simulation as well as in robotics, so far, no detailed analysis has been performed to show the benefits of muscles when learning from scratch. Our study closes this gap and showcases the potential of muscle actuators for core robotics challenges in terms of data-efficiency, hyperparameter sensitivity, and robustness.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationarity assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature of the environment and the lack of supervision make the problem challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.
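    To make the learning problem in the abstract above concrete, the following sketch generates a batch of $k$-sparse parity examples over $n$ bits. The ±1 encoding, the function name, and the uniform input distribution are assumptions chosen for illustration; the paper's exact experimental setup may differ.

```python
import random


def sparse_parity_batch(n, k, batch_size, rng, support=None):
    """Sample inputs in {-1, +1}^n and label each by the parity of k hidden bits."""
    if support is None:
        # Hidden set of k relevant coordinates (unknown to the learner).
        support = sorted(rng.sample(range(n), k))
    xs = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(batch_size)]
    ys = []
    for x in xs:
        label = 1
        for idx in support:
            label *= x[idx]  # product of +/-1 values encodes the parity
        ys.append(label)
    return xs, ys, support


rng = random.Random(0)
xs, ys, support = sparse_parity_batch(n=20, k=3, batch_size=64, rng=rng)
```

    Every coordinate outside `support` is pure noise, so the label is statistically easy to fit once the support is known, while finding the support by brute force requires searching over all size-$k$ subsets of the $n$ coordinates.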
    OpenXAI: Towards a Transparent Evaluation of Model Explanations. (arXiv:2206.11104v3 [cs.LG] UPDATED)
    While several types of post hoc explanation methods (e.g., feature attribution methods) have been proposed in recent literature, there is little to no work on systematically benchmarking these methods in an efficient and transparent manner. Here, we introduce OpenXAI, a comprehensive and extensible open source framework for evaluating and benchmarking post hoc explanation methods. OpenXAI comprises the following key components: (i) a flexible synthetic data generator and a collection of diverse real-world datasets, pre-trained models, and state-of-the-art feature attribution methods, (ii) open-source implementations of twenty-two quantitative metrics for evaluating faithfulness, stability (robustness), and fairness of explanation methods, and (iii) the first ever public XAI leaderboards to benchmark explanations. OpenXAI is easily extensible, as users can readily evaluate custom explanation methods and incorporate them into our leaderboards. Overall, OpenXAI provides an automated end-to-end pipeline that not only simplifies and standardizes the evaluation of post hoc explanation methods, but also promotes transparency and reproducibility in benchmarking these methods. OpenXAI datasets and data loaders, implementations of state-of-the-art explanation methods and evaluation metrics, as well as leaderboards are publicly available at https://open-xai.github.io/.
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms is proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under the nonconvex-linear setting to reach an $\varepsilon$-stationary point. To our knowledge, they are the first two algorithms with iteration complexity guarantees for solving nonconvex minimax problems with coupled linear constraints.
    Learning Deep Input-Output Stable Dynamics. (arXiv:2206.13093v3 [math.DS] UPDATED)
    Learning stable dynamics from observed time-series data is an essential problem in robotics, physical modeling, and systems biology. Many of these dynamics are represented as input-output systems that communicate with the external environment. In this study, we focus on input-output stable systems, which exhibit robustness against unexpected stimuli and noise. We propose a method to learn nonlinear systems that guarantees input-output stability. Our proposed method uses a differentiable projection onto the space satisfying the Hamilton-Jacobi inequality to realize input-output stability. The problem of finding this projection can be formulated as a quadratically constrained quadratic programming problem, and we derive a particular solution analytically. We also apply our method to a toy bistable model and to the task of training on a benchmark generated from a glucose-insulin simulator. The results show that a nonlinear system with neural networks trained by our method achieves input-output stability, unlike naive neural networks. Our code is available at https://github.com/clinfo/DeepIOStability.
    Evaluating Explainability for Graph Neural Networks. (arXiv:2208.09339v2 [cs.LG] UPDATED)
    As post hoc explanations are increasingly used to understand the behavior of graph neural networks (GNNs), it becomes crucial to evaluate the quality and reliability of GNN explanations. However, assessing the quality of GNN explanations is challenging as existing graph datasets have no or unreliable ground-truth explanations for a given task. Here, we introduce a synthetic graph data generator, ShapeGGen, which can generate a variety of benchmark datasets (e.g., varying graph sizes, degree distributions, homophilic vs. heterophilic graphs) accompanied by ground-truth explanations. Further, the flexibility to generate diverse synthetic datasets and corresponding ground-truth explanations allows us to mimic the data generated by various real-world applications. We include ShapeGGen and several real-world graph datasets into an open-source graph explainability library, GraphXAI. In addition to synthetic and real-world graph datasets with ground-truth explanations, GraphXAI provides data loaders, data processing functions, visualizers, GNN model implementations, and evaluation metrics to benchmark the performance of GNN explainability methods.
    On the effectiveness of persistent homology. (arXiv:2206.10551v3 [math.AT] UPDATED)
    Persistent homology (PH) is one of the most popular methods in Topological Data Analysis. Even though PH has been used in many different types of applications, the reasons behind its success remain elusive; in particular, it is not known for which classes of problems it is most effective, or to what extent it can detect geometric or topological features. The goal of this work is to identify some types of problems where PH performs well or even better than other methods in data analysis. We consider three fundamental shape analysis tasks: the detection of the number of holes, curvature and convexity from 2D and 3D point clouds sampled from shapes. Experiments demonstrate that PH is successful in these tasks, outperforming several baselines, including PointNet, an architecture inspired precisely by the properties of point clouds. In addition, we observe that PH remains effective for limited computational resources and limited training data, as well as out-of-distribution test data, including various data transformations and noise. For convexity detection, we provide a theoretical guarantee that PH is effective for this task in $\mathbb{R}^d$, and demonstrate the detection of a convexity measure on the FLAVIA data set of plant leaf images. Due to the crucial role of shape classification in understanding mathematical and physical structures and objects, and in many applications, the findings of this work will provide some knowledge about the types of problems that are appropriate for PH, so that it can - to borrow the words from Wigner 1960 - ``remain valid in future research, and extend, to our pleasure", but to our lesser bafflement, to a variety of applications.
    Continual Prune-and-Select: Class-incremental learning with specialized subnetworks. (arXiv:2208.04952v2 [cs.LG] UPDATED)
    The human brain is capable of learning tasks sequentially mostly without forgetting. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning one task after another. We address this challenge considering a class-incremental learning scenario where the DNN sees test data without knowing the task from which this data originates. During training, Continual-Prune-and-Select (CP&S) finds a subnetwork within the DNN that is responsible for solving a given task. Then, during inference, CP&S selects the correct subnetwork to make predictions for that task. A new task is learned by training available neuronal connections of the DNN (previously untrained) to create a new subnetwork by pruning, which can include previously trained connections belonging to other subnetwork(s) because it does not update shared connections. This makes it possible to eliminate catastrophic forgetting by creating specialized regions in the DNN that do not conflict with each other while still allowing knowledge transfer across them. The CP&S strategy is implemented with different subnetwork selection strategies, revealing superior performance to state-of-the-art continual learning methods tested on various datasets (CIFAR-100, CUB-200-2011, ImageNet-100 and ImageNet-1000). In particular, CP&S is capable of sequentially learning 10 tasks from ImageNet-1000 while keeping an accuracy of around 94% with negligible forgetting, a first-of-its-kind result in class-incremental learning. To the best of the authors' knowledge, this represents an improvement in accuracy above 10% when compared to the best alternative method.
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.
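    The witness-based test described above can be sketched in a few lines. This example is only an illustration under stated assumptions: a 1-D linear least-squares model stands in for the AutoML-trained witness, the data are split in half for fitting and testing, and the p-value comes from a permutation procedure. None of these choices or function names is taken from the autotst package.

```python
import random
from statistics import mean


def fit_linear_witness(xs, ys):
    """Least-squares fit of h(z) = a*z + b to labels +1 (sample X) and -1 (sample Y)."""
    data = [(x, 1.0) for x in xs] + [(y, -1.0) for y in ys]
    z_bar = mean(z for z, _ in data)
    t_bar = mean(t for _, t in data)
    var = sum((z - z_bar) ** 2 for z, _ in data)
    cov = sum((z - z_bar) * (t - t_bar) for z, t in data)
    a = cov / var if var > 0 else 0.0
    b = t_bar - a * z_bar
    return lambda z: a * z + b


def witness_test(xs, ys, rng, n_perm=500):
    """Return (statistic, permutation p-value) for H0: both samples share a distribution."""
    half_x, half_y = len(xs) // 2, len(ys) // 2
    # Fit the witness on the first halves, evaluate it on the held-out halves.
    h = fit_linear_witness(xs[:half_x], ys[:half_y])
    hx = [h(z) for z in xs[half_x:]]
    hy = [h(z) for z in ys[half_y:]]
    stat = mean(hx) - mean(hy)  # mean discrepancy of the witness
    pooled = hx + hy
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stat = mean(pooled[:len(hx)]) - mean(pooled[len(hx):])
        if perm_stat >= stat:
            count += 1
    return stat, (count + 1) / (n_perm + 1)


rng = random.Random(0)
sample_p = [rng.gauss(0.0, 1.0) for _ in range(200)]
sample_q = [rng.gauss(1.5, 1.0) for _ in range(200)]
stat, p_value = witness_test(sample_p, sample_q, rng)
```

    Swapping `fit_linear_witness` for any regressor trained with a squared loss keeps the rest of the procedure unchanged, which is the property that lets an AutoML system supply the witness.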
    Towards Interpreting Vulnerability of Multi-Instance Learning via Customized and Universal Adversarial Perturbations. (arXiv:2211.17071v2 [cs.CV] UPDATED)
    Multiple-Instance Learning (MIL) is a recent machine learning paradigm which is immensely useful in various real-life applications, like image analysis, video anomaly detection, text classification, etc. It is well known that most of the existing machine learning classifiers are highly vulnerable to adversarial perturbations. Since MIL is a weakly supervised learning setting, where information is available for a set of instances, called a bag, and not for every instance, adversarial perturbations can be fatal. In this paper, we propose two adversarial perturbation methods to analyze the effect of adversarial perturbations and to interpret the vulnerability of MIL methods. Of the two algorithms, one can be customized for every bag, and the other is a universal one, which can affect all bags in a given data set and thus has some generalizability. Through simulations, we also show the effectiveness of the proposed algorithms in fooling state-of-the-art (SOTA) MIL methods. Finally, we discuss, through experiments, how to defend against these kinds of adversarial perturbations with a simple strategy. Source code is available at https://github.com/InkiInki/MI-UAP.
    NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v3 [cs.LG] UPDATED)
    The graph Transformer has emerged as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens so as to train the Transformer model, making it hard to scale to large graphs because the self-attention computation has quadratic complexity in the number of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations and thereby produces a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus can scale to large graphs. Moreover, we mathematically show that, compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer can learn more informative node representations from the multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large are conducted to demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
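    The per-node token sequence described above can be sketched as repeated neighborhood averaging. This is a minimal illustration, assuming a dense adjacency matrix and row normalization; the function name and the normalization choice are assumptions for this example and may not match the Hop2Token module in the paper.

```python
def hop_tokens(adjacency, features, num_hops):
    """Return, for every node, a list of num_hops + 1 token vectors (hop 0 .. hop K)."""
    n = len(adjacency)
    dim = len(features[0])
    # Row-normalise the adjacency matrix so each hop averages over neighbours
    # (assumption: other normalisations are equally plausible).
    norm = []
    for row in adjacency:
        degree = sum(row)
        norm.append([v / degree if degree else 0.0 for v in row])
    tokens = [[list(features[i])] for i in range(n)]  # hop 0: the node itself
    current = [list(f) for f in features]
    for _ in range(num_hops):
        # One more propagation step: current <- norm @ current
        current = [
            [sum(norm[i][j] * current[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)
        ]
        for i in range(n):
            tokens[i].append(list(current[i]))
    return tokens


# Path graph 0 - 1 - 2 with a one-dimensional feature on node 0 only.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
tokens = hop_tokens(adjacency, features, num_hops=2)
print(tokens[0])  # [[1.0], [0.0], [0.5]]
```

    Because every node ends up with its own fixed-length sequence, the sequences can be precomputed once and then batched independently of the graph size, which is what enables mini-batch training.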
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.
    A Search-Based Testing Approach for Deep Reinforcement Learning Agents. (arXiv:2206.07813v2 [cs.SE] UPDATED)
    Deep Reinforcement Learning (DRL) algorithms have been increasingly employed during the last decade to solve various decision-making problems such as autonomous driving and robotics. However, these algorithms have faced great challenges when deployed in safety-critical environments since they often exhibit erroneous behaviors that can lead to potentially critical errors. One way to assess the safety of DRL agents is to test them to detect possible faults leading to critical failures during their execution. This raises the question of how we can efficiently test DRL policies to ensure their correctness and adherence to safety requirements. Most existing works on testing DRL agents use adversarial attacks that perturb states or actions of the agent. However, such attacks often lead to unrealistic states of the environment. Their main goal is to test the robustness of DRL agents rather than testing the compliance of agents' policies with respect to requirements. Due to the huge state space of DRL environments, the high cost of test execution, and the black-box nature of DRL algorithms, the exhaustive testing of DRL agents is impossible. In this paper, we propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent by effectively searching for failing executions of the agent within a limited testing budget. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes. We apply STARLA on Deep-Q-Learning agents which are widely used as benchmarks and show that it significantly outperforms Random Testing by detecting more faults related to the agent's policy. We also investigate how to extract rules that characterize faulty episodes of the DRL agent using our search results. Such rules can be used to understand the conditions under which the agent fails and thus assess its deployment risks.
    Long Range Graph Benchmark. (arXiv:2206.08164v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) that are based on the message passing (MP) paradigm generally exchange information between 1-hop neighbors to build node representations at each layer. In principle, such networks are not able to capture long-range interactions (LRI) that may be desired or necessary for learning a given task on graphs. Recently, there has been an increasing interest in development of Transformer-based methods for graphs that can consider full node connectivity beyond the original sparse structure, thus enabling the modeling of LRI. However, MP-GNNs that simply rely on 1-hop message passing often fare better in several existing graph benchmarks when combined with positional feature representations, among other innovations, hence limiting the perceived utility and ranking of Transformer-like architectures. Here, we present the Long Range Graph Benchmark (LRGB) with 5 graph learning datasets: PascalVOC-SP, COCO-SP, PCQM-Contact, Peptides-func and Peptides-struct that arguably require LRI reasoning to achieve strong performance in a given task. We benchmark both baseline GNNs and Graph Transformer networks to verify that the models which capture long-range dependencies perform significantly better on these tasks. Therefore, these datasets are suitable for benchmarking and exploration of MP-GNNs and Graph Transformer architectures that are intended to capture LRI.
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the quality of learned representations is sensitive to the choice of data augmentation: as harder data augmentations are applied, the views share more task-relevant information, but also more task-irrelevant information that can hinder the generalization capability of the representation. Motivated by this, we present a new robust contrastive learning scheme, coined RényiCL, which can effectively manage harder augmentations by utilizing Rényi divergence. Our method is built upon the variational lower bound of Rényi divergence, but a naïve usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew Rényi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that Rényi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that they can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that Rényi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular data, showing empirical gain over other contrastive methods.
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly out-performs IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.
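    The quantity that MRI asks to be conserved, the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, can be estimated per environment and compared across environments. The sketch below does this for a scalar feature; the gap measure (variance of the conditional means across environments) and the function names are assumptions for illustration and are not the MRI-v1 objective from the paper.

```python
from statistics import mean, pvariance


def conditional_feature_means(features, labels):
    """Map each label y to the empirical mean of f(x) over samples with that label."""
    groups = {}
    for f, y in zip(features, labels):
        groups.setdefault(y, []).append(f)
    return {y: mean(vals) for y, vals in groups.items()}


def invariance_gap(environments):
    """Sum over labels of the variance, across environments, of E_e[f(x) | y]."""
    per_env = [conditional_feature_means(f, y) for f, y in environments]
    # Only labels observed in every environment can be compared.
    shared = set(per_env[0]).intersection(*per_env[1:])
    return sum(pvariance([m[y] for m in per_env]) for y in shared)


# Each environment is (feature values f(x), labels y).
env_a = ([1.0, 1.0, -1.0], [1, 1, 0])
env_b = ([1.0, -1.0, -1.0], [1, 0, 0])
env_c = ([3.0, -1.0], [1, 0])
print(invariance_gap([env_a, env_b]))  # 0.0: conditional means agree
print(invariance_gap([env_a, env_c]))  # 1.0: class-1 mean shifts from 1 to 3
```

    A gap of zero means the feature's class-conditional mean is identical in every environment, which is the invariance the abstract describes; a feature that is spuriously correlated with the label would typically show a nonzero gap.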
    Counterfactual Supervision-based Information Bottleneck for Out-of-Distribution Generalization. (arXiv:2208.07798v3 [cs.LG] UPDATED)
    Learning invariant (causal) features for out-of-distribution (OOD) generalization has attracted extensive attention recently, and among the proposals invariant risk minimization (IRM) is a notable solution. In spite of its theoretical promise for linear regression, the challenges of using IRM in linear classification problems remain. By introducing the information bottleneck (IB) principle into the learning of IRM, the IB-IRM approach has demonstrated its power to solve these challenges. In this paper, we further improve IB-IRM in two respects. First, we show that the key assumption of support overlap of invariant features used in IB-IRM is strong for the guarantee of OOD generalization, and that it is still possible to achieve the optimal solution without this assumption. Second, we illustrate two failure modes in which IB-IRM (and IRM) can fail to learn the invariant features, and to address such failures, we propose a Counterfactual Supervision-based Information Bottleneck (CSIB) learning algorithm that provably recovers the invariant features. By requiring counterfactual inference, CSIB works even when accessing data from a single environment. Empirical experiments on several datasets verify our theoretical results.
    Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards. (arXiv:2206.01293v2 [cs.LG] UPDATED)
    Incrementality, which is used to measure the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner without knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem and standard RL problems is that the realized reward feedback from conversion incrementality is mixed and delayed. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.
    SMPL: Simulated Industrial Manufacturing and Process Control Learning Environments. (arXiv:2206.08851v2 [cs.LG] UPDATED)
    Traditional biological and pharmaceutical manufacturing plants are controlled by human workers or pre-defined thresholds. Modernized factories have advanced process control algorithms such as model predictive control (MPC). However, there is little exploration of applying deep reinforcement learning to control manufacturing plants. One of the reasons is the lack of high fidelity simulations and standard APIs for benchmarking. To bridge this gap, we develop an easy-to-use library that includes five high-fidelity simulation environments: BeerFMTEnv, ReactorEnv, AtropineEnv, PenSimEnv and mAbEnv, which cover a wide range of manufacturing processes. We build these environments on published dynamics models. Furthermore, we benchmark online and offline, model-based and model-free reinforcement learning algorithms for comparisons of follow-up research.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v4 [cs.LG] UPDATED)
    We propose a recipe on how to build a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications, but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being local, global or relative. Prior GTs are constrained to small graphs with a few hundred nodes; here we propose the first architecture with a complexity linear in the number of nodes and edges, $O(N+E)$, by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect the expressivity, with our architecture being a universal function approximator on graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We provide a modular framework, GraphGPS, that supports multiple types of encodings and that provides efficiency and scalability both in small and large graphs. We test our architecture on 16 benchmarks and show highly competitive results in all of them, showcasing the empirical benefits gained by the modularity and the combination of different strategies.
    Investigation of a Machine learning methodology for the SKA pulsar search pipeline. (arXiv:2209.04430v3 [astro-ph.IM] UPDATED)
    The SKA pulsar search pipeline will be used for real time detection of pulsars. Modern radio telescopes such as SKA will be generating petabytes of data in their full scale of operation. Hence experience-based and data-driven algorithms become indispensable for applications such as candidate detection. Here we describe our findings from testing a state of the art object detection algorithm called Mask R-CNN to detect candidate signatures in the SKA pulsar search pipeline. We have trained the Mask R-CNN model to detect candidate images. A custom annotation tool was developed to mark the regions of interest in large datasets efficiently. We have successfully demonstrated this algorithm by detecting candidate signatures on a simulation dataset. The paper presents details of this work with a highlight on the future prospects.
    Tighter Regret Analysis and Optimization of Online Federated Learning. (arXiv:2205.06491v3 [cs.LG] UPDATED)
    In federated learning (FL), it is commonly assumed that all data are placed at clients in the beginning of machine learning (ML) optimization (i.e., offline learning). However, in many real-world applications, it is expected to proceed in an online fashion. To this end, online FL (OFL) has been introduced, which aims at learning a sequence of global models from decentralized streaming data such that the so-called cumulative regret is minimized. Combining online gradient descent and model averaging, in this framework, FedOGD is constructed as the counterpart of FedSGD in FL. While it can enjoy an optimal sublinear regret, FedOGD suffers from heavy communication costs. In this paper, we present a communication-efficient method (named OFedIQ) by means of intermittent transmission (enabled by client subsampling and periodic transmission) and quantization. For the first time, we derive the regret bound that captures the impact of data-heterogeneity and the communication-efficient techniques. Through this, we efficiently optimize the parameters of OFedIQ such as sampling rate, transmission period, and quantization levels. Also, it is proved that the optimized OFedIQ can asymptotically achieve the performance of FedOGD while reducing the communication costs by 99%. Via experiments with real datasets, we demonstrate the effectiveness of the optimized OFedIQ.
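    The two communication-saving ingredients named above, intermittent transmission and quantization, can be illustrated with a minimal sketch. It is not the OFedIQ algorithm itself: the unbiased stochastic-rounding quantizer, the rule "transmit only every `period` rounds and only if sampled", and all function and parameter names are assumptions made for illustration.

```python
import math
import random


def stochastic_quantize(x, levels, lo, hi, rng):
    """Round x (clipped to [lo, hi]) to one of `levels` evenly spaced points.

    Rounding up happens with probability equal to the fractional position,
    so the quantized value equals the clipped x in expectation.
    """
    step = (hi - lo) / (levels - 1)
    pos = (min(max(x, lo), hi) - lo) / step
    lower = math.floor(pos)
    frac = pos - lower
    idx = lower + (1 if rng.random() < frac else 0)
    return lo + idx * step


def should_transmit(round_idx, period, sample_prob, rng):
    """Hypothetical rule: send only on every `period`-th round, and then
    only if this client is sampled with probability `sample_prob`."""
    return round_idx % period == 0 and rng.random() < sample_prob


rng = random.Random(0)
print(stochastic_quantize(0.3, levels=5, lo=0.0, hi=1.0, rng=rng))
print(should_transmit(round_idx=4, period=2, sample_prob=0.5, rng=rng))
```

    Fewer quantization levels, a longer period, or a lower sampling probability all reduce traffic at the cost of noisier updates, which is the trade-off the regret bound in the abstract is said to capture.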
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.
    Bugs in Machine Learning-based Systems: A Faultload Benchmark. (arXiv:2206.12311v2 [cs.SE] UPDATED)
    The rapid growth in applications of Machine Learning (ML) across domains has drawn more attention to the quality of ML components. Consequently, a growing number of techniques and tools aim to improve the quality of ML components and integrate them safely into ML-based systems. Although most of these tools use bugs' lifecycle, there is no standard benchmark of bugs to assess their performance, compare them and discuss their advantages and weaknesses. In this study, we first investigate the reproducibility and verifiability of the bugs in ML-based systems and show the most important factors in each one. Then, we explore the challenges of generating a benchmark of bugs in ML-based software systems and provide a bug benchmark, namely defect4ML, that satisfies all criteria of a standard benchmark, i.e., relevance, reproducibility, fairness, verifiability, and usability. This faultload benchmark contains 100 bugs reported by ML developers in GitHub and Stack Overflow, using two of the most popular ML frameworks: TensorFlow and Keras. defect4ML also addresses important challenges in Software Reliability Engineering of ML-based software systems, such as: 1) fast changes in frameworks, by providing various bugs for different versions of frameworks, 2) code portability, by delivering similar bugs in different ML frameworks, 3) bug reproducibility, by providing fully reproducible bugs with complete information about required dependencies and data, and 4) lack of detailed information on bugs, by presenting links to the bugs' origins. defect4ML can be of interest to ML-based systems practitioners and researchers to assess their testing tools and techniques.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min \left( H^{3/2} / N, H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
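    To make the three rates quoted above concrete, the following sketch evaluates them for a given horizon $H$ and dataset size $N$. Constants and the logarithmic factors hidden by $\widetilde{O}$ are dropped, so only the relative scaling is meaningful; the function names are hypothetical.

```python
import math


def bc_gap(H, N):
    """Behavioral cloning rate H^2 / N (constants dropped)."""
    return H ** 2 / N


def moment_matching_gap(H, N):
    """Online moment matching rate H / sqrt(N) (constants dropped)."""
    return H / math.sqrt(N)


def replay_estimation_gap(H, N):
    """Rate min(H^{3/2} / N, H / sqrt(N)) (constants and logs dropped)."""
    return min(H ** 1.5 / N, H / math.sqrt(N))


for H, N in [(10, 100), (100, 10_000), (100, 1_000_000)]:
    print(H, N, bc_gap(H, N), moment_matching_gap(H, N), replay_estimation_gap(H, N))
```

    Because of the minimum, the third rate is never larger than the moment-matching rate, and since $H^{3/2} \le H^2$ for $H \ge 1$ it is also never larger than the behavioral cloning rate.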
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.
    TaSIL: Taylor Series Imitation Learning. (arXiv:2205.14812v2 [cs.LG] UPDATED)
    We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of $\textit{incremental input-to-state stability}$ are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy. We provide sample-complexity bounds for TaSIL that scale as $\tilde{\mathcal{O}}(1/n)$ in the realizable setting, for $n$ the number of expert demonstrations. Finally, we demonstrate experimentally the relationship between the robustness of the expert policy and the order of Taylor expansion required in TaSIL, and compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks.
    Bayesian Active Learning with Fully Bayesian Gaussian Processes. (arXiv:2205.10186v3 [cs.LG] UPDATED)
    The bias-variance trade-off is a well-known problem in machine learning that only gets more pronounced the less available data there is. In active learning, where labeled data is scarce or difficult to obtain, neglecting this trade-off can cause inefficient and non-optimal querying, leading to unnecessary data labeling. In this paper, we focus on active learning with Gaussian Processes (GPs). For the GP, the bias-variance trade-off is made by optimization of the two hyperparameters: the length scale and noise-term. Considering that the optimal mode of the joint posterior of the hyperparameters is equivalent to the optimal bias-variance trade-off, we approximate this joint posterior and utilize it to design two new acquisition functions. The first one is a Bayesian variant of Query-by-Committee (B-QBC), and the second is an extension that explicitly minimizes the predictive variance through a Query by Mixture of Gaussian Processes (QB-MGP) formulation. Across six simulators, we empirically show that B-QBC, on average, achieves the best marginal likelihood, whereas QB-MGP achieves the best predictive performance. We show that incorporating the bias-variance trade-off in the acquisition functions mitigates unnecessary and expensive data labeling.
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.
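    For reference, the classical Weisfeiler-Leman colour refinement that the abstract compares against can be sketched as below. This is the standard 1-WL procedure, not the walk-based refinement proposed in the paper; the adjacency-dictionary input format and the integer initial labels are assumptions for illustration.

```python
def wl_refine(adj, labels, rounds):
    """Run `rounds` iterations of 1-WL colour refinement.

    adj:    dict mapping each node to a list of its neighbours
    labels: dict mapping each node to an integer initial colour
    Returns a dict mapping each node to its refined integer colour.
    """
    colors = dict(labels)
    for _ in range(rounds):
        # A node's signature is its own colour plus the sorted multiset
        # of its neighbours' colours.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Compress each distinct signature to a small integer colour.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return colors


# Path graph 0 - 1 - 2 with uniform initial colours.
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_refine(path, {0: 0, 1: 0, 2: 0}, rounds=2))
```

    In the path example the two end nodes end up with the same colour and the middle node with a different one, because only their neighbourhood multisets differ.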
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.
    TransBoost: Improving the Best ImageNet Performance using Deep Transduction. (arXiv:2205.13331v4 [cs.CV] UPDATED)
    This paper deals with deep transductive learning, and proposes TransBoost as a procedure for fine-tuning any deep neural model to improve its performance on any (unlabeled) test set provided at training time. TransBoost is inspired by a large margin principle and is efficient and simple to use. Our method significantly improves the ImageNet classification performance on a wide range of architectures, such as ResNets, MobileNetV3-L, EfficientNetB0, ViT-S, and ConvNext-T, leading to state-of-the-art transductive performance. Additionally we show that TransBoost is effective on a wide variety of image classification datasets. The implementation of TransBoost is provided at: https://github.com/omerb01/TransBoost .
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v4 [cs.LG] UPDATED)
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct an extensive amount of experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over the baselines only taking advantage of image augmentation when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct the ablation study on multiple factors and demonstrate the properties of representations learned with different approaches.
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple to implement, black-box method, that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.
    Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language. (arXiv:2204.10429v2 [cs.NI] UPDATED)
    Goal-oriented semantic communication will be a pillar of next-generation wireless networks. Despite significant recent efforts in this area, most prior works are focused on specific data types (e.g., image or audio), and they ignore the goal and effectiveness aspects of semantic transmissions. In contrast, in this paper, a holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks in a dynamic environment. A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener. The speaker, acting as an observer of the environment, utilizes the beliefs to transmit an initial description of its observation (called event) to the listener. The listener is then able to infer on the transmitted description and complete it by adding related beliefs to the transmitted beliefs of the speaker. As such, the listener reconstructs the observed event based on the completed description, and it then takes appropriate action in the environment based on the reconstructed event. An optimization problem is defined to determine the perfect and abstract description of the events while minimizing the transmission and inference costs with constraints on the task execution time and belief efficiency. Then, a novel bottom-up curriculum learning (CL) framework based on reinforcement learning is proposed to solve the optimization problem and enable the speaker and listener to gradually identify the structure of the belief set and the perfect and abstract description of the events. Simulation results show that the proposed CL method outperforms traditional RL in terms of convergence time, task execution cost and time, reliability, and belief efficiency.
    Learning Neural Acoustic Fields. (arXiv:2204.00628v2 [cs.SD] UPDATED)
    Our environment is filled with rich and dynamic acoustic information. When we walk into a cathedral, the reverberations as much as appearance inform us of the sanctuary's wide open space. Similarly, as an object moves around us, we expect the sound emitted to also exhibit this movement. While recent advances in learned implicit functions have led to increasingly higher quality representations of the visual world, there have not been commensurate advances in learning spatial auditory representations. To address this gap, we introduce Neural Acoustic Fields (NAFs), an implicit representation that captures how sounds propagate in a physical scene. By modeling acoustic propagation in a scene as a linear time-invariant system, NAFs learn to continuously map all emitter and listener location pairs to a neural impulse response function that can then be applied to arbitrary sounds. We demonstrate that the continuous nature of NAFs enables us to render spatial acoustics for a listener at an arbitrary location, and can predict sound propagation at novel locations. We further show that the representation learned by NAFs can help improve visual learning with sparse views. Finally, we show that a representation informative of scene structure emerges during the learning of NAFs.
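    Under the linear time-invariant assumption mentioned above, applying an impulse response to a sound is a convolution. The sketch below shows that step with a hand-written discrete convolution; the impulse response here is a made-up list of numbers rather than the output of a trained NAF, and the function name is hypothetical.

```python
def apply_impulse_response(signal, impulse_response):
    """Convolve a dry signal with an impulse response (full convolution)."""
    if not signal or not impulse_response:
        return []
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            # Each input sample contributes a scaled, delayed copy of the response.
            out[i + j] += s * h
    return out


dry = [1.0, 0.0, 0.0, 0.5]
toy_response = [1.0, 0.5, 0.25]  # direct sound followed by two decaying echoes
print(apply_impulse_response(dry, toy_response))
```

    Because the response depends only on the emitter and listener locations, the same response can be reused for any input sound at that location pair.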
    Multi-sensor large-scale dataset for multi-view 3D reconstruction. (arXiv:2203.06111v2 [cs.CV] UPDATED)
    We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The data for each scene is obtained under a large number of lighting conditions, and the scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. Overall, we provide around 1.4 million images of 107 different scenes acquired at 14 lighting conditions from 100 viewing directions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms of different types and for other related tasks.
    Off-Policy Evaluation with Online Adaptation for Robot Exploration in Challenging Environments. (arXiv:2204.03140v2 [cs.RO] UPDATED)
    Autonomous exploration has many important applications. However, classic information gain-based or frontier-based exploration relies only on the robot's current state to determine the immediate exploration goal, which lacks the capability of predicting the value of future states and thus leads to inefficient exploration decisions. This paper presents a method to learn how "good" states are, measured by the state value function, to provide guidance for robot exploration in challenging real-world environments. We formulate our work as an off-policy evaluation (OPE) problem for robot exploration (OPERE). It consists of offline Monte-Carlo training on real-world data followed by online Temporal Difference (TD) adaptation to optimize the trained value estimator. We also design an intrinsic reward function based on sensor information coverage to enable the robot to gain more information with sparse extrinsic rewards. Results demonstrate that our method enables the robot to predict the value of future states so as to better guide robot exploration. The proposed algorithm achieves better prediction performance compared with other state-of-the-art OPE methods. To the best of our knowledge, this work is the first to demonstrate value function prediction on a real-world dataset for robot exploration in challenging subterranean and urban environments. More details and demo videos can be found at https://jeffreyyh.github.io/opere/.
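    The two learning signals named above, Monte-Carlo returns for offline training and a TD update for online adaptation, can be sketched in tabular form. The paper's value estimator is a learned model; the dictionary-based value table, the step size `alpha`, and the function names are simplifying assumptions.

```python
def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td0_update(value, state, reward, next_state, alpha, gamma):
    """One TD(0) step: move value[state] toward reward + gamma * value[next_state]."""
    target = reward + gamma * value[next_state]
    value[state] += alpha * (target - value[state])
    return value


print(monte_carlo_returns([1.0, 1.0, 1.0], gamma=0.5))
values = {"a": 0.0, "b": 1.0}
print(td0_update(values, "a", reward=1.0, next_state="b", alpha=0.5, gamma=0.5))
```

    Monte-Carlo targets need complete episodes, which suits logged data, whereas the TD update needs only a single transition and can therefore run while the robot is exploring.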
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.
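    One way to read "operates on sample covariance matrices as graphs" is to use the covariance matrix $C$ in place of the graph shift operator in a polynomial graph filter, $y = \sum_k h_k C^k x$. The sketch below assumes that polynomial form, which the abstract does not spell out; the filter taps `taps` and all function names are hypothetical.

```python
def sample_covariance(data):
    """Unbiased sample covariance of `data`, a list of n samples with m features each."""
    n, m = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in data]
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]


def matvec(C, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(c * xi for c, xi in zip(row, x)) for row in C]


def covariance_filter(C, x, taps):
    """Apply y = sum_k taps[k] * C^k x without forming matrix powers explicitly."""
    out = [0.0] * len(x)
    power = list(x)  # holds C^k x, starting with k = 0
    for h in taps:
        out = [o + h * p for o, p in zip(out, power)]
        power = matvec(C, power)
    return out


C = sample_covariance([[1.0, 2.0], [3.0, 6.0], [2.0, 1.0]])
print(covariance_filter(C, [1.0, -1.0], taps=[0.5, 0.1]))
```

    Since the taps are shared across all entries of $C$, the same filter can be applied to a covariance matrix of a different dimension, which is consistent with the transferability property described in the abstract.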
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics-induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own, causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in the literature for time-dependent Schrödinger and Burgers' equations.
    Adaptive Composite Online Optimization: Predictions in Static and Dynamic Environments. (arXiv:2205.00446v2 [math.OC] UPDATED)
    In the past few years, Online Convex Optimization (OCO) has received notable attention in the control literature thanks to its flexible real-time nature and powerful performance guarantees. In this paper, we propose new step-size rules and OCO algorithms that simultaneously exploit gradient predictions, function predictions and dynamics, features particularly pertinent to control applications. The proposed algorithms enjoy static and dynamic regret bounds in terms of the dynamics of the reference action sequence, gradient prediction error, and function prediction error, which are generalizations of known regularity measures from the literature. We present results for both convex and strongly convex costs. We validate the performance of the proposed algorithms in a trajectory tracking case study, as well as portfolio optimization using real-world datasets.
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important for, e.g., safety-critical ML applications or medical diagnostics. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.
    Smoothed Online Combinatorial Optimization Using Imperfect Predictions. (arXiv:2204.10979v2 [cs.LG] UPDATED)
    Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds. We study smoothed online combinatorial optimization problems when an imperfect predictive model is available, where the model can forecast the future cost functions with uncertainty. We show that using predictions to plan for a finite time horizon leads to regret dependent on the total predictive uncertainty and an additional switching cost. This observation suggests choosing a suitable planning window to balance between uncertainty and switching cost, which leads to an online algorithm with guarantees on the upper and lower bounds of the cumulative regret. Empirically, our algorithm shows a significant improvement in cumulative regret compared to other baselines in synthetic online distributed streaming problems.
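    The objective described above, a per-round cost plus a penalty for changing the decision between consecutive rounds, can be written down directly. In this sketch decisions are sets of chosen items and the switching penalty is the size of the symmetric difference; both choices, and the function name, are assumptions for illustration rather than the paper's setup.

```python
def smoothed_total_cost(decisions, cost_fns, switching_cost, initial):
    """Sum of per-round costs f_t(x_t) plus switching costs d(x_{t-1}, x_t)."""
    total = 0.0
    prev = initial
    for x, f in zip(decisions, cost_fns):
        total += f(x) + switching_cost(prev, x)
        prev = x
    return total


def set_difference_cost(a, b):
    """Hypothetical switching penalty: number of items added or removed."""
    return len(a ^ b)


decisions = [frozenset({1}), frozenset({1, 2}), frozenset({1, 2})]
cost_fns = [lambda s: float(len(s))] * 3  # toy cost: size of the chosen set
print(smoothed_total_cost(decisions, cost_fns, set_difference_cost, frozenset()))
```

    Holding a decision fixed for several rounds avoids the switching term, but it commits the learner to forecasts further into the future, which is the tension behind choosing a planning window.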
    3D-C2FT: Coarse-to-fine Transformer for Multi-view 3D Reconstruction. (arXiv:2205.14575v2 [cs.CV] UPDATED)
    Recently, the transformer model has been successfully employed for the multi-view 3D reconstruction problem. However, challenges remain in designing an attention mechanism to explore the multi-view features and exploit their relations for reinforcing the encoding-decoding modules. This paper proposes a new model, namely 3D coarse-to-fine transformer (3D-C2FT), by introducing a novel coarse-to-fine (C2F) attention mechanism for encoding multi-view features and rectifying defective 3D objects. The C2F attention mechanism enables the model to learn multi-view information flow and synthesize 3D surface correction in a coarse to fine-grained manner. The proposed model is evaluated on the ShapeNet and Multi-view Real-life datasets. Experimental results show that 3D-C2FT achieves notable results and outperforms several competing models on these datasets.
    Compositional Visual Generation with Composable Diffusion Models. (arXiv:2206.01714v6 [cs.CV] UPDATED)
    Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images given natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, such as confusing the attributes of different objects or relations between objects. In this paper, we propose an alternative structured approach for compositional generation using diffusion models. An image is generated by composing a set of diffusion models, with each of them modeling a certain component of the image. To do this, we interpret diffusion models as energy-based models in which the data distributions defined by the energy functions may be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen in training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach may be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that have been shown difficult for DALLE-2. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
    Scalable Decision-Focused Learning in Restless Multi-Armed Bandits with Application to Maternal and Child Health. (arXiv:2202.00916v3 [cs.LG] UPDATED)
    This paper studies restless multi-armed bandit (RMAB) problems with unknown arm transition dynamics but with known correlated arm features. The goal is to learn a model to predict transition dynamics given features, where the Whittle index policy solves the RMAB problems using predicted transitions. However, prior works often learn the model by maximizing the predictive accuracy instead of final RMAB solution quality, causing a mismatch between training and evaluation objectives. To address this shortcoming, we propose a novel approach for decision-focused learning in RMAB that directly trains the predictive model to maximize the Whittle index solution quality. We present three key contributions: (i) we establish differentiability of the Whittle index policy to support decision-focused learning; (ii) we significantly improve the scalability of decision-focused learning approaches in sequential problems, specifically RMAB problems; (iii) we apply our algorithm to a previously collected dataset of maternal and child health to demonstrate its performance. Indeed, our algorithm is the first for decision-focused learning in RMAB that scales to real-world problem sizes.
    Handling Bias in Toxic Speech Detection: A Survey. (arXiv:2202.00126v3 [cs.SI] UPDATED)
    Detecting online toxicity has always been a challenge due to its inherent subjectivity. Factors such as the context, geography, socio-political climate, and background of the producers and consumers of the posts play a crucial role in determining if the content can be flagged as toxic. Adoption of automated toxicity detection models in production can thus lead to a sidelining of the various groups they aim to help in the first place. It has piqued researchers' interest in examining unintended biases and their mitigation. Due to the nascent and multi-faceted nature of the work, complete literature is chaotic in its terminologies, techniques, and findings. In this paper, we put together a systematic study of the limitations and challenges of existing methods for mitigating bias in toxicity detection. We look closely at proposed methods for evaluating and mitigating bias in toxic speech detection. To examine the limitations of existing methods, we also conduct a case study to introduce the concept of bias shift due to knowledge-based bias mitigation. The survey concludes with an overview of the critical challenges, research gaps, and future directions. While reducing toxicity on online platforms continues to be an active area of research, a systematic study of various biases and their mitigation strategies will help the research community produce robust and fair models.
    A Ranking Game for Imitation Learning. (arXiv:2202.03481v3 [cs.LG] UPDATED)
    We propose a new framework for imitation learning -- treating imitation as a two-player ranking-based game between a policy and a reward. In this game, the reward agent learns to satisfy pairwise performance rankings between behaviors, while the policy agent learns to maximize this reward. In imitation learning, near-optimal expert data can be difficult to obtain, and even in the limit of infinite data cannot imply a total ordering over trajectories as preferences can. On the other hand, learning from preferences alone is challenging as a large number of preferences are required to infer a high-dimensional reward function, though preference data is typically much easier to collect than expert demonstrations. The classical inverse reinforcement learning (IRL) formulation learns from expert demonstrations but provides no mechanism to incorporate learning from offline preferences and vice versa. We instantiate the proposed ranking-game framework with a novel ranking loss giving an algorithm that can simultaneously learn from expert demonstrations and preferences, gaining the advantages of both modalities. Our experiments show that the proposed method achieves state-of-the-art sample efficiency and can solve previously unsolvable tasks in the Learning from Observation (LfO) setting. Project video and code can be found at https://hari-sikchi.github.io/rank-game/
    Dynamic Combination of Heterogeneous Models for Hierarchical Time Series. (arXiv:2112.11669v2 [cs.LG] UPDATED)
    We introduce a framework to dynamically combine heterogeneous models called DYCHEM, which forecasts a set of time series that are related through an aggregation hierarchy. Different types of forecasting models can be employed as individual "experts" so that each model is tailored to the nature of the corresponding time series. DYCHEM learns hierarchical structures during the training stage to help generalize better across all the time series being modeled and also mitigates coherency issues that arise due to constraints imposed by the hierarchy. To improve the reliability of forecasts, we construct quantile estimations based on the point forecasts obtained from combined heterogeneous models. The resulting quantile forecasts are coherent and independent of the choice of forecasting models. We conduct a comprehensive evaluation of both point and quantile forecasts for hierarchical time series (HTS), including public data and user records from a large financial software company. In general, our method is robust, adaptive to datasets with different properties, and highly configurable and efficient for large-scale forecasting pipelines.
    $A^{3}D$: A Platform of Searching for Robust Neural Architectures and Efficient Adversarial Attacks. (arXiv:2203.03128v2 [cs.LG] UPDATED)
    The robustness of deep neural network (DNN) models has attracted increasing attention due to the urgent need for security in many applications. Numerous existing open-source tools and platforms have been developed to evaluate the robustness of DNN models by ensembling the majority of adversarial attack or defense algorithms. Unfortunately, current platforms do not possess the ability to optimize the architectures of DNN models or the configuration of adversarial attacks to further enhance the robustness of models or the performance of adversarial attacks. To alleviate these problems, in this paper, we first propose a novel platform called auto adversarial attack and defense ($A^{3}D$), which can help search for robust neural network architectures and efficient adversarial attacks. In $A^{3}D$, we employ multiple neural architecture search methods, which consider different robustness evaluation metrics, including four types of noises: adversarial noise, natural noise, system noise, and quantified metrics, resulting in finding robust architectures. Besides, we propose a mathematical model for auto adversarial attack, and provide multiple optimization algorithms to search for efficient adversarial attacks. In addition, we combine auto adversarial attack and defense together to form a unified framework. In auto adversarial defense, the searched efficient attack can be used as the new robustness evaluation to further enhance the robustness. In auto adversarial attack, the searched robust architectures can be utilized as the threat model to help find stronger adversarial attacks. Experiments on CIFAR10, CIFAR100, and ImageNet datasets demonstrate the feasibility and effectiveness of the proposed platform, which can also provide a benchmark and toolkit for researchers in the application of automated machine learning in evaluating and improving DNN model robustness.
    One-Step Abductive Multi-Target Learning with Diverse Noisy Samples and Its Application to Tumour Segmentation for Breast Cancer. (arXiv:2110.10325v8 [cs.LG] UPDATED)
    Recent studies have demonstrated the effectiveness of the combination of machine learning and logical reasoning, including data-driven logical reasoning, knowledge-driven machine learning and abductive learning, in inventing advanced artificial intelligence technologies. One-step abductive multi-target learning (OSAMTL), an approach inspired by abductive learning that simply combines machine learning and logical reasoning in a one-step balanced way, has also shown its effectiveness in handling complex noisy labels of a single noisy sample in medical histopathology whole slide image analysis (MHWSIA). However, OSAMTL is not suitable for the situation where diverse noisy samples (DiNS) are provided for a learning task. In this paper, after giving a definition of DiNS, we propose one-step abductive multi-target learning with DiNS (OSAMTL-DiNS) to expand the original OSAMTL to handle complex noisy labels of DiNS. Applying OSAMTL-DiNS to tumour segmentation for breast cancer in MHWSIA, we show that OSAMTL-DiNS enables various state-of-the-art approaches for learning from noisy labels to achieve more rational predictions.
    Gap Minimization for Knowledge Sharing and Transfer. (arXiv:2201.11231v2 [cs.LG] UPDATED)
    Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of performance gap, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g., $\mathcal{H}$-divergence or discrepancy distance), we theoretically show that the performance gap can be viewed as a data- and algorithm-dependent regularizer, which controls the model complexity and leads to finer guarantees. More importantly, it also provides new insights and motivates a novel principle for designing strategies for knowledge sharing and transfer: gap minimization. We instantiate this principle with two algorithms: 1. gapBoost, a novel and principled boosting algorithm that explicitly minimizes the performance gap between source and target domains for transfer learning; and 2. gapMTNN, a representation learning algorithm that reformulates gap minimization as semantic conditional matching for multitask learning. Our extensive evaluation on both transfer learning and multitask learning benchmark data sets shows that our methods outperform existing baselines.
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide an in-depth study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorrelation times. Although this function's performance is slightly inferior to that of the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Finally, we propose a scheme which incorporates partial heat-bath updates and considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as guidance for the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.
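    For readers unfamiliar with the Metropolized Independent Sampler mentioned above, here is a minimal sketch of its accept/reject step on a small discrete state space. In NMCMC the proposal distribution would come from a neural network; here `log_q` is a plain user-supplied function, and all names are hypothetical choices for illustration, not the paper's code.

```python
import math
import random

def mis_accept_prob(logp_cur, logq_cur, logp_new, logq_new):
    """Acceptance probability min(1, p(s')q(s) / (p(s)q(s'))) in log space."""
    log_ratio = (logp_new - logq_new) - (logp_cur - logq_cur)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def run_chain(states, log_p, log_q, n_steps, seed=0):
    """Run an independence Metropolis chain; log_p may be unnormalized."""
    rng = random.Random(seed)
    proposal_weights = [math.exp(log_q(s)) for s in states]
    cur = rng.choices(states, weights=proposal_weights)[0]
    chain = []
    for _ in range(n_steps):
        # Proposals are drawn independently of the current state.
        new = rng.choices(states, weights=proposal_weights)[0]
        a = mis_accept_prob(log_p(cur), log_q(cur), log_p(new), log_q(new))
        if rng.random() < a:
            cur = new
        chain.append(cur)
    return chain
```

    Rejected proposals repeat the current state in the chain, which is the source of the autocorrelations studied in the abstract; the closer the proposal is to the target, the fewer repeats occur.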
    Federated Learning with Heterogeneous Differential Privacy. (arXiv:2110.15252v2 [cs.LG] UPDATED)
    Federated learning (FL) takes a first step towards privacy-preserving machine learning by training models while keeping client data local. Models trained using FL may still leak private client information through model updates during training. Differential privacy (DP) may be employed on model updates to provide privacy guarantees within FL, typically at the cost of degraded performance of the final trained model. Both non-private FL and DP-FL can be solved using variants of the federated averaging (FedAvg) algorithm. In this work, we consider a heterogeneous DP setup where clients require varying degrees of privacy guarantees. First, we analyze the optimal solution to the federated linear regression problem with heterogeneous DP in a Bayesian setup. We find that unlike the non-private setup, where the optimal solution for homogeneous data amounts to a single global solution for all clients learned through FedAvg, the optimal solution for each client in this setup would be a personalized one even for homogeneous data. We also analyze the privacy-utility trade-off for this setup, where we characterize the gain obtained from heterogeneous privacy where some clients opt for less strict privacy guarantees. We propose a new algorithm for FL with heterogeneous DP, named FedHDP, which employs personalization and weighted averaging at the server using the privacy choices of clients, to achieve better performance on clients' local models. Through numerical experiments, we show that FedHDP provides up to $9.27\%$ performance gain compared to the baseline DP-FL for the considered datasets where $5\%$ of clients opt out of DP. Additionally, we show a gap in the average performance of local models between non-private and private clients of up to $3.49\%$, empirically illustrating that the baseline DP-FL might incur a large utility cost when not all clients require the stricter privacy guarantees.
    MANDERA: Malicious Node Detection in Federated Learning via Ranking. (arXiv:2110.11736v2 [cs.LG] UPDATED)
    Byzantine attacks hinder the deployment of federated learning algorithms. Although we know that the benign gradients and Byzantine attacked gradients are distributed differently, detecting the malicious gradients is challenging because (1) the gradient is high-dimensional and each dimension has its own distribution and (2) the benign gradients and the attacked gradients are always mixed (two-sample test methods cannot be applied directly). To address the above, for the first time, we propose MANDERA, which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we transform the original updating gradient space into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical. The high-dimensional benign gradients and the malicious gradients can then be easily separated. The effectiveness of MANDERA is further confirmed by experimentation on four Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), in comparison with state-of-the-art defenses. The experiments cover both IID and Non-IID datasets.
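    The ranking step described above can be sketched as follows: each gradient dimension is replaced by the rank of each node's value within that dimension, which puts all dimensions on the same scale. Summarizing each node by the mean and standard deviation of its ranks is an assumption made here for illustration; the abstract does not say which statistics or which separation procedure MANDERA uses.

```python
import numpy as np

def rank_features(gradients):
    """Map a (nodes x dimensions) gradient matrix to per-node rank summaries.

    Returns an array of shape (nodes, 2): mean rank and rank standard
    deviation for each node (hypothetical summary statistics).
    """
    g = np.asarray(gradients, dtype=float)
    # Double argsort gives, per column, the 0-based rank of each node's value;
    # ties are broken by position rather than averaged.
    ranks = g.argsort(axis=0).argsort(axis=0) + 1
    return np.column_stack([ranks.mean(axis=1), ranks.std(axis=1)])
```

    A node whose gradient is shifted in every dimension ends up with an extreme mean rank and little rank variation, so a simple two-group clustering on these features could, under these assumptions, flag it.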
    A Deep Reinforcement Learning Approach for Online Parcel Assignment. (arXiv:2109.03467v2 [cs.LG] UPDATED)
    In this paper, we investigate the online parcel assignment (OPA) problem, in which each stochastically generated parcel needs to be assigned to a candidate route for delivery to minimize the total cost subject to certain business constraints. The OPA problem is challenging due to its stochastic nature: each parcel's candidate routes, which depend on the parcel's origin, destination, weight, etc., are unknown until its order is placed, and the total parcel volume is uncertain in advance. To tackle this challenge, we propose the PPO-OPA algorithm based on deep reinforcement learning that shows competitive performance. More specifically, we introduce a novel Markov Decision Process (MDP) framework to model the OPA problem, and develop a policy gradient algorithm that adopts attention networks for policy evaluation. By designing a dedicated reward function, our proposed algorithm can achieve a lower total cost with smaller violation of constraints, compared to the traditional method which assigns parcels to candidate routes proportionally. In addition, the performances of our proposed algorithm and the Primal-Dual algorithm are comparable, while the latter assumes a known total parcel volume in advance, which is unrealistic in practice.
    First-Order Algorithms for Nonlinear Generalized Nash Equilibrium Problems. (arXiv:2204.03132v2 [math.OC] UPDATED)
    We consider the problem of computing an equilibrium in a class of nonlinear generalized Nash equilibrium problems (NGNEPs) in which the strategy sets for each player are defined by the equality and inequality constraints that may depend on the choices of rival players. While the asymptotic global convergence and local convergence rate of certain algorithms have been extensively investigated, the iteration complexity analysis is still in its infancy. This paper provides two first-order algorithms based on quadratic penalty method (QPM) and augmented Lagrangian method (ALM), respectively, with an accelerated mirror-prox algorithm as the solver in each inner loop. We show the nonasymptotic convergence rate for these algorithms. In particular, we establish the global convergence guarantee for solving monotone and strongly monotone NGNEPs and provide the complexity bounds expressed in terms of the number of gradient evaluations. Experimental results demonstrate the efficiency of our algorithms in practice.
    GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy. (arXiv:2104.10569v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been demonstrated as a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. Herein, we present GraphTheta, the first distributed and scalable graph learning system built upon vertex-centric distributed graph processing with neural network operators implemented as user-defined functions. This system supports multiple training strategies and enables efficient and scalable big-graph learning on distributed (virtual) machines with low memory. To facilitate graph convolutions, GraphTheta puts forward a new graph learning abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct the stochastic gradient descent optimization with a hybrid-parallel execution, and a new cluster-batched training strategy is supported. We evaluate GraphTheta using several datasets with network sizes ranging from small-, modest- to large-scale. Experimental results show that GraphTheta can scale well to 1,024 workers for training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, with a cluster of CPU virtual machines (dockers) of small memory each (5$\sim$12GB). Moreover, GraphTheta can outperform DistDGL by up to $2.02\times$, with better scalability, and GraphLearn by up to $30.56\times$. As for model accuracy, GraphTheta is capable of learning as good GNNs as existing frameworks. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev-Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.
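    For reference, the classical (non-parametric) Chebyshev-Cantelli inequality that the abstract builds on is stated below for a random variable $X$ with finite variance and any $\varepsilon > 0$. The paper's parametric form is not given in the abstract and is not reproduced here.

```latex
\[
  \Pr\bigl( X - \mathbb{E}[X] \ge \varepsilon \bigr)
  \;\le\;
  \frac{\operatorname{Var}[X]}{\operatorname{Var}[X] + \varepsilon^{2}},
  \qquad \varepsilon > 0 .
\]
```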
    The Prominence of Artificial Intelligence in COVID-19. (arXiv:2111.09537v2 [cs.LG] UPDATED)
    Since its emergence in December 2019, the novel virus behind COVID-19 has caused an enormous number of casualties to date. The battle with the novel coronavirus has been the most baffling and horrifying since the Spanish flu of 1918. While the front-line doctors and medical researchers have made significant progress in controlling the spread of the highly contagious virus, technology has also proved its significance in the battle. Moreover, Artificial Intelligence has been adopted in many medical applications to diagnose many diseases, even ones that baffle experienced doctors. Therefore, this survey paper explores the proposed methodologies that can aid doctors and researchers in early and inexpensive diagnosis of the disease. Most developing countries have difficulties carrying out tests in the conventional manner, but Machine and Deep Learning offer a significant alternative. In addition, access to different types of medical images has motivated researchers. As a result, a mammoth number of techniques have been proposed. This paper first details the background knowledge of the conventional methods in the Artificial Intelligence domain. Following that, we gather the commonly used datasets and their use cases to date. In addition, we also show the percentage of researchers adopting Machine Learning over Deep Learning. Thus we provide a thorough analysis of this scenario. Lastly, in the research challenges, we elaborate on the problems faced in COVID-19 research, and we address the issues with our understanding to build a bright and healthy environment.
    SITTA: Single Image Texture Translation for Data Augmentation. (arXiv:2106.13804v2 [cs.CV] UPDATED)
    Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of image synthesis methods for recognition tasks. In this paper, we propose and explore the problem of image translation for data augmentation. We first propose a lightweight yet efficient model for translating texture to augment images based on a single input of source texture, allowing for fast training and testing, referred to as Single Image Texture Translation for data Augmentation (SITTA). Then we explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed augmentation method and workflow is capable of translating the texture of input data into a target domain, leading to consistently improved image recognition performance. Finally, we examine how SITTA and related image translation methods can provide a basis for a data-efficient, "augmentation engineering" approach to model training. Codes are available at https://github.com/Boyiliee/SITTA.
    Pruning Edges and Gradients to Learn Hypergraphs from Larger Sets. (arXiv:2106.13919v2 [cs.LG] UPDATED)
    This paper aims for set-to-hypergraph prediction, where the goal is to infer the set of relations for a given set of entities. This is a common abstraction for applications in particle physics, biological systems, and combinatorial optimization. We address two common scaling problems encountered in set-to-hypergraph tasks that limit the size of the input set: the exponentially growing number of hyperedges and the run-time complexity, both leading to higher memory requirements. We make three contributions. First, we propose to predict and supervise the positive edges only, which changes the asymptotic memory scaling from exponential to linear. Second, we introduce a training method that encourages iterative refinement of the predicted hypergraph, which allows us to skip iterations in the backward pass for improved efficiency and constant memory usage. Third, we combine both contributions in a single set-to-hypergraph model that enables us to address problems with larger input set sizes. We provide ablations for our main technical contributions and show that our model outperforms prior state-of-the-art, especially for larger sets.
    Genetic algorithm for feature selection of EEG heterogeneous data. (arXiv:2103.07117v2 [cs.NE] UPDATED)
    The electroencephalographic (EEG) signals provide highly informative data on brain activities and functions. However, their heterogeneity and high dimensionality may represent an obstacle for their interpretation. The introduction of a priori knowledge seems the best option to mitigate high dimensionality problems, but could lose some information and patterns present in the data, while data heterogeneity remains an open issue that often makes generalization difficult. In this study, we propose a genetic algorithm (GA) for feature selection that can be used with a supervised or unsupervised approach. Our proposal considers three different fitness functions without relying on expert knowledge. Starting from two publicly available datasets on cognitive workload and motor movement/imagery, the EEG signals are processed, normalized and their features computed in the time, frequency and time-frequency domains. The feature vector selection is performed by applying our GA proposal and compared with two benchmarking techniques. The results show that different combinations of our proposal achieve better results in respect to the benchmark in terms of overall performance and feature reduction. Moreover, the proposed GA, based on a novel fitness function here presented, outperforms the benchmark when the two different datasets considered are merged together, showing the effectiveness of our proposal on heterogeneous data.
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).
    Labels, Information, and Computation: Efficient Learning Using Sufficient Labels. (arXiv:2104.09015v3 [cs.LG] UPDATED)
    In supervised learning, obtaining a large set of fully-labeled training data is expensive. We show that we do not always need full label information on every single training example to train a competent classifier. Specifically, inspired by the principle of sufficiency in statistics, we present a statistic (a summary) of the fully-labeled training set that captures almost all the relevant information for classification but at the same time is easier to obtain directly. We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency for finding the optimal hidden representations, on which competent classifier heads can be trained using as few as a single randomly-chosen fully-labeled example per class. Sufficiently-labeled data can be obtained from annotators directly without collecting the fully-labeled data first. And we prove that it is easier to directly obtain sufficiently-labeled data than obtaining fully-labeled data. Furthermore, sufficiently-labeled data is naturally more secure since it stores relative, instead of absolute, information. Extensive experimental results are provided to support our theory.
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree-based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. Code is available at https://github.com/PlantedML/randomPlantedForest
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing what they do and why they do it, thus preventing the development of any crucial safety protocols. To date, no method exists that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their broader use in serious applications, where 'why' is as important as 'what'.
    On the Sample Complexity of Stability Constrained Imitation Learning. (arXiv:2102.09161v3 [cs.LG] UPDATED)
    We study the following question in the context of imitation learning for continuous control: how are the underlying stability properties of an expert policy reflected in the sample-complexity of an imitation learning task? We provide the first results showing that a surprisingly granular connection can be made between the underlying expert system's incremental gain stability, a novel measure of robust convergence between pairs of system trajectories, and the dependency on the task horizon $T$ of the resulting generalization bounds. In particular, we propose and analyze incremental gain stability constrained versions of behavior cloning and a DAgger-like algorithm, and show that the resulting sample-complexity bounds naturally reflect the underlying stability properties of the expert system. As a special case, we delineate a class of systems for which the number of trajectories needed to achieve $\varepsilon$-suboptimality is sublinear in the task horizon $T$, and do so without requiring (strong) convexity of the loss function in the policy parameters. Finally, we conduct numerical experiments demonstrating the validity of our insights on both a simple nonlinear system for which the underlying stability properties can be easily tuned, and on a high-dimensional quadrupedal robotic simulation.
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.
    Sustainable Recreational Fishing Using a Novel Electrical Muscle Stimulation (EMS) Lure and Ensemble Network Algorithm to Maximize Catch and Release Survivability. (arXiv:2006.10125v2 [cs.CV] UPDATED)
    With 200-700 million anglers in the world, sportfishing is nearly five times more common than commercial trawling. Worldwide, hundreds of thousands of jobs are linked to the sportfishing industry, which generates billions of dollars for water-side communities and fisheries conservatories alike. However, the sheer popularity of recreational fishing poses threats to aquatic biodiversity that are hard to regulate. For example, as much as 25% of overfished populations can be traced to anglers. This alarming statistic is explained by the average catch and release mortality rate of 43%, which primarily results from hook-related injuries and careless out-of-water handling. The provisional-patented design proposed in this paper addresses both these problems separately. First, a novel, electrical muscle stimulation based fishing lure is proposed as a harmless and low cost alternative to sharp hooks. Early prototypes show a constant electrical current of 90 mA applied through a 200 g European perch's jaw can support a reeling tension of 2 N, safely within the necessary ranges. Second, a fish-eye camera bob is designed to wirelessly relay underwater footage to a smartphone app, where an ensemble convolutional neural network automatically classifies the fish's species, estimates its length, and cross references with local and state fishing regulations (i.e., minimum size, maximum bag limit, and catch season). This capability reduces overfishing by helping anglers avoid accidentally violating guidelines and eliminates the need to reel the fish in and expose it to negligent handling. In conjunction, this cheap, lightweight, yet high-tech invention is a paradigm shift in preserving a world favorite pastime, while at the same time making recreational fishing more sustainable.
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to ensure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm, Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm ensures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to the non-stationary bandits. Finally, experiments illustrate and complete the analysis.
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality with respect to the observed actions. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part I. (arXiv:2007.00118v4 [math.FA] UPDATED)
    We study the approximation of functions by tensor networks (TNs). We show that Lebesgue $L^p$-spaces in one dimension can be identified with tensor product spaces of arbitrary order through tensorization. We use this tensor product structure to define subsets of $L^p$ of rank-structured functions of finite representation complexity. These subsets are then used to define different approximation classes of tensor networks, associated with different measures of complexity. These approximation classes are shown to be quasi-normed linear spaces. We study some elementary properties and relationships of said spaces. In part II of this work, we will show that classical smoothness (Besov) spaces are continuously embedded into these approximation classes. We will also show that functions in these approximation classes do not possess any Besov smoothness, unless one restricts the depth of the tensor networks. The results of this work are both an analysis of the approximation spaces of TNs and a study of the expressivity of a particular type of neural network (NN), namely feed-forward sum-product networks with sparse architecture. The input variables of this network result from the tensorization step, interpreted as a particular featuring step which can also be implemented with a neural network with a specific architecture. We point out interesting parallels to recent results on the expressivity of rectified linear unit (ReLU) networks, currently one of the most popular types of NNs.
    Approximation Theory of Tree Tensor Networks: Tensorized Univariate Functions -- Part II. (arXiv:2007.00128v4 [math.FA] UPDATED)
    We study the approximation by tensor networks (TNs) of functions from classical smoothness classes. The considered approximation tool combines a tensorization of functions in $L^p([0,1))$, which allows a univariate function to be identified with a multivariate function (or tensor), and the use of tree tensor networks (the tensor train format) for exploiting low-rank structures of multivariate functions. The resulting tool can be interpreted as a feed-forward neural network, with first layers implementing the tensorization, interpreted as a particular featuring step, followed by a sum-product network with sparse architecture. In part I of this work, we presented several approximation classes associated with different measures of complexity of tensor networks and studied their properties. In this work (part II), we show how classical approximation tools, such as polynomials or splines (with fixed or free knots), can be encoded as a tensor network with controlled complexity. We use this to derive direct (Jackson) inequalities for the approximation spaces of tensor networks. This is then utilized to show that Besov spaces are continuously embedded into these approximation spaces. In other words, we show that arbitrary Besov functions can be approximated with optimal or near-optimal rate. We also show that an arbitrary function in the approximation class possesses no Besov smoothness, unless one limits the depth of the tensor network.
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly off-policy evaluation is feasible, even with just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions and remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.
    Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain--Machine Interfaces. (arXiv:2004.11690v3 [eess.SP] UPDATED)
    Motor-Imagery Brain--Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNET, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr.Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we can obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. With 21.0 GMAC/s/W, it is 256x more energy-efficient than an EEGNET implementation on an ARM Cortex-M7 (0.082 GMAC/s/W).
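    As a rough illustration of what quantizing weights to 8 bits involves, the sketch below applies a generic symmetric, per-tensor scheme to a short list of values. It is not the Q-EEGNet procedure: the function names `quantize_int8` and `dequantize`, the choice of a single shared scale, and the sample weights are all assumptions made for this example.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit integers using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to make the rounding visible.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_err)
```

    With this scheme the reconstruction error of each value is bounded by half a quantization step (`scale / 2`), which is one way to see why the accuracy loss can stay small when the value range is well covered.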
    Approximation Theory of Tree Tensor Networks: Tensorized Multivariate Functions. (arXiv:2101.11932v2 [math.FA] UPDATED)
    We study the approximation of multivariate functions with tensor networks (TNs). The main conclusion of this work is an answer to the following two questions: "What are the approximation capabilities of TNs?" and "What is an appropriate model class of functions that can be approximated with TNs?" To answer the former: we show that TNs can (near to) optimally replicate $h$-uniform and $h$-adaptive approximation, for any smoothness order of the target function. Tensor networks thus exhibit universal expressivity w.r.t. isotropic, anisotropic and mixed smoothness spaces that is comparable with more general neural network families such as deep rectified linear unit (ReLU) networks. Put differently, TNs have the capacity to (near to) optimally approximate many function classes without being adapted to the particular class in question. To answer the latter: as a candidate model class we consider approximation classes of TNs and show that these are (quasi-)Banach spaces, that many types of classical smoothness spaces are continuously embedded into said approximation classes and that TN approximation classes are themselves not embedded in any classical smoothness space.
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.
    Learned Lifted Linearization Applied to Unstable Dynamic Systems Enabled by Koopman Direct Encoding. (arXiv:2210.13602v2 [cs.LG] UPDATED)
    This paper presents a Koopman lifting linearization method that is applicable to nonlinear dynamical systems having both stable and unstable regions. It is known that DMD and other standard data-driven methods face a fundamental difficulty in constructing a Koopman model when applied to unstable systems. Here we solve the problem by incorporating knowledge about a nonlinear state equation with a learning method for finding an effective set of observables. In a lifted space, stable and unstable regions are separated into independent subspaces. Based on this property, we propose to find effective observables through neural net training where training data are separated into stable and unstable trajectories. The resultant learned observables are used for constructing a linear state transition matrix using a method known as Direct Encoding, which transforms the nonlinear state equation to a state transition matrix through inner product computations with the observables. The proposed method shows a dramatic improvement over existing DMD and data-driven methods.
    Identifying Time Lag in Dynamical Systems with Copula Entropy based Transfer Entropy. (arXiv:2301.06037v1 [cs.LG])
    Time lag between variables is a key characteristic of dynamical systems in different fields, and identifying such time lag is a central problem in complex systems with many applications. Transfer Entropy (TE) was recently proposed as a tool for time lag identification. Unfortunately, estimating TE has been a notoriously difficult problem. Copula Entropy (CE) is a measure of statistical independence and it was proved that TE can be represented with only CE. Therefore, a non-parametric estimator of TE based on CE was recently proposed according to such representation. In this paper we propose to use the CE-based estimator of TE to identify time lag in dynamical systems. Both simulated and real data are used to verify the effectiveness of the proposed method in the experiments. Experimental results show that the proposed method can identify the time lags in the three simulated systems. The real data experiment with the data on power consumption of the Tetouan city also demonstrates that our method can identify the pattern of time lags through the estimated TE from the weather factors to the power consumption of the city.
    Pluto's Surface Mapping using Unsupervised Learning from Near-Infrared Observations of LEISA/Ralph. (arXiv:2301.06027v1 [astro-ph.EP])
    We map the surface of Pluto using an unsupervised machine learning technique using the near-infrared observations of the LEISA/Ralph instrument onboard NASA's New Horizons spacecraft. The principal component reduced Gaussian mixture model was implemented to investigate the geographic distribution of the surface units across the dwarf planet. We also present the likelihood of each surface unit at the image pixel level. Average I/F spectra of each unit were analyzed -- in terms of the position and strengths of absorption bands of abundant volatiles such as N${}_{2}$, CH${}_{4}$, and CO and nonvolatile H${}_{2}$O -- to connect the unit to surface composition, geology, and geographic location. The distribution of surface units shows a latitudinal pattern with distinct surface compositions of volatiles -- consistent with the existing literature. However, previous mapping efforts were based primarily on compositional analysis using spectral indices (indicators) or implementation of complex radiative transfer models, which need (prior) expert knowledge, label data, or optical constants of representative endmembers. We prove that an application of unsupervised learning in this instance renders a satisfactory result in mapping the spatial distribution of ice compositions without any prior information or label data. Thus, such an application is specifically advantageous for a planetary surface mapping when label data are poorly constrained or completely unknown, because an understanding of surface material distribution is vital for volatile transport modeling at the planetary scale. We emphasize that the unsupervised learning used in this study has wide applicability and can be expanded to other planetary bodies of the Solar System for mapping surface material distribution.
    Self-recovery of memory via generative replay. (arXiv:2301.06030v1 [cs.NE])
    A remarkable capacity of the brain is its ability to autonomously reorganize memories during offline periods. Memory replay, a mechanism hypothesized to underlie biological offline learning, has inspired offline methods for reducing forgetting in artificial neural networks in continual learning settings. A memory-efficient and neurally-plausible method is generative replay, which achieves state of the art performance on continual learning benchmarks. However, unlike the brain, standard generative replay does not self-reorganize memories when trained offline on its own replay samples. We propose a novel architecture that augments generative replay with an adaptive, brain-like capacity to autonomously recover memories. We demonstrate this capacity of the architecture across several continual learning tasks and environments.
    Semantic and Effective Communication for Remote Control Tasks with Dynamic Feature Compression. (arXiv:2301.05901v1 [cs.LG])
    The coordination of robotic swarms and the remote wireless control of industrial systems are among the major use cases for 5G and beyond systems: in these cases, the massive amounts of sensory information that need to be shared over the wireless medium can overload even high-capacity connections. Consequently, solving the effective communication problem by optimizing the transmission strategy to discard irrelevant information can provide a significant advantage, but is often a very complex task. In this work, we consider a prototypal system in which an observer must communicate its sensory data to an actor controlling a task (e.g., a mobile robot in a factory). We then model it as a remote Partially Observable Markov Decision Process (POMDP), considering the effect of adopting semantic and effective communication-oriented solutions on the overall system performance. We split the communication problem by considering an ensemble Vector Quantized Variational Autoencoder (VQ-VAE) encoding, and train a Deep Reinforcement Learning (DRL) agent to dynamically adapt the quantization level, considering both the current state of the environment and the memory of past messages. We tested the proposed approach on the well-known CartPole reference control problem, obtaining a significant performance increase over traditional approaches.
    An Accurate EEGNet-based Motor-Imagery Brain-Computer Interface for Low-Power Edge Computing. (arXiv:2004.00077v3 [eess.SP] UPDATED)
    This paper presents an accurate and robust embedded motor-imagery brain-computer interface (MI-BCI). The proposed novel model, based on EEGNet, matches the requirements of memory footprint and computational resources of low-power microcontroller units (MCUs), such as the ARM Cortex-M family. Furthermore, the paper presents a set of methods, including temporal downsampling, channel selection, and narrowing of the classification window, to further scale down the model to relax memory requirements with negligible accuracy degradation. Experimental results on the Physionet EEG Motor Movement/Imagery Dataset show that standard EEGNet achieves 82.43%, 75.07%, and 65.07% classification accuracy on 2-, 3-, and 4-class MI tasks in global validation, outperforming the state-of-the-art (SoA) convolutional neural network (CNN) by 2.05%, 5.25%, and 5.48%. Our novel method further scales down the standard EEGNet at a negligible accuracy loss of 0.31% with 7.6x memory footprint reduction and a small accuracy loss of 2.51% with 15x reduction. The scaled models are deployed on a commercial Cortex-M4F MCU taking 101ms and consuming 4.28mJ per inference for operating the smallest model, and on a Cortex-M7 with 44ms and 18.1mJ per inference for the medium-sized model, enabling a fully autonomous, wearable, and accurate low-power BCI.
    Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders. (arXiv:2210.16844v2 [cs.LG] UPDATED)
    Generative models for graph data are an important research topic in machine learning. Graph data comprise two levels that are typically analyzed separately: node-level properties such as the existence of a link between a pair of nodes, and global aggregate graph-level statistics, such as motif counts. This paper proposes a new multi-level framework that jointly models node-level properties and graph-level statistics, as mutually reinforcing sources of information. We introduce a new micro-macro training objective for graph generation that combines node-level and graph-level losses. We utilize the micro-macro objective to improve graph generation with a GraphVAE, a well-established model based on graph-level latent variables, that provides fast training and generation time for medium-sized graphs. Our experiments show that adding micro-macro modeling to the GraphVAE model improves graph quality scores up to 2 orders of magnitude on five benchmark datasets, while maintaining the GraphVAE generation speed advantage.
    EvoAAA: An evolutionary methodology for automated neural autoencoder architecture search. (arXiv:2301.06047v1 [cs.NE])
    Machine learning models work better when curated features are provided to them. Feature engineering methods have usually been used as a preprocessing step to obtain or build a proper feature set. In recent years, autoencoders (a specific type of symmetrical neural network) have been widely used to perform representation learning, proving their competitiveness against classical feature engineering algorithms. The main obstacle in the use of autoencoders is finding a good architecture, a process that most experts confront manually. An automated autoencoder architecture search procedure, based on evolutionary methods, is proposed in this paper. The methodology is tested against nine heterogeneous data sets. The obtained results show the ability of this approach to find better architectures, able to concentrate most of the useful information in a minimized coding, in a reduced time.
    Margin Optimal Classification Trees. (arXiv:2210.10567v4 [math.OC] UPDATED)
    In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
    A Review on the effectiveness of Dimensional Reduction with Computational Forensics: An Application on Malware Analysis. (arXiv:2301.06031v1 [cs.CR])
    The Android operating system is pervasively adopted as the operating system platform of choice for smart devices. However, the strong adoption has also resulted in exponential growth in the number of Android-based malicious software, or malware. To deal with such cyber threats as part of cyber investigation and digital forensics, computational techniques in the form of machine learning algorithms are applied for malware identification, detection and forensic analysis. However, such Computational Forensics modelling techniques are constrained by the volume, velocity, variety and veracity of the malware landscape. This in turn affects their identification and detection effectiveness, which raises the question of the sustainability of such an approach. One approach to optimise effectiveness is to apply dimensional reduction techniques like Principal Component Analysis with the intent to enhance algorithmic performance. In this paper, we evaluate the effectiveness of applying Principal Component Analysis to the Computational Forensics task of detecting Android-based malware. We tested our research hypothesis on three different datasets with different machine learning algorithms. Our results showed that the dimensionally reduced datasets led to a measure of degradation in accuracy.
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.
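    The alternation described above (a standard Sinkhorn update followed by a pointwise contractive map) can be sketched for one common special case. The sketch below assumes KL-penalized marginals with strength rho and entropic regularization eps, where the contractive step is usually written as raising the standard update to the power rho / (rho + eps). The function name, parameter names, and default values are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def unbalanced_sinkhorn(a, b, cost, eps=0.5, rho=1.0, iters=500):
    """Sketch of Sinkhorn scaling iterations with KL-relaxed marginals.

    a, b  : lists of positive masses (they need not sum to the same total)
    cost  : len(a) x len(b) nested list of ground costs
    eps   : entropic regularization strength (assumed name)
    rho   : marginal penalty strength; a very large rho approaches balanced OT
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    # Exponent < 1 is the pointwise contractive step; it tends to 1 as rho grows.
    power = rho / (rho + eps)
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Standard Sinkhorn update, then damped by the exponent.
        u = [(a[i] / sum(K[i][j] * v[j] for j in range(m))) ** power
             for i in range(n)]
        v = [(b[j] / sum(K[i][j] * u[i] for i in range(n))) ** power
             for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

    With a very large rho the exponent is close to 1 and the iterations reduce to ordinary balanced Sinkhorn; smaller values of rho let the plan's marginals deviate from a and b.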
    On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning. (arXiv:2301.06010v1 [cs.LG])
    When there are unlabeled Out-Of-Distribution (OOD) data from other classes, Semi-Supervised Learning (SSL) methods suffer from severe performance degradation and even get worse than merely training on labeled data. In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data according to the model's prediction. We aim to answer two main questions: (1) How do OOD data influence PL? (2) What is the proper usage of OOD data with PL? First, we show that the major problem of PL is imbalanced pseudo-labels on OOD data. Second, we find that OOD data can help classify In-Distribution (ID) data given their OOD ground truth labels. Based on the findings, we propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels of high-confidence data, which simultaneously filters out OOD data and addresses the imbalance problem. SEC uses balanced clustering on low-confidence data to create pseudo-labels on extra classes, simulating the process of training with ground truth. Experiments show that our method achieves steady improvement over supervised baseline and state-of-the-art performance under all class mismatch ratios on different benchmarks.
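    As a rough illustration of plain Pseudo-Labeling as described above (not the RPL or SEC components), the sketch below assigns each unlabeled example the class with the highest predicted probability when that probability clears a confidence threshold, then counts labels per class so that imbalance can be inspected. The function names and the 0.9 threshold are assumptions made for this example.

```python
from collections import Counter


def pseudo_label(probabilities, threshold=0.9):
    """Return (example_index, predicted_class) pairs for confident predictions.

    probabilities : list of per-class probability lists, one per unlabeled example
    threshold     : minimum confidence required to keep a pseudo-label (assumed)
    """
    labeled = []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= threshold:
            # The arg-max class becomes the pseudo-label.
            labeled.append((index, probs.index(confidence)))
    return labeled


def label_counts(pseudo_labels):
    """Count pseudo-labels per class to expose class imbalance."""
    return Counter(label for _, label in pseudo_labels)
```

    Heavily skewed counts from label_counts would be one simple symptom of the imbalance problem the abstract attributes to OOD data.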
    Interpretable and Scalable Graphical Models for Complex Spatio-temporal Processes. (arXiv:2301.06021v1 [cs.LG])
    This thesis focuses on data that has complex spatio-temporal structure and on probabilistic graphical models that learn the structure in an interpretable and scalable manner. We target two research areas of interest: Gaussian graphical models for tensor-variate data and summarization of complex time-varying texts using topic models. This work advances the state-of-the-art in several directions. First, it introduces a new class of tensor-variate Gaussian graphical models via the Sylvester tensor equation. Second, it develops an optimization technique based on a fast-converging proximal alternating linearized minimization method, which scales tensor-variate Gaussian graphical model estimations to modern big-data settings. Third, it connects Kronecker-structured (inverse) covariance models with spatio-temporal partial differential equations (PDEs) and introduces a new framework for ensemble Kalman filtering that is capable of tracking chaotic physical systems. Fourth, it proposes a modular and interpretable framework for unsupervised and weakly-supervised probabilistic topic modeling of time-varying data that combines generative statistical models with computational geometric methods. Throughout, practical applications of the methodology are considered using real datasets. This includes brain-connectivity analysis using EEG data, space weather forecasting using solar imaging data, longitudinal analysis of public opinions using Twitter data, and mining of mental health related issues using TalkLife data. We show in each case that the graphical modeling framework introduced here leads to improved interpretability, accuracy, and scalability.
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.
    Static, dynamic and stability analysis of multi-dimensional functional graded plate with variable thickness using deep neural network. (arXiv:2301.05900v1 [cs.LG])
    The goal of this paper is to analyze and predict the central deflection, natural frequency, and critical buckling load of the multi-directional functionally graded (FG) plate with variable thickness resting on an elastic Winkler foundation. First, the mathematical models of the static and eigenproblems are formulated in great detail. The FG material properties are assumed to vary smoothly and continuously throughout three directions of the plate according to a Mori-Tanaka micromechanics model distribution of volume fraction of constituents. Then, finite element analysis (FEA) with mixed interpolation of tensorial components of 4-nodes (MITC4) is implemented in order to theoretically eliminate the shear locking phenomenon. Next, influences of the variable thickness functions (uniform, non-uniform linear, and non-uniform non-linear), material properties, length-to-thickness ratio, boundary conditions, and elastic parameters on the plate response are investigated and discussed in detail through several numerical examples. Finally, a deep neural network (DNN) using batch normalization (BN) is trained to predict the non-dimensional values of multi-directional FG plates. The DNN model also proves to be a powerful technique capable of handling an extensive database and different vital parameters in engineering applications.
    Deep-Reinforcement-Learning-based Path Planning for Industrial Robots using Distance Sensors as Observation. (arXiv:2301.05980v1 [cs.RO])
    Industrial robots are widely used in various manufacturing environments due to their efficiency in doing repetitive tasks such as assembly or welding. A common problem for these applications is to reach a destination without colliding with obstacles or other robot arms. Commonly used sampling-based path planning approaches such as RRT require long computation times, especially in complex environments. Furthermore, the environment in which they are employed needs to be known beforehand. When utilizing the approaches in new environments, a tedious engineering effort in setting hyperparameters needs to be conducted, which is time- and cost-intensive. On the other hand, Deep Reinforcement Learning has shown remarkable results in dealing with unknown environments, generalizing to new problem instances, and solving motion planning problems efficiently. On that account, this paper proposes a Deep-Reinforcement-Learning-based motion planner for robotic manipulators. We evaluated our model against state-of-the-art sampling-based planners in several experiments. The results show the superiority of our planner in terms of path length and execution time.
    Recent advances in artificial intelligence for retrosynthesis. (arXiv:2301.05864v1 [cs.LG])
    Retrosynthesis is the cornerstone of organic chemistry, providing chemists in material and drug manufacturing access to poorly available and brand-new molecules. Conventional rule-based or expert-based computer-aided synthesis has obvious limitations, such as high labor costs and limited search space. In recent years, dramatic breakthroughs driven by artificial intelligence have revolutionized retrosynthesis. Here we aim to present a comprehensive review of recent advances in AI-based retrosynthesis. For both single-step and multi-step retrosynthesis, we first list their goals and provide a thorough taxonomy of existing methods. Afterwards, we analyze these methods in terms of their mechanism and performance, and introduce popular evaluation metrics for them, with a detailed comparison among representative methods on several public datasets. In the next part, we introduce popular databases and established platforms for retrosynthesis. Finally, this review concludes with a discussion about promising research directions in this field.
    Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning. (arXiv:2301.05931v1 [cs.LG])
    Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation. However, identifying novel drug combinations through wet-lab experiments is resource intensive due to the vast combinatorial search space. Recently, computational approaches, specifically deep learning models, have emerged as an efficient way to discover synergistic combinations. While previous methods reported fair performance, their models usually do not take advantage of multi-modal data and they are unable to handle new drugs or cell lines. In this study, we collected data from various datasets covering various drug-related aspects. Then, we take advantage of large-scale pre-training models to generate informative representations and features for drugs, proteins, and diseases. Based on that, a message-passing graph is built on top to propagate information together with graph structure learning flexibility. This is first introduced in biological networks and enables us to generate pseudo-relations in the graph. Our framework achieves state-of-the-art results in comparison with other deep learning-based methods on synergistic prediction benchmark datasets. Our framework can also infer new drug combination data in a test on an independent set released by AstraZeneca, where a 10% improvement over previous methods is observed. In addition, it is robust against unseen drugs and surpasses the second-best model by almost 15% AUROC. We believe our framework contributes to both the future wet-lab discovery of novel drugs and the building of promising guidance for precise combination medicine.
    Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint). (arXiv:2301.05965v1 [cs.DB])
    Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers. Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of primitive holds or not, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and the users can construct their own via provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well. In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.
    World Models and Predictive Coding for Cognitive and Developmental Robotics: Frontiers and Challenges. (arXiv:2301.05832v1 [cs.RO])
    Creating autonomous robots that can actively explore the environment, acquire knowledge and learn skills continuously is the ultimate achievement envisioned in cognitive and developmental robotics. Their learning processes should be based on interactions with their physical and social world in the manner of human learning and cognitive development. Based on this context, in this paper, we focus on the two concepts of world models and predictive coding. Recently, world models have attracted renewed attention as a topic of considerable interest in artificial intelligence. Cognitive systems learn world models to better predict future sensory observations and optimize their policies, i.e., controllers. Alternatively, in neuroscience, predictive coding proposes that the brain continuously predicts its inputs and adapts to model its own dynamics and control behavior in its environment. Both ideas may be considered as underpinning the cognitive development of robots and humans capable of continual or lifelong learning. Although many studies have been conducted on predictive coding in cognitive robotics and neurorobotics, the relationship between world model-based approaches in AI and predictive coding in robotics has rarely been discussed. Therefore, in this paper, we clarify the definitions, relationships, and status of current research on these topics, as well as missing pieces of world models and predictive coding in conjunction with crucially related concepts such as the free-energy principle and active inference in the context of cognitive and developmental robotics. Furthermore, we outline the frontiers and challenges involved in world models and predictive coding toward the further integration of AI and robotics, as well as the creation of robots with real cognitive and developmental capabilities in the future.
    Hand Gesture Recognition through Reflected Infrared Light Wave Signals. (arXiv:2301.05955v1 [eess.SP])
    In this study, we present a wireless (non-contact) gesture recognition method using only incoherent light wave signals reflected from a human subject. In comparison to existing radar, light shadow, sound and camera-based sensing systems, this technology uses a low-cost ubiquitous light source (e.g., infrared LED) to send light towards the subject's hand performing gestures and the reflected light is collected by a light sensor (e.g., photodetector). This light wave sensing system recognizes different gestures from the variations of the received light intensity within a 20-35cm range. The hand gesture recognition results demonstrate up to 96% accuracy on average. The developed system can be utilized in numerous Human-computer Interaction (HCI) applications as a low-cost and non-contact gesture recognition technology.
    Risk-Averse Reinforcement Learning via Dynamic Time-Consistent Risk Measures. (arXiv:2301.05981v1 [cs.LG])
    Traditional reinforcement learning (RL) aims to maximize the expected total reward, while the risk of uncertain outcomes needs to be controlled to ensure reliable performance in a risk-averse setting. In this paper, we consider the problem of maximizing dynamic risk of a sequence of rewards in infinite-horizon Markov Decision Processes (MDPs). We adapt the Expected Conditional Risk Measures (ECRMs) to the infinite-horizon risk-averse MDP and prove its time consistency. Using a convex combination of expectation and conditional value-at-risk (CVaR) as a special one-step conditional risk measure, we reformulate the risk-averse MDP as a risk-neutral counterpart with augmented action space and manipulation on the immediate rewards. We further prove that the related Bellman operator is a contraction mapping, which guarantees the convergence of any value-based RL algorithms. Accordingly, we develop a risk-averse deep Q-learning framework, and our numerical studies based on two simple MDPs show that the risk-averse setting can reduce the variance and enhance robustness of the results.
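    The one-step risk measure mentioned above, a convex combination of expectation and CVaR, can be illustrated on a finite sample. The sketch below is an empirical estimate under a loss convention (larger values are worse) and is not the paper's ECRM formulation; the function names and the parameters alpha (tail level) and lam (weight on CVaR) are assumptions chosen for this example.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled losses (larger = worse)."""
    ordered = sorted(losses, reverse=True)
    # Number of tail samples; the small offset guards against float round-up.
    k = max(1, math.ceil((1.0 - alpha) * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k


def mean_cvar(losses, alpha, lam):
    """Convex combination (1 - lam) * E[Z] + lam * CVaR_alpha(Z), lam in [0, 1]."""
    mean = sum(losses) / len(losses)
    return (1.0 - lam) * mean + lam * empirical_cvar(losses, alpha)
```

    Setting lam to 0 recovers the risk-neutral expectation, while lam equal to 1 uses only the tail average, so lam interpolates between the two attitudes toward risk.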
    Generalized Invariant Matching Property via LASSO. (arXiv:2301.05975v1 [stat.ME])
    Learning under distribution shifts is a challenging task. One principled approach is to exploit the invariance principle via the structural causal models. However, the invariance principle is violated when the response is intervened, making it a difficult setting. In a recent work, the invariant matching property has been developed to shed light on this scenario and shows promising performance. In this work, we generalize the invariant matching property by formulating a high-dimensional problem with intrinsic sparsity. We propose a more robust and computation-efficient algorithm by leveraging a variant of Lasso, improving upon the existing algorithms.
    Adaptive Neural Networks Using Residual Fitting. (arXiv:2301.05744v1 [cs.LG])
    Current methods for estimating the required neural-network size for a given problem class have focused on methods that can be computationally intensive, such as neural-architecture search and pruning. In contrast, methods that add capacity to neural networks as needed may provide similar results to architecture search and pruning, but do not require as much computation to find an appropriate network size. Here, we present a network-growth method that searches for explainable error in the network's residuals and grows the network if sufficient error is detected. We demonstrate this method using examples from classification, imitation learning, and reinforcement learning. Within these tasks, the growing network can often achieve better performance than small networks that do not grow, and similar performance to networks that begin much larger.
    Day-Ahead PV Power Forecasting Based on MSTL-TFT. (arXiv:2301.05911v1 [cs.LG])
    Energy demand is increasing dramatically as global urbanization progresses. Solar energy is a clean energy source with low production and maintenance costs. Accurately predicted PV generation is of great importance for grid integration. Recent day-ahead PV forecasting studies mainly include generation data decomposition, additional meteorological and equipment features, and improvement and integration of ANN-based models. We propose an MSTL-TFT method for day-ahead PV forecasting. The results are better than any of the other studies we have surveyed on day-ahead DKASC PV forecasting.
    Lung airway geometry as an early predictor of autism: A preliminary machine learning-based study. (arXiv:2301.05777v1 [cs.LG])
    The goal of this study is to assess the feasibility of airway geometry as a biomarker for ASD. Chest CT images of children with a documented diagnosis of ASD as well as healthy controls were identified retrospectively. 54 scans were obtained for analysis, including 31 ASD cases and 23 age and sex-matched controls. A feature selection and classification procedure using principal component analysis (PCA) and support vector machine (SVM) achieved a peak cross validation accuracy of nearly 89% using a feature set of 8 airway branching angles. Sensitivity was 94%, but specificity was only 78%. The results suggest a measurable difference in airway branchpoint angles between children with ASD and the control population.
    CrysGNN : Distilling pre-trained knowledge to enhance property prediction for crystalline materials. (arXiv:2301.05852v1 [cs.LG])
    In recent years, graph neural network (GNN) based approaches have emerged as a powerful technique to encode the complex topological structure of crystal materials in an enriched representation space. These models are often supervised in nature and, using property-specific training data, learn the relationship between crystal structure and different properties such as formation energy, bandgap, and bulk modulus. Most of these methods require a huge amount of property-tagged data to train the system, which may not be available for different properties. However, a huge amount of crystal data with chemical composition and structural bonds is available. To leverage these untapped data, this paper presents CrysGNN, a new pre-trained GNN framework for crystalline materials, which captures both node and graph level structural information of crystal graphs using a huge amount of unlabelled material data. Further, we extract distilled knowledge from CrysGNN and inject it into different state-of-the-art property predictors to enhance their property prediction accuracy. We conduct extensive experiments to show that with distilled knowledge from the pre-trained model, all the SOTA algorithms are able to outperform their own vanilla version with good margins. We also observe that the distillation process provides a significant improvement over the conventional approach of finetuning the pre-trained model. We have released the pre-trained model along with the large dataset of 800K crystal graphs that we carefully curated, so that the pre-trained model can be plugged into any existing and upcoming models to enhance their prediction accuracy.
    Discovery of 2D materials using Transformer Network based Generative Design. (arXiv:2301.05824v1 [cond-mat.mtrl-sci])
    Two-dimensional (2D) materials have wide applications in superconductors, quantum, and topological materials. However, their rational design is not well established, and currently fewer than 6,000 experimentally synthesized 2D materials have been reported. Recently, deep learning, data-mining, and density functional theory (DFT)-based high-throughput calculations have been widely used to discover potential new materials for diverse applications. Here we propose a generative material design pipeline, namely material transformer generator (MTG), for large-scale discovery of hypothetical 2D materials. We train two 2D materials composition generators using self-learning neural language models based on Transformers with and without transfer learning. The models are then used to generate a large number of candidate 2D compositions, which are fed to known 2D materials templates for crystal structure prediction. Next, we perform DFT computations to study their thermodynamic stability based on energy-above-hull and formation energy. We report four new DFT-verified stable 2D materials with zero e-above-hull energies, including NiCl$_4$, IrSBr, CuBr$_3$, and CoBrCl. Our work thus demonstrates the potential of our MTG generative materials design pipeline in the discovery of novel 2D materials and other functional materials.
    A Survey of Self-Supervised Learning from Multiple Perspectives: Algorithms, Theory, Applications and Future Trends. (arXiv:2301.05712v1 [cs.LG])
    Deep supervised learning algorithms generally require large numbers of labeled examples to attain satisfactory performance. To avoid the expensive cost incurred by collecting and labeling too many examples, as a subset of unsupervised learning, self-supervised learning (SSL) was proposed to learn good features from many unlabeled examples without any human-annotated labels. SSL has recently become a hot research topic, and many related algorithms have been proposed. However, few comprehensive studies have explained the connections among different SSL variants and how they have evolved. In this paper, we attempt to provide a review of the various SSL methods from the perspectives of algorithms, theory, applications, three main trends, and open questions. First, the motivations of most SSL algorithms are introduced in detail, and their commonalities and differences are compared. Second, the theoretical issues associated with SSL are investigated. Third, typical applications of SSL in areas such as image processing and computer vision (CV), as well as natural language processing (NLP), are discussed. Finally, the three main trends of SSL and the open research questions are discussed. A collection of useful materials is available at https://github.com/guijiejie/SSL.
    Survey of Knowledge Distillation in Federated Edge Learning. (arXiv:2301.05849v1 [cs.LG])
    The increasing demand for intelligent services and privacy protection of mobile and Internet of Things (IoT) devices motivates the wide application of Federated Edge Learning (FEL), in which devices collaboratively train on-device Machine Learning (ML) models without sharing their private data. Limited by device hardware, diverse user behaviors and network infrastructure, the algorithm design of FEL faces challenges related to resources, personalization and network environments, and Knowledge Distillation (KD) has been leveraged as an important technique to tackle these challenges. In this paper, we investigate works that apply KD to FEL, discuss the limitations and open problems of existing KD-based FEL approaches, and provide guidance for their real deployment.
    First Three Years of the International Verification of Neural Networks Competition (VNN-COMP). (arXiv:2301.05815v1 [cs.LG])
    This paper presents a summary and meta-analysis of the first three iterations of the annual International Verification of Neural Networks Competition (VNN-COMP) held in 2020, 2021, and 2022. In the VNN-COMP, participants submit software tools that analyze whether given neural networks satisfy specifications describing their input-output behavior. These neural networks and specifications cover a variety of problem classes and tasks, corresponding to safety and robustness properties in image classification, neural control, reinforcement learning, and autonomous systems. We summarize the key processes, rules, and results, present trends observed over the last three years, and provide an outlook into possible future developments.
    A Rigorous Uncertainty-Aware Quantification Framework Is Essential for Reproducible and Replicable Machine Learning Workflows. (arXiv:2301.05763v1 [cs.LG])
    The ability to replicate predictions by machine learning (ML) or artificial intelligence (AI) models and results in scientific workflows that incorporate such ML/AI predictions is driven by numerous factors. An uncertainty-aware metric that can quantitatively assess the reproducibility of quantities of interest (QoI) would contribute to the trustworthiness of results obtained from scientific workflows involving ML/AI models. In this article, we discuss how uncertainty quantification (UQ) in a Bayesian paradigm can provide a general and rigorous framework for quantifying reproducibility for complex scientific workflows. Such a framework has the potential to fill a critical gap that currently exists in ML/AI for scientific workflows, as it will enable researchers to determine the impact of ML/AI model prediction variability on the predictive outcomes of ML/AI-powered workflows. We expect that the envisioned framework will contribute to the design of more reproducible and trustworthy workflows for diverse scientific applications, and ultimately, accelerate scientific discoveries.
    CEDAS: A Compressed Decentralized Stochastic Gradient Method with Improved Convergence. (arXiv:2301.05872v1 [math.OC])
    In this paper, we consider solving the distributed optimization problem over a multi-agent network in the communication-restricted setting. We study a compressed decentralized stochastic gradient method, termed "compressed exact diffusion with adaptive stepsizes (CEDAS)", and show that the method asymptotically achieves a convergence rate comparable to that of centralized SGD for both smooth strongly convex objective functions and smooth nonconvex objective functions under unbiased compression operators. In particular, to our knowledge, CEDAS enjoys so far the shortest transient time (with respect to the graph specifics) for achieving the convergence rate of centralized SGD, which behaves as $\mathcal{O}(nC^3/(1-\lambda_2)^{2})$ under smooth strongly convex objective functions, and $\mathcal{O}(n^3C^6/(1-\lambda_2)^4)$ under smooth nonconvex objective functions, where $(1-\lambda_2)$ denotes the spectral gap of the mixing matrix, and $C>0$ is the compression-related parameter. Numerical experiments further demonstrate the effectiveness of the proposed algorithm.
    Poisoning Attacks and Defenses in Federated Learning: A Survey. (arXiv:2301.05795v1 [cs.CR])
    Federated learning (FL) enables the training of models among distributed clients without compromising the privacy of training datasets, but the invisibility of clients' datasets and of the training process poses a variety of security threats. This survey provides a taxonomy of poisoning attacks and an experimental evaluation to discuss the need for robust FL.
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast development of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is widely used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multi-fidelity data structures to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with the exact predictive mean accuracy and a computation reduction by a factor of $d^3$, where $d$ is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods by a large margin (up to 6x improvement in RMSE) with only a couple of high-fidelity training samples.
    A Comprehensive Survey of Graph-level Learning. (arXiv:2301.05860v1 [cs.LG])
    Graphs have a superior ability to represent relational data, like chemical compounds, proteins, and social networks. Hence, graph-level learning, which takes a set of graphs as input, has been applied to many tasks including comparison, regression, classification, and more. Traditional approaches to learning a set of graphs tend to rely on hand-crafted features, such as substructures. But while these methods benefit from good interpretability, they often suffer from computational bottlenecks as they cannot skirt the graph isomorphism problem. Conversely, deep learning has helped graph-level learning adapt to the growing scale of graphs by extracting features automatically and decoding graphs into low-dimensional representations. As a result, these deep graph learning methods have been responsible for many successes. Yet, there is no comprehensive survey that reviews graph-level learning starting with traditional learning and moving through to the deep learning approaches. This article fills this gap and frames the representative algorithms into a systematic taxonomy covering traditional learning, graph-level deep neural networks, graph-level graph neural networks, and graph pooling. To ensure a thoroughly comprehensive survey, the evolutions, interactions, and communications between methods from four different branches of development are also examined. This is followed by a brief review of the benchmark data sets, evaluation metrics, and common downstream applications. The survey concludes with 13 future directions of necessary research that will help to overcome the challenges facing this booming field.
    Local Model Explanations and Uncertainty Without Model Access. (arXiv:2301.05761v1 [cs.LG])
    We present a model-agnostic algorithm for generating post-hoc explanations and uncertainty intervals for a machine learning model when only a sample of inputs and outputs from the model is available, rather than direct access to the model itself. This situation may arise when model evaluations are expensive; when privacy, security and bandwidth constraints are imposed; or when there is a need for real-time, on-device explanations. Our algorithm constructs explanations using local polynomial regression and quantifies the uncertainty of the explanations using a bootstrapping approach. Through a simulation study, we show that the uncertainty intervals generated by our algorithm exhibit a favorable trade-off between interval width and coverage probability compared to the naive confidence intervals from classical regression analysis. We further demonstrate the capabilities of our method by applying it to black-box models trained on two real datasets.
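    The two ingredients named in the abstract above, local polynomial regression and bootstrapping, can be illustrated with a minimal sketch. This is not the authors' algorithm: the degree-1 (local linear) fit, the Gaussian kernel, the `bandwidth` parameter, the percentile intervals, and the function names `local_linear_slopes` and `bootstrap_intervals` are all assumptions chosen for illustration. The only inputs are a sample of model inputs `X` and outputs `y`, with no access to the model itself.

```python
import numpy as np


def local_linear_slopes(X, y, x0, bandwidth):
    """Kernel-weighted linear fit around x0; the slopes act as a local explanation."""
    # Gaussian kernel weights (an assumed choice): nearby samples count more.
    sq_dist = np.sum((X - x0) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dist / bandwidth**2)
    # Design matrix: intercept plus inputs centred on the query point.
    design = np.hstack([np.ones((len(X), 1)), X - x0])
    # Weighted least squares via row scaling by sqrt(weight).
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef[1:]  # drop the intercept, keep one slope per feature


def bootstrap_intervals(X, y, x0, bandwidth, n_boot=200, level=0.9, seed=0):
    """Percentile bootstrap intervals for the local slopes."""
    rng = np.random.default_rng(seed)
    n = len(X)
    slopes = np.empty((n_boot, X.shape[1]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample (input, output) pairs with replacement
        slopes[b] = local_linear_slopes(X[idx], y[idx], x0, bandwidth)
    lower, upper = np.quantile(slopes, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return lower, upper
```

    A narrow interval suggests the local slope is stable under resampling of the available input-output pairs; a wide one suggests the sample near `x0` is too sparse or too noisy to support the explanation.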
    Who Should I Trust: AI or Myself? Leveraging Human and AI Correctness Likelihood to Promote Appropriate Trust in AI-Assisted Decision-Making. (arXiv:2301.05809v1 [cs.HC])
    In AI-assisted decision-making, it is critical for human decision-makers to know when to trust AI and when to trust themselves. However, prior studies calibrated human trust only based on AI confidence indicating AI's correctness likelihood (CL) but ignored humans' CL, hindering optimal team decision-making. To mitigate this gap, we proposed to promote humans' appropriate trust based on the CL of both sides at a task-instance level. We first modeled humans' CL by approximating their decision-making models and computing their potential performance in similar instances. We demonstrated the feasibility and effectiveness of our model via two preliminary studies. Then, we proposed three CL exploitation strategies to calibrate users' trust explicitly/implicitly in the AI-assisted decision-making process. Results from a between-subjects experiment (N=293) showed that our CL exploitation strategies promoted more appropriate human trust in AI, compared with only using AI confidence. We further provided practical implications for more human-compatible AI-assisted decision-making.
    Functional Neural Networks: Shift invariant models for functional data with applications to EEG classification. (arXiv:2301.05869v1 [cs.LG])
    It is desirable for statistical models to detect signals of interest independently of their position. If the data is generated by some smooth process, this additional structure should be taken into account. We introduce a new class of neural networks that are shift invariant and preserve smoothness of the data: functional neural networks (FNNs). For this, we use methods from functional data analysis (FDA) to extend multi-layer perceptrons and convolutional neural networks to functional data. We propose different model architectures, show that the models outperform a benchmark model from FDA in terms of accuracy and successfully use FNNs to classify electroencephalography (EEG) data.
    Insights Into Deep Non-linear Filters for Improved Multi-channel Speech Enhancement. (arXiv:2206.13310v3 [eess.AS] UPDATED)
    The key advantage of using multiple microphones for speech enhancement is that spatial filtering can be used to complement the tempo-spectral processing. In a traditional setting, linear spatial filtering (beamforming) and single-channel post-filtering are commonly performed separately. In contrast, there is a trend towards employing deep neural networks (DNNs) to learn a joint spatial and tempo-spectral non-linear filter, which means that the restriction of a linear processing model and that of a separate processing of spatial and tempo-spectral information can potentially be overcome. However, the internal mechanisms that lead to good performance of such data-driven filters for multi-channel speech enhancement are not well understood. Therefore, in this work, we analyse the properties of a non-linear spatial filter realized by a DNN as well as its interdependency with temporal and spectral processing by carefully controlling the information sources (spatial, spectral, and temporal) available to the network. We confirm the superiority of a non-linear spatial processing model, which outperforms an oracle linear spatial filter in a challenging speaker extraction scenario for a low number of microphones by 0.24 POLQA score. Our analyses reveal that in particular spectral information should be processed jointly with spatial information as this increases the spatial selectivity of the filter. Our systematic evaluation then leads to a simple network architecture, that outperforms state-of-the-art network architectures on a speaker extraction task by 0.22 POLQA score and by 0.32 POLQA score on the CHiME3 data.
    Disentangling representations in Restricted Boltzmann Machines without adversaries. (arXiv:2206.11600v3 [cs.LG] UPDATED)
    A goal of unsupervised machine learning is to build representations of complex high-dimensional data, with simple relations to their properties. Such disentangled representations make it easier to interpret the significant latent factors of variation in the data, as well as to generate new data with desirable features. Methods for disentangling representations often rely on an adversarial scheme, in which representations are tuned to prevent discriminators from being able to reconstruct information about the data properties (labels). Unfortunately, adversarial training is generally difficult to implement in practice. Here we propose a simple, effective way of disentangling representations without any need to train adversarial discriminators, and apply our approach to Restricted Boltzmann Machines (RBM), one of the simplest representation-based generative models. Our approach relies on the introduction of adequate constraints on the weights during training, which allows us to concentrate information about labels on a small subset of latent variables. The effectiveness of the approach is illustrated with four examples: the CelebA dataset of facial images, the two-dimensional Ising model, the MNIST dataset of handwritten digits, and the taxonomy of protein families. In addition, we show how our framework allows for analytically computing the cost, in terms of log-likelihood of the data, associated with the disentanglement of their representations.
    Current Trends in Deep Learning for Earth Observation: An Open-source Benchmark Arena for Image Classification. (arXiv:2207.07189v2 [cs.CV] UPDATED)
    We present AiTLAS: Benchmark Arena -- an open-source benchmark suite for evaluating state-of-the-art deep learning approaches for image classification in Earth Observation (EO). To this end, we present a comprehensive comparative analysis of more than 500 models derived from ten different state-of-the-art architectures and compare them on a variety of multi-class and multi-label classification tasks from 22 datasets with different sizes and properties. In addition to models trained entirely on these datasets, we benchmark models trained in the context of transfer learning, leveraging pre-trained model variants, as is typically done in practice. All presented approaches are general and can be easily extended to many other remote sensing image classification tasks not considered in this study. To ensure reproducibility and facilitate better usability and further developments, all of the experimental resources, including the trained models, model configurations, and processing details of the datasets (with their corresponding splits used for training and evaluating the models), are publicly available in the repository: https://github.com/biasvariancelabs/aitlas-arena
    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. (arXiv:2206.07085v3 [cs.LG] UPDATED)
    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Counterfactual Fairness with Partially Known Causal Graph. (arXiv:2205.13972v3 [cs.LG] UPDATED)
    Fair machine learning aims to avoid treating individuals or sub-populations unfavourably based on sensitive attributes, such as gender and race. Those methods in fair machine learning that are built on causal inference ascertain discrimination and bias through causal effects. Though causality-based fair learning is attracting increasing attention, current methods assume the true causal graph is fully known. This paper proposes a general method to achieve the notion of counterfactual fairness when the true causal graph is unknown. To be able to select features that lead to counterfactual fairness, we derive the conditions and algorithms to identify ancestral relations between variables on a Partially Directed Acyclic Graph (PDAG), specifically, a class of causal DAGs that can be learned from observational data combined with domain knowledge. Interestingly, we find that counterfactual fairness can be achieved as if the true causal graph were fully known, when specific background knowledge is provided: the sensitive attributes do not have ancestors in the causal graph. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.
    Policy Gradients using Variational Quantum Circuits. (arXiv:2203.10591v3 [quant-ph] UPDATED)
    Variational Quantum Circuits are being used as versatile Quantum Machine Learning models. Some empirical results exhibit an advantage in supervised and generative learning tasks. However, less is known about their application to Reinforcement Learning. In this work, we consider a Variational Quantum Circuit composed of a low-depth hardware-efficient ansatz as the parameterized policy of a Reinforcement Learning agent. We show that an $\epsilon$-approximation of the policy gradient can be obtained using a number of samples that is logarithmic in the total number of parameters. We empirically verify that such quantum models behave similarly to or even outperform typical classical neural networks used in standard benchmarking environments and in quantum control, using only a fraction of the parameters. Moreover, we study the Barren Plateau phenomenon in quantum policy gradients using the Fisher Information Matrix spectrum.
    Sharing to learn and learning to share -- Fitting together Meta-Learning, Multi-Task Learning, and Transfer Learning: A meta review. (arXiv:2111.12146v5 [cs.LG] UPDATED)
    Integrating knowledge across different domains is an essential feature of human learning. Learning paradigms such as transfer learning, meta learning, and multi-task learning reflect the human learning process by exploiting prior knowledge for new tasks, encouraging faster learning and good generalization. This article gives a detailed view of these learning paradigms and their comparative analysis. The weakness of one learning algorithm turns out to be a strength of another, and thus merging them is a prevalent trait in the literature. There are numerous research papers that focus on each of these learning paradigms separately and provide a comprehensive overview of them. This article, however, provides a review of research studies that combine (two of) these learning algorithms. This survey describes how these techniques are combined to solve problems in many different fields of study, including computer vision, natural language processing, hyperspectral imaging, and many more, in the supervised setting only. As a result, the global generic learning network, an amalgamation of meta learning, transfer learning, and multi-task learning, is introduced here, along with some open research questions and future research directions in the multi-task setting.
    Privatized Graph Federated Learning. (arXiv:2203.07105v2 [cs.LG] UPDATED)
    Federated learning is a semi-distributed algorithm, where a server communicates with multiple dispersed clients to learn a global model. The federated architecture is not robust and is sensitive to communication and computational overloads due to its one-master multi-client structure. It can also be subject to privacy attacks targeting personal information on the communication links. In this work, we introduce graph federated learning (GFL), which consists of multiple federated units connected by a graph. We then show how graph homomorphic perturbations can be used to ensure the algorithm is differentially private. We conduct both convergence and privacy theoretical analyses and illustrate performance by means of computer simulations.
    Learning Partial Equivariances from Data. (arXiv:2110.10211v3 [cs.CV] UPDATED)
    Group Convolutional Neural Networks (G-CNNs) constrain learned features to respect the symmetries in the selected group, and lead to better generalization when these symmetries appear in the data. If this is not the case, however, equivariance leads to overly constrained models and worse performance. Frequently, transformations occurring in data can be better represented by a subset of a group than by a group as a whole, e.g., rotations in $[-90^{\circ}, 90^{\circ}]$. In such cases, a model that respects equivariance partially is better suited to represent the data. In addition, relevant transformations may differ for low and high-level features. For instance, full rotation equivariance is useful to describe edge orientations in a face, but partial rotation equivariance is better suited to describe face poses relative to the camera. In other words, the optimal level of equivariance may differ per layer. In this work, we introduce Partial G-CNNs: G-CNNs able to learn layer-wise levels of partial and full equivariance to discrete groups, continuous groups, and combinations thereof as part of training. Partial G-CNNs retain full equivariance when beneficial, e.g., for rotated MNIST, but adjust it whenever it becomes harmful, e.g., for classification of 6 / 9 digits or natural images. We empirically show that partial G-CNNs perform on par with G-CNNs when full equivariance is advantageous, and outperform them otherwise.
    Variational Actor-Critic Algorithms. (arXiv:2108.01215v4 [cs.LG] UPDATED)
    We introduce a class of variational actor-critic algorithms based on a variational formulation over both the value function and the policy. The objective function of the variational formulation consists of two parts: one for maximizing the value function and the other for minimizing the Bellman residual. Besides the vanilla gradient descent with both the value function and the policy updates, we propose two variants, the clipping method and the flipping method, in order to speed up the convergence. We also prove that, when the prefactor of the Bellman residual is sufficiently large, the fixed point of the algorithm is close to the optimal policy.
    Smart Choices and the Selection Monad. (arXiv:2007.08926v7 [cs.LO] UPDATED)
    Describing systems in terms of choices and their resulting costs and rewards offers the promise of freeing algorithm designers and programmers from specifying how those choices should be made; in implementations, the choices can be realized by optimization techniques and, increasingly, by machine-learning methods. We study this approach from a programming-language perspective. We define two small languages that support decision-making abstractions: one with choices and rewards, and the other additionally with probabilities. We give both operational and denotational semantics. In the case of the second language we consider three denotational semantics, with varying degrees of correlation between possible program values and expected rewards. The operational semantics combine the usual semantics of standard constructs with optimization over spaces of possible execution strategies. The denotational semantics, which are compositional, rely on the selection monad, to handle choice, augmented with an auxiliary monad to handle other effects, such as rewards or probability. We establish adequacy theorems that the two semantics coincide in all cases. We also prove full abstraction at base types, with varying notions of observation in the probabilistic case corresponding to the various degrees of correlation. We present axioms for choice combined with rewards and probability, establishing completeness at base types for the case of rewards without probability.
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to the others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we construct a nearest neighbor-based ensemble of the measures and show that it is competitive with other state-of-the-art single-strategy multivariate time series classifiers.
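    The difference between the Independent and Dependent strategies mentioned in the abstract above can be sketched for plain DTW. This is an illustrative sketch, not the paper's code: the squared-difference local cost, the absence of a warping window, and the function names `dtw`, `dtw_dependent`, and `dtw_independent` are assumptions. Series are arrays of shape (length, dimensions).

```python
import numpy as np


def dtw(cost):
    """Classic DTW over a precomputed local cost matrix of shape (n, m)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor cells.
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return float(acc[n, m])


def dtw_dependent(a, b):
    """One warping path shared by all dimensions (cost summed over dimensions)."""
    diff = a[:, None, :] - b[None, :, :]
    return dtw(np.sum(diff**2, axis=2))


def dtw_independent(a, b):
    """A separate warping path per dimension; the per-dimension distances are summed."""
    return sum(
        dtw((a[:, None, k] - b[None, :, k]) ** 2) for k in range(a.shape[1])
    )
```

    Because the independent variant may pick a different alignment for each dimension, its value never exceeds the dependent one under this cost; the two coincide for univariate series.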
    SRECG: ECG Signal Super-resolution Framework for Portable/Wearable Devices in Cardiac Arrhythmias Classification. (arXiv:2012.03803v2 [eess.SP] UPDATED)
    A combination of cloud-based deep learning (DL) algorithms with portable/wearable (P/W) devices has been developed as a smart health care system to support automatic cardiac arrhythmias (CAs) classification using electrocardiography (ECG). However, long-term and continuous ECG monitoring is challenging because of the limited battery capacity and transmission bandwidth of P/W devices when incorporated with consumer electronics (CE). A feasible approach to address this challenge is to decrease sampling rates. However, low sampling rates lead to low-resolution signals that hinder CAs classification performance. In this study, we propose a DL-based ECG signal super-resolution framework (called SRECG) to enhance low-resolution ECG signals by jointly considering the accuracies when applied to the DL-based high-resolution multiclass classifier (HMC) of CAs. In our experiments, we downsampled the ECG signals from the CPSC2018 dataset and evaluated their HMC accuracies with and without SRECG. Experimental results show that SRECG improves the HMC accuracies compared to traditional interpolation methods. Moreover, approximately half of the CAs classification accuracies of HMC were maintained on the ECG signals enhanced by SRECG. The promising results confirm that SRECG can be suitably used to enhance low-resolution ECG signals from P/W devices with CE to improve their cloud-based HMC performance.
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    Advances in high-quality playback devices make it urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while the lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that, for anti-spoofing, the modeling process should pay more attention to indistinguishable samples than to easily classified ones, so that correct discrimination is the top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, which leverages a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. In addition, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than those trained with conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at https://github.com/asvspoof/D3M.
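    The idea of scaling the loss per sample, as described in the abstract above, can be illustrated with the widely used alpha-balanced binary focal loss. The exact form of the balanced focal loss used in D3M is not given in the abstract, so the formula below, the default `alpha` and `gamma` values, and the function name `balanced_focal_loss` are assumptions for illustration only.

```python
import numpy as np


def balanced_focal_loss(probs, labels, alpha=0.5, gamma=2.0, eps=1e-12):
    """Alpha-balanced binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs  : predicted probability of the positive class, one value per sample
    labels : 0/1 ground-truth labels
    alpha  : weight of the positive class (1 - alpha goes to the negative class)
    gamma  : focusing parameter; gamma = 0 recovers alpha-weighted cross-entropy
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels)
    # p_t is the probability the model assigns to the true class of each sample.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of confidently correct (easy) samples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

    With `gamma > 0`, a sample the model already classifies confidently contributes far less to the objective than a borderline one, which is the dynamic re-weighting behaviour the abstract refers to.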
    Efficient anomaly detection method for rooftop PV systems using big data and permutation entropy. (arXiv:2301.06035v1 [cs.LG])
    The number of rooftop photovoltaic (PV) systems has significantly increased in recent years around the globe, including in Australia. This trend is anticipated to continue in the next few years. Given their high share of generation in power systems, detecting malfunctions and abnormalities in rooftop PV systems is essential for ensuring their high efficiency and safety. In this paper, we present a novel anomaly detection method for a large number of rooftop PV systems installed in a region using big data and a time series complexity measure called weighted permutation entropy (WPE). This efficient method only uses the historical PV generation data in a given region to identify anomalous PV systems and requires no new sensor or smart device. Using a real-world PV generation dataset, we discuss how the hyperparameters of WPE should be tuned for the purpose. The proposed PV anomaly detection method is then tested on rooftop PV generation data from over 100 South Australian households. The results demonstrate that anomalous systems detected by our method have indeed encountered problems and require a close inspection. The detection and resolution of potential faults would result in better rooftop PV systems, longer lifetimes, and higher returns on investment.
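    For readers unfamiliar with weighted permutation entropy, here is a minimal sketch of the commonly used definition: each length-`order` window is mapped to its ordinal pattern and weighted by the window's variance. The function name, the parameters `order` and `delay`, and the normalization by log(order!) are illustrative assumptions; the abstract does not state the paper's settings or how the entropy is turned into an anomaly score.

```python
import math
from collections import defaultdict

def weighted_permutation_entropy(series, order=3, delay=1):
    """Normalized weighted permutation entropy of a 1-D sequence (0 to 1)."""
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series too short for the chosen order and delay")

    weight_by_pattern = defaultdict(float)
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: indices of the window sorted by value (ties keep index order).
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        # Weight of this window: its (population) variance.
        mean = sum(window) / order
        weight = sum((v - mean) ** 2 for v in window) / order
        weight_by_pattern[pattern] += weight

    total = sum(weight_by_pattern.values())
    if total == 0.0:
        return 0.0  # constant signal: no variance, treated as zero entropy

    entropy = 0.0
    for w in weight_by_pattern.values():
        if w > 0.0:
            p = w / total
            entropy -= p * math.log(p)
    return entropy / math.log(math.factorial(order))

ramp = [float(i) for i in range(20)]     # perfectly regular: one pattern only
zigzag = [float(i % 2) for i in range(20)]  # alternates between two patterns
```

    A low value indicates a regular signal and a value near 1 an irregular one; how such values would be compared across PV systems in a region is not specified here.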
    Hawk: An Industrial-strength Multi-label Document Classifier. (arXiv:2301.06057v1 [cs.CL])
    There is a plethora of methods and algorithms that solve the classical multi-label document classification problem. However, when it comes to deployment and usage in an industry setting, most, if not all, contemporary approaches fail to address some of the vital requirements of an ideal solution: (i) the ability to operate on variable-length texts and rambling documents; (ii) the catastrophic forgetting problem; (iii) modularity when it comes to online learning and updating the model; (iv) the ability to spotlight relevant text while producing the prediction, i.e., visualizing the predictions; (v) the ability to operate on imbalanced or skewed datasets; and (vi) scalability. The paper describes the significance of these problems in detail and proposes a unique neural network architecture that addresses them. The proposed architecture views documents as a sequence of sentences and leverages sentence-level embeddings for input representation. A hydranet-like architecture is designed to provide granular control and improve modularity, coupled with a weighted loss driving task-specific heads. In particular, two specific mechanisms are compared: Bi-LSTM and Transformer-based. The architecture is benchmarked on popular datasets such as Web of Science - 5763, Web of Science - 11967, BBC Sports, and BBC News. The experimental results reveal that the proposed model outperforms the existing methods by a substantial margin. The ablation study compares the impact of the attention mechanism and of the weighted loss functions used to train the task-specific heads in the hydranet.
    Deep Learning Provides Rapid Screen for Breast Cancer Metastasis with Sentinel Lymph Nodes. (arXiv:2301.05938v1 [cs.CV])
    Deep learning has been shown to be useful for detecting breast cancer metastases by analyzing whole slide images of sentinel lymph nodes. However, it requires extensive scanning and analysis of all the lymph node slides for each case. Our deep learning study focuses on breast cancer screening with only a small set of image patches from any sentinel lymph node, positive or negative for metastasis, to detect changes in the tumor environment rather than in the tumor itself. We design a convolutional neural network in the Python language to build a diagnostic model for this purpose. The excellent results from this preliminary study provide a proof of concept for incorporating an automated metastatic screen into the digital pathology workflow to augment pathologists' productivity. Our approach is unique in that it provides a very rapid screen rather than an exhaustive search for tumor in all fields of all sentinel lymph nodes.
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.
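    To make the quadratic cost concrete, the sketch below computes the standard biased (V-statistic) estimate of the squared maximum mean discrepancy (MMD) with a Gaussian kernel on scalar samples. It shows only the baseline quadratic-time statistic that a method like CTT aims to approximate; it is not the CTT procedure itself, and the function names and the bandwidth default are hypothetical.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two 1-D samples.

    Evaluates the kernel on every pair of points, so the cost grows
    quadratically with the sample size.
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

sample_a = [0.0, 0.5, 1.0, 1.5]
sample_b = [x + 3.0 for x in sample_a]   # same shape, shifted by 3

same = mmd_squared(sample_a, sample_a)
shifted = mmd_squared(sample_a, sample_b)
```

    In a full test, the statistic would be compared against a threshold, for example one calibrated by permuting the pooled samples; that step is omitted here.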
    A data science and machine learning approach to continuous analysis of Shakespeare's plays. (arXiv:2301.06024v1 [cs.CL])
    The availability of quantitative methods that can analyze text has provided new ways of examining literature in a manner that was not available in the pre-information era. Here we apply comprehensive machine learning analysis to the work of William Shakespeare. The analysis shows clear change in style of writing over time, with the most significant changes in the sentence length, frequency of adjectives and adverbs, and the sentiments expressed in the text. Applying machine learning to make a stylometric prediction of the year of the play shows a Pearson correlation of 0.71 between the actual and predicted year, indicating that Shakespeare's writing style as reflected by the quantitative measurements changed over time. Additionally, it shows that the stylometrics of some of the plays is more similar to plays written either before or after the year they were written. For instance, Romeo and Juliet is dated 1596, but is more similar in stylometrics to plays written by Shakespeare after 1600. The source code for the analysis is available for free download.
    Transferring Fairness under Distribution Shifts via Fair Consistency Regularization. (arXiv:2206.12796v3 [cs.LG] UPDATED)
    The increasing reliance on ML models in high-stakes tasks has raised a major concern on fairness violations. Although there has been a surge of work that improves algorithmic fairness, most of them are under the assumption of an identical training and test distribution. In many real-world applications, however, such an assumption is often violated as previously trained fair models are often deployed in a different environment, and the fairness of such models has been observed to collapse. In this paper, we study how to transfer model fairness under distribution shifts, a widespread issue in practice. We conduct a fine-grained analysis of how the fair model is affected under different types of distribution shifts and find that domain shifts are more challenging than subpopulation shifts. Inspired by the success of self-training in transferring accuracy under domain shifts, we derive a sufficient condition for transferring group fairness. Guided by it, we propose a practical algorithm with a fair consistency regularization as the key component. A synthetic dataset benchmark, which covers all types of distribution shifts, is deployed for experimental verification of the theoretical findings. Experiments on synthetic and real datasets including image and tabular data demonstrate that our approach effectively transfers fairness and accuracy under various distribution shifts.
    Evaluating the Spectral Bias of Coordinate Based MLPs. (arXiv:2301.05816v1 [cs.LG])
    In recent years, representations given by fully connected neural networks have been shown to represent scenes, objects, and other measurements well in dense low-dimensional settings. For these models, termed coordinate based MLPs, sinusoidal encodings are necessary for convergence to the high frequency components of the target function. This requirement is a result of their severe spectral bias when using dense, low dimensional coordinate based inputs. Previous work explained this phenomenon using the Neural Tangent Kernel (NTK) and Fourier analysis. While these methods provide insight into this large spectral bias and the benefits of positional encoding, the properties of ReLU networks that induce this behavior are not fully determined. Analyzing spectral bias directly through the computations of ReLU networks would expose their limitations in dense settings, while providing a clearer explanation as to how this behavior emerges during the learning process. In this paper, we systematically analyze the spectral bias of a coordinate based MLP through its activation regions and gradient descent dynamics. This allows us to relate the network's expressive capacity to the speed at which gradient descent converges for components of varying frequency, and to show how the density of the data further restricts the model.
    What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. (arXiv:2208.01066v2 [cs.CL] UPDATED)
    In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data. To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn "most" functions from this class? We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions -- that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (i) between the training data of the model and inference-time prompts, and (ii) between the in-context examples and the query input during inference. We also show that we can train Transformers to in-context learn more complex function classes -- namely sparse linear functions, two-layer neural networks, and decision trees -- with performance that matches or exceeds task-specific learning algorithms. Our code and models are available at https://github.com/dtsip/in-context-learning .
    Reinforcement learning on graphs: A survey. (arXiv:2204.06127v4 [cs.LG] UPDATED)
    Graph mining tasks arise in many application domains, ranging from social networks and transportation to e-commerce, and have received great attention from the theoretical and algorithmic design communities in recent years. There has also been some pioneering work employing well-researched Reinforcement Learning (RL) techniques to address graph data mining tasks. However, these graph mining methods and RL models are dispersed across different research areas, which makes it hard to compare them. In this survey, we provide a comprehensive overview of RL and graph mining methods and generalize these methods to Graph Reinforcement Learning (GRL) as a unified formulation. We further discuss the applications of GRL methods across various domains and summarize the method descriptions, open-source codes, and benchmark datasets of GRL methods. Furthermore, we propose important directions and challenges to be solved in the future. As far as we know, this is the latest comprehensive survey of GRL; it provides a global view and a learning resource for scholars. In addition, we create an online open-source resource for both interested scholars who want to enter this rapidly developing domain and experts who would like to compare GRL methods.
    K-Deep Simplex: Deep Manifold Learning via Local Dictionaries. (arXiv:2012.02134v3 [cs.LG] UPDATED)
    We propose K-Deep Simplex (KDS) which, given a set of data points, learns a dictionary comprising synthetic landmarks, along with representation coefficients supported on a simplex. KDS integrates manifold learning and sparse coding/dictionary learning: its objective combines a reconstruction term, as in classical dictionary learning, with a novel local weighted $\ell_1$ penalty that encourages each data point to represent itself as a convex combination of nearby landmarks. We solve the proposed optimization program using alternating minimization and design an efficient, interpretable autoencoder using algorithm unrolling. We theoretically analyze the proposed program by relating the weighted $\ell_1$ penalty in KDS to a weighted $\ell_0$ program. Assuming that the data are generated from a Delaunay triangulation, we prove the equivalence of the weighted $\ell_1$ and weighted $\ell_0$ programs. If the representation coefficients are given, we prove that the resulting dictionary is unique. Further, we show that low-dimensional representations can be efficiently obtained from the covariance of the coefficient matrix. We apply KDS to the unsupervised clustering problem and prove theoretical performance guarantees. Experiments show that the algorithm is highly efficient and performs competitively on synthetic and real data sets.
    Salient Sign Detection In Safe Autonomous Driving: AI Which Reasons Over Full Visual Context. (arXiv:2301.05804v1 [cs.CV])
    Detecting road traffic signs and accurately determining how they can affect the driver's future actions is a critical task for safe autonomous driving systems. However, the traffic signs in a driving scene have an unequal impact on the driver's decisions, making the detection of salient traffic signs a more important task. Our research addresses this issue by constructing a traffic sign detection model which emphasizes performance on salient signs, that is, signs that influence the decisions of a driver. We define a traffic sign salience property and use it to construct the LAVA Salient Signs Dataset, the first traffic sign dataset that includes an annotated salience property. Next, we use a custom salience loss function, Salience-Sensitive Focal Loss, to train a Deformable DETR object detection model in order to emphasize stronger performance on salient signs. Results show that a model trained with Salience-Sensitive Focal Loss outperforms a model trained without it, with regard to recall of both salient signs and all signs combined. Further, the performance margin on salient signs compared to all signs is largest for the model trained with Salience-Sensitive Focal Loss.  ( 2 min )
    MLOps: A Primer for Policymakers on a New Frontier in Machine Learning. (arXiv:2301.05775v1 [cs.LG])
    This chapter is written with the Data Scientist or MLOps professional in mind but can be used as a resource for policy makers, reformists, AI Ethicists, sociologists, and others interested in finding methods that help reduce bias in algorithms. I will take a deployment centered approach with the assumption that the professionals reading this work have already read the amazing work on the implications of algorithms on historically marginalized groups by Gebru, Buolamwini, Benjamin and Shane to name a few. If you have not read those works, I refer you to the "Important Reading for Ethical Model Building" list at the end of this paper as it will help give you a framework on how to think about Machine Learning models more holistically taking into account their effect on marginalized people. In the Introduction to this chapter, I root the significance of their work in real world examples of what happens when models are deployed without transparent data collected for the training process and are deployed without the practitioners paying special attention to what happens to models that adapt to exploit gaps between their training environment and the real world. The rest of this chapter builds on the work of the aforementioned researchers and discusses the reality of models performing post production and details ways ML practitioners can identify bias using tools during the MLOps lifecycle to mitigate bias that may be introduced to models in the real world.  ( 2 min )
    ML Approach for Power Consumption Prediction in Virtualized Base Stations. (arXiv:2301.05764v1 [cs.LG])
    The flexibility introduced with the Open Radio Access Network (O-RAN) architecture allows us to think beyond static configurations in all parts of the network. This paper addresses the issue related to predicting the power consumption of different radio schedulers, and the potential offered by O-RAN to collect data, train models, and deploy policies to control the power consumption. We propose a black-box (Neural Network) model to learn the power consumption function. We compare our approach with a known hand-crafted solution based on domain knowledge. Our solution reaches similar performance without any previous knowledge of the application and provides more flexibility in scenarios where the system behavior is not well understood or the domain knowledge is not available.  ( 2 min )
    FedSSC: Shared Supervised-Contrastive Federated Learning. (arXiv:2301.05797v1 [cs.LG])
    Federated learning is widely used to perform decentralized training of a global model on multiple devices while preserving the data privacy of each device. However, it suffers from heterogeneous local data on each training device, which makes it harder to reach the same level of accuracy as centralized training. Supervised Contrastive Learning, which outperforms cross-entropy, tries to minimize the difference in feature space between points that belong to the same class and pushes apart points from different classes. We propose Supervised Contrastive Federated Learning, in which devices share their learned class-wise feature spaces with each other and add the supervised-contrastive learning loss as a regularization term to foster feature space learning. The loss tries to minimize the cosine-similarity-based distance between the feature map and the averaged feature map from another device in the same class, and to maximize the distance between the feature map and that of a different class. This new regularization term, when added on top of the MOON regularization term, is found to outperform the other state-of-the-art regularization terms in addressing the heterogeneous data distribution problem.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Eco-PiNN: A Physics-informed Neural Network for Eco-toll Estimation. (arXiv:2301.05739v1 [cs.LG])
    The eco-toll estimation problem quantifies the expected environmental cost (e.g., energy consumption, exhaust emissions) for a vehicle to travel along a path. This problem is important for societal applications such as eco-routing, which aims to find paths with the lowest exhaust emissions or energy need. The challenges of this problem are three-fold: (1) the dependence of a vehicle's eco-toll on its physical parameters; (2) the lack of access to data with eco-toll information; and (3) the influence of contextual information (i.e. the connections of adjacent segments in the path) on the eco-toll of road segments. Prior work on eco-toll estimation has mostly relied on pure data-driven approaches and has high estimation errors given the limited training data. To address these limitations, we propose a novel Eco-toll estimation Physics-informed Neural Network framework (Eco-PiNN) using three novel ideas, namely, (1) a physics-informed decoder that integrates the physical laws of the vehicle engine into the network, (2) an attention-based contextual information encoder, and (3) a physics-informed regularization to reduce overfitting. Experiments on real-world heavy-duty truck data show that the proposed method can greatly improve the accuracy of eco-toll estimation compared with state-of-the-art methods.  ( 2 min )
    Diatom-inspired architected materials using language-based deep learning: Perception, transformation and manufacturing. (arXiv:2301.05875v1 [cond-mat.mtrl-sci])
    Learning from nature has been a quest of humanity for millennia. While this has taken the form of humans assessing natural designs such as bones, butterfly wings, or spider webs, we can now generate designs using advanced computational algorithms. In this paper we report novel biologically inspired designs of diatom structures, enabled by transformer neural networks that use natural language models to learn, process and transfer insights across manifestations. We illustrate a series of novel diatom-based designs and also report a manufactured specimen, created using additive manufacturing. The method applied here could be expanded to focus on other biological design cues, implement a systematic optimization to meet certain design targets, and include a hybrid set of material design sets.  ( 2 min )
    Artificial Benchmark for Community Detection with Outliers (ABCD+o). (arXiv:2301.05749v1 [cs.SI])
    The Artificial Benchmark for Community Detection graph (ABCD) is a random graph model with community structure and power-law distribution for both degrees and community sizes. The model generates graphs with properties similar to those of the well-known LFR model, and its main parameter $\xi$ can be tuned to mimic its counterpart in the LFR model, the mixing parameter $\mu$. In this paper, we extend the ABCD model to include potential outliers. We perform some exploratory experiments on both the new ABCD+o model and a real-world network to show that outliers possess some desired, distinguishable properties.  ( 2 min )
    Efficient Activation Function Optimization through Surrogate Modeling. (arXiv:2301.05785v1 [cs.LG])
    Carefully designed activation functions can improve the performance of neural networks in many machine learning tasks. However, it is difficult for humans to construct optimal activation functions, and current activation function search algorithms are prohibitively expensive. This paper aims to improve the state of the art through three steps: First, the benchmark datasets Act-Bench-CNN, Act-Bench-ResNet, and Act-Bench-ViT were created by training convolutional, residual, and vision transformer architectures from scratch with 2,913 systematically generated activation functions. Second, a characterization of the benchmark space was developed, leading to a new surrogate-based method for optimization. More specifically, the spectrum of the Fisher information matrix associated with the model's predictive distribution at initialization and the activation function's output distribution were found to be highly predictive of performance. Third, the surrogate was used to discover improved activation functions in CIFAR-100 and ImageNet tasks. Each of these steps is a contribution in its own right; together they serve as a practical and theoretical foundation for further research on activation function optimization. Code is available at https://github.com/cognizant-ai-labs/aquasurf, and the benchmark datasets are at https://github.com/cognizant-ai-labs/act-bench.  ( 2 min )
    Fairness and Sequential Decision Making: Limits, Lessons, and Opportunities. (arXiv:2301.05753v1 [cs.CY])
    As automated decision making and decision assistance systems become common in everyday life, research on the prevention or mitigation of potential harms that arise from decisions made by these systems has proliferated. However, various research communities have independently conceptualized these harms, envisioned potential applications, and proposed interventions. The result is a somewhat fractured landscape of literature focused generally on ensuring decision-making algorithms "do the right thing". In this paper, we compare and discuss work across two major subsets of this literature: algorithmic fairness, which focuses primarily on predictive systems, and ethical decision making, which focuses primarily on sequential decision making and planning. We explore how each of these settings has articulated its normative concerns, the viability of different techniques for these different settings, and how ideas from each setting may have utility for the other.  ( 2 min )
  • Open

    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v5 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.  ( 2 min )
    Unbalanced Optimal Transport, from Theory to Numerics. (arXiv:2211.08775v2 [stat.ML] UPDATED)
    Optimal Transport (OT) has recently emerged as a central tool in data sciences to compare point clouds, and more generally probability distributions, in a geometrically faithful way. The wide adoption of OT into existing data analysis and machine learning pipelines is, however, hampered by several shortcomings. These include its lack of robustness to outliers, its high computational cost, the need for a large number of samples in high dimension, and the difficulty of handling data in distinct spaces. In this review, we detail several recently proposed approaches to mitigate these issues. We emphasize in particular unbalanced OT, which compares arbitrary positive measures, not restricted to probability distributions (i.e. their total mass can vary). This generalization of OT makes it robust to outliers and missing data. The second workhorse of modern computational OT is entropic regularization, which leads to scalable algorithms while lowering the sample complexity in high dimension. The last point presented in this review is the Gromov-Wasserstein (GW) distance, which extends OT to cope with distributions belonging to different metric spaces. The main motivation for this review is to explain how unbalanced OT, entropic regularization and GW can work hand-in-hand to turn OT into efficient geometric loss functions for data sciences.  ( 2 min )
    Primal Dual Alternating Proximal Gradient Algorithms for Nonsmooth Nonconvex Minimax Problems with Coupled Linear Constraints. (arXiv:2212.04672v2 [math.OC] UPDATED)
    Nonconvex minimax problems have attracted wide attention in machine learning, signal processing and many other fields in recent years. In this paper, we propose a primal dual alternating proximal gradient (PDAPG) algorithm and a primal dual proximal gradient (PDPG-L) algorithm for solving nonsmooth nonconvex-(strongly) concave and nonconvex-linear minimax problems with coupled linear constraints, respectively. The iteration complexity of the two algorithms is proved to be $\mathcal{O}\left( \varepsilon ^{-2} \right)$ (resp. $\mathcal{O}\left( \varepsilon ^{-4} \right)$) under the nonconvex-strongly concave (resp. nonconvex-concave) setting and $\mathcal{O}\left( \varepsilon ^{-3} \right)$ under the nonconvex-linear setting to reach an $\varepsilon$-stationary point, respectively. To our knowledge, they are the first two algorithms with an iteration complexity guarantee for solving nonconvex minimax problems with coupled linear constraints.  ( 2 min )
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.  ( 2 min )
    Linear Convergence of ISTA and FISTA. (arXiv:2212.06319v2 [math.OC] UPDATED)
    In this paper, we revisit the class of iterative shrinkage-thresholding algorithms (ISTA) for solving the linear inverse problem with sparse representation, which arises in signal and image processing. A numerical image-deblurring experiment shows that the convergence behavior, plotted on a logarithmic-scale ordinate, tends to be linear rather than logarithmic and approximately flat. On closer inspection, we find that the previous assumption that the smooth part is merely convex weakens the least-squares model. Specifically, assuming the smooth part to be strongly convex is more reasonable for the least-squares model, even though the image matrix is probably ill-conditioned. Furthermore, we tighten the pivotal inequality, first found in [Li et al., 2022], for composite optimization in which the smooth part is strongly convex rather than generally convex. Based on this pivotal inequality, we generalize the linear convergence to composite optimization in both the objective value and the squared proximal subgradient norm. Meanwhile, we replace the original blur matrix with a simple ill-conditioned matrix whose singular values are easy to compute. The new numerical experiment shows that the proximal generalization of Nesterov's accelerated gradient descent (NAG) for strongly convex functions has a faster linear convergence rate than ISTA. Based on the tighter pivotal inequality, we also generalize the faster linear convergence rate to composite optimization, in both the objective value and the squared proximal subgradient norm, by taking advantage of the well-constructed Lyapunov function with a slight modification and the phase-space representation based on the high-resolution differential equation framework from the implicit-velocity scheme.  ( 2 min )
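    For readers unfamiliar with ISTA, the basic iteration for the composite objective $\frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$ can be sketched as a gradient step on the smooth part followed by soft-thresholding. The sketch below is a plain-Python illustration only; the names `ista`, `soft_threshold`, `lam` and `step` are assumptions, and it does not implement the accelerated (FISTA/NAG) variant or the analysis from the paper.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |.|: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(A, b, lam, step, n_iters=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 with plain ISTA.

    `step` should not exceed 1 / L, where L is the largest eigenvalue of A^T A.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(n_iters):
        # Residual r = A x - b and gradient g = A^T r of the smooth part.
        r = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        g = [sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the l1 proximal step.
        x = [soft_threshold(x[j] - step * g[j], step * lam) for j in range(n)]
    return x

# Toy usage: with A = I the minimizer is the soft-thresholded b.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, 0.5]
x_hat = ista(A, b, lam=1.0, step=1.0)
```

    In this toy case the second coordinate is set exactly to zero, which is the sparsity-inducing behavior the shrinkage step is meant to provide.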
    Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms. (arXiv:2209.00735v2 [cs.LG] UPDATED)
    Neural networks (NNs) struggle to efficiently solve certain problems, such as learning parities, even when there are simple learning algorithms for those problems. Can NNs discover learning algorithms on their own? We exhibit a NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized program. For example, on parity problems, the NN learns as well as Gaussian elimination, an efficient algorithm that can be succinctly described. Our architecture combines both recurrent weight sharing between layers and convolutional weight sharing to reduce the number of parameters down to a constant, even though the network itself may have trillions of nodes. While in practice the constants in our analysis are too large to be directly meaningful, our work suggests that the synergy of Recurrent and Convolutional NNs (RCNNs) may be more natural and powerful than either alone, particularly for concisely parameterizing discrete algorithms.  ( 2 min )
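    The abstract cites Gaussian elimination as a succinct, efficient learner for parity problems. A minimal sketch of that classical baseline (not of the RCNN architecture) is given below: each sample is a bitmask with a label equal to the parity of a hidden subset of bits, and elimination over GF(2) recovers a consistent subset. The function names and the toy secret are hypothetical, and noise-free labels are assumed.

```python
import random

def parity(v):
    """Parity (0 or 1) of the number of set bits in v."""
    return bin(v).count("1") & 1

def learn_parity(samples, n):
    """Return a bitmask w with parity(w & x) == y for every (x, y) sample."""
    # Pack each equation: feature bits in positions 0..n-1, label in bit n.
    rows = [x | (y << n) for x, y in samples]
    pivots = []  # (column, fully reduced pivot row)
    for col in range(n):
        idx = next((i for i, r in enumerate(rows) if (r >> col) & 1), None)
        if idx is None:
            continue  # free variable; it is left at 0
        pivot = rows.pop(idx)
        # Eliminate this column from the remaining rows and earlier pivots.
        rows = [r ^ pivot if (r >> col) & 1 else r for r in rows]
        pivots = [(c, p ^ pivot if (p >> col) & 1 else p) for c, p in pivots]
        pivots.append((col, pivot))
    if any(rows):
        raise ValueError("labels are not consistent with any parity function")
    w = 0
    for col, p in pivots:
        if (p >> n) & 1:  # the label bit gives the value of this variable
            w |= 1 << col
    return w

# Toy usage: hidden parity over bits 0, 2 and 5 of an 8-bit input.
rng = random.Random(0)
n = 8
secret = 0b00100101
samples = []
for _ in range(64):
    x = rng.getrandbits(n)
    samples.append((x, parity(x & secret)))
w = learn_parity(samples, n)
```

    With enough random samples the system has full rank and the recovered mask coincides with the hidden subset; otherwise it is merely one of several subsets consistent with the data.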
    Failure-informed adaptive sampling for PINNs. (arXiv:2210.00279v3 [math.NA] UPDATED)
    Physics-informed neural networks (PINNs) have emerged as an effective technique for solving PDEs in a wide range of domains. It has been observed, however, that the performance of PINNs can vary dramatically with different sampling procedures. For instance, a fixed set of training points chosen a priori may fail to capture the effective solution region (especially for problems with singularities). To overcome this issue, we present in this work an adaptive strategy, termed failure-informed PINNs (FI-PINNs), which is inspired by the viewpoint of reliability analysis. The key idea is to define an effective failure probability based on the residual; then, with the aim of placing more samples in the failure region, FI-PINNs employ a failure-informed enrichment technique to adaptively add new collocation points to the training set, such that the numerical accuracy is dramatically improved. In short, similar to adaptive finite element methods, the proposed FI-PINNs adopt the failure probability as the posterior error indicator to generate new training points. We prove rigorous error bounds for FI-PINNs and illustrate their performance through several problems.  ( 2 min )
    Nonlinear Independent Component Analysis for Discrete-Time and Continuous-Time Signals. (arXiv:2102.02876v3 [stat.ML] UPDATED)
    We study the classical problem of recovering a multidimensional source signal from observations of nonlinear mixtures of this signal. We show that this recovery is possible (up to a permutation and monotone scaling of the source's original component signals) if the mixture is due to a sufficiently differentiable and invertible but otherwise arbitrarily nonlinear function and the component signals of the source are statistically independent with 'non-degenerate' second-order statistics. The latter assumption requires the source signal to meet one of three regularity conditions which essentially ensure that the source is sufficiently far away from the non-recoverable extremes of being deterministic or constant in time. These assumptions, which cover many popular time series models and stochastic processes, allow us to reformulate the initial problem of nonlinear blind source separation as a simple-to-state problem of optimisation-based function approximation. We propose to solve this approximation problem by minimizing a novel type of objective function that efficiently quantifies the mutual statistical dependence between multiple stochastic processes via cumulant-like statistics. This yields a scalable and direct new method for nonlinear Independent Component Analysis with widely applicable theoretical guarantees and for which our experiments indicate good performance.  ( 2 min )
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v2 [cs.LG] UPDATED)
    We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev-Cantelli inequality (a.k.a. one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive a new concentration of measure inequality, which we name PAC-Bayes-Bennett, since it combines PAC-Bayesian bounding with Bennett's inequality. We use it for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality of Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work of Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.  ( 2 min )
    Universal Prediction Band via Semi-Definite Programming. (arXiv:2103.17203v3 [stat.ML] UPDATED)
    We propose a computationally efficient method to construct nonparametric, heteroscedastic prediction bands for uncertainty quantification, with or without any user-specified predictive model. Our approach provides an alternative to the now-standard conformal prediction for uncertainty quantification, with novel theoretical insights and computational advantages. The data-adaptive prediction band is universally applicable with minimal distributional assumptions, has strong non-asymptotic coverage properties, and is easy to implement using standard convex programs. Our approach can be viewed as a novel variance interpolation with confidence and further leverages techniques from semi-definite programming and sum-of-squares optimization. Theoretical and numerical performances for the proposed approach for uncertainty quantification are analyzed.  ( 2 min )
    Dynamically Mitigating Data Discrepancy with Balanced Focal Loss for Replay Attack Detection. (arXiv:2006.14563v2 [cs.CV] UPDATED)
    It has become urgent to design effective anti-spoofing algorithms for vulnerable automatic speaker verification systems due to the advancement of high-quality playback devices. Current studies mainly treat anti-spoofing as a binary classification problem between bonafide and spoofed utterances, while the lack of indistinguishable samples makes it difficult to train a robust spoofing detector. In this paper, we argue that, for anti-spoofing, indistinguishable samples need more attention than easily classified ones in the modeling process, to make correct discrimination a top priority. Therefore, to mitigate the data discrepancy between training and inference, we propose D3M, which leverages a balanced focal loss function as the training objective to dynamically scale the loss based on the traits of the sample itself. Besides, in the experiments, we select three kinds of features that contain both magnitude-based and phase-based information to form complementary and informative features. Experimental results on the ASVspoof2019 dataset demonstrate the superiority of the proposed methods by comparison between our systems and top-performing ones. Systems trained with the balanced focal loss perform significantly better than those trained with the conventional cross-entropy loss. With complementary features, our fusion system with only three kinds of features outperforms other systems containing five or more complex single models by 22.5% for min-tDCF and 7% for EER, achieving a min-tDCF and an EER of 0.0124 and 0.55% respectively. Furthermore, we present and discuss the evaluation results on real replay data apart from the simulated ASVspoof2019 data, indicating that research for anti-spoofing still has a long way to go. Source code, analysis data, and other details are publicly available at https://github.com/asvspoof/D3M.  ( 3 min )
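    To make the idea of dynamically scaling the loss concrete, here is a minimal sketch of the standard alpha-balanced focal loss for binary classification, $-\alpha_t (1-p_t)^{\gamma} \log p_t$. This is an assumption about the general form only: the exact loss, class weighting, and hyperparameters used in D3M may differ, and the names `alpha` and `gamma` below are illustrative defaults rather than values from the paper.

```python
import math

def balanced_focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one sample.

    p: predicted probability of the positive class (0 < p < 1)
    y: true label, 1 or 0
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Toy usage: a confidently correct sample versus a badly misclassified one.
easy = balanced_focal_loss(0.9, 1)
hard = balanced_focal_loss(0.1, 1)
```

    With `gamma=0` the modulating factor disappears and the expression reduces to a class-weighted cross-entropy, which is a convenient sanity check.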
    Elastic Similarity and Distance Measures for Multivariate Time Series. (arXiv:2102.10231v2 [cs.LG] UPDATED)
    This paper contributes multivariate versions of seven commonly used elastic similarity and distance measures for time series data analytics. Elastic similarity and distance measures are a class of similarity measures that can compensate for misalignments in the time axis of time series data. We adapt two existing strategies used in a multivariate version of the well-known Dynamic Time Warping (DTW), namely, Independent and Dependent DTW, to these seven measures. While these measures can be applied to various time series analysis tasks, we demonstrate their utility on multivariate time series classification using the nearest neighbor classifier. On 23 well-known datasets, we demonstrate that each of the measures but one achieves the highest accuracy relative to others on at least one dataset, supporting the value of developing a suite of multivariate similarity and distance measures. We also demonstrate that there are datasets for which either the dependent versions of all measures are more accurate than their independent counterparts or vice versa. In addition, we construct a nearest neighbor-based ensemble of the measures and show that it is competitive with other state-of-the-art single-strategy multivariate time series classifiers.  ( 2 min )
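    The difference between the Independent and Dependent strategies can be sketched for DTW itself: the dependent version warps all channels along one shared path using a vector distance per time step, whereas the independent version warps each channel separately and sums the results. The code below is a minimal unconstrained DTW with squared differences; the function names and the toy series are illustrative assumptions, and the other six measures from the paper are not shown.

```python
def dtw(a, b, dist):
    """Classic dynamic-programming DTW cost between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step_cost = dist(a[i - 1], b[j - 1])
            D[i][j] = step_cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def dtw_dependent(A, B):
    """One shared warping path; each time step is a tuple of channel values."""
    return dtw(A, B, lambda u, v: sum((p - q) ** 2 for p, q in zip(u, v)))

def dtw_independent(A, B):
    """Warp every channel on its own and add up the per-channel costs."""
    n_channels = len(A[0])
    return sum(
        dtw([s[k] for s in A], [s[k] for s in B], lambda p, q: (p - q) ** 2)
        for k in range(n_channels)
    )

# Toy usage: the two channels are shifted in opposite directions.
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
B = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
cost_independent = dtw_independent(A, B)
cost_dependent = dtw_dependent(A, B)
```

    Because any shared path is also a valid path for each channel, the independent cost never exceeds the dependent one; which of the two is more useful for classification is an empirical matter, as the abstract notes.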
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v2 [cs.CV] UPDATED)
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.  ( 2 min )
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v3 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics by minimizing the population loss that are more suitable in active learning than the one used in state-of-the-art (SOTA) related work. Then, the proposed algorithm leverages the powerful representation of NNs for both exploitation and exploration, has the query decision-maker tailored for $k$-class classification problems with the performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to an instance-dependent regret upper bound, roughly improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.  ( 2 min )
    Generic Error Bounds for the Generalized Lasso with Sub-Exponential Data. (arXiv:2004.05361v3 [math.ST] UPDATED)
    This work performs a non-asymptotic analysis of the generalized Lasso under the assumption of sub-exponential data. Our main results continue recent research on the benchmark case of (sub-)Gaussian sample distributions and thereby explore what conclusions are still valid when going beyond. While many statistical features remain unaffected (e.g., consistency and error decay rates), the key difference becomes manifested in how the complexity of the hypothesis set is measured. It turns out that the estimation error can be controlled by means of two complexity parameters that arise naturally from a generic-chaining-based proof strategy. The output model can be non-realizable, while the only requirement for the input vector is a generic concentration inequality of Bernstein-type, which can be implemented for a variety of sub-exponential distributions. This abstract approach allows us to reproduce, unify, and extend previously known guarantees for the generalized Lasso. In particular, we present applications to semi-parametric output models and phase retrieval via the lifted Lasso. Moreover, our findings are discussed in the context of sparse recovery and high-dimensional estimation problems.  ( 2 min )
    Black-box Coreset Variational Inference. (arXiv:2211.02377v2 [stat.ML] UPDATED)
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.  ( 2 min )
    On the Exactness of Dantzig-Wolfe Relaxation for Rank Constrained Optimization Problems. (arXiv:2210.16191v2 [math.OC] UPDATED)
    The rank-constrained optimization problem (RCOP) minimizes a linear objective function over a prespecified closed rank-constrained domain set and $m$ generic two-sided linear matrix inequalities. Motivated by the Dantzig-Wolfe (DW) decomposition, a popular approach to solving many nonconvex optimization problems, we investigate the strength of the DW relaxation (DWR) of the RCOP, which admits the same formulation as the RCOP except that the domain set is replaced by its closed convex hull. Notably, our goal is to characterize conditions under which the DWR matches the RCOP for any $m$ two-sided linear matrix inequalities. From the primal perspective, we develop the first-known simultaneously necessary and sufficient conditions that achieve: (i) extreme point exactness -- all the extreme points of the DWR feasible set belong to that of the RCOP; (ii) convex hull exactness -- the DWR feasible set is identical to the closed convex hull of the RCOP feasible set; and (iii) objective exactness -- the optimal values of the DWR and RCOP coincide. The proposed conditions unify, refine, and extend the existing exactness results in the quadratically constrained quadratic program (QCQP) and fair unsupervised learning. These conditions can be very useful for identifying new results, including the extreme point exactness for a QCQP problem that admits an inhomogeneous objective function with two homogeneous two-sided quadratic constraints and the convex hull exactness for fair SVD.  ( 2 min )
    Outlier Robust and Sparse Estimation of Linear Regression Coefficients. (arXiv:2208.11592v2 [math.ST] UPDATED)
    We consider outlier-robust and sparse estimation of linear regression coefficients, when covariate vectors and noises are sampled, respectively, from an $\mathfrak{L}$-subGaussian distribution and a heavy-tailed distribution. Additionally, the covariate vectors and noises are contaminated by adversarial outliers. We deal with two cases: the covariance matrix of the covariates is known or unknown. Particularly, in the known case, our estimator can attain a nearly information theoretical optimal error bound, and our error bound is sharper than those of earlier studies dealing with similar situations. Our estimator analysis relies heavily on generic chaining to derive sharp error bounds.  ( 2 min )
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v3 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.  ( 2 min )
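    One simple way to picture optimization directly on the unit sphere with a fixed learning rate is a projected step: remove the radial component of the gradient, take the step, and renormalize. The sketch below is a toy illustration under that assumption; it is not claimed to be the paper's exact procedure, and the loss, the names `sphere_sgd_step` and `target`, and the learning rate are all hypothetical.

```python
import math

def normalize(w):
    """Rescale w to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in w))
    return [c / norm for c in w]

def sphere_sgd_step(w, grad, lr):
    """One gradient step constrained to the unit sphere."""
    # Project the gradient onto the tangent space at w (drop the radial part).
    radial = sum(g * c for g, c in zip(grad, w))
    tangent = [g - radial * c for g, c in zip(grad, w)]
    # Step along the tangent direction, then map back onto the sphere.
    return normalize([c - lr * t for c, t in zip(w, tangent)])

# Toy usage: minimize L(w) = -<w, target> over unit vectors w.
target = [0.0, 1.0, 0.0]
w = [1.0, 0.0, 0.0]
for _ in range(200):
    grad = [-t for t in target]  # gradient of the toy loss
    w = sphere_sgd_step(w, grad, lr=0.1)
alignment = sum(c * t for c, t in zip(w, target))
```

    With this small learning rate the iterate converges toward `target`; the abstract's point is that larger fixed rates can instead produce equilibrium-like or divergent behavior on real scale-invariant networks.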
    Relay Variational Inference: A Method for Accelerated Encoderless VI. (arXiv:2110.13422v2 [cs.LG] UPDATED)
    Variational Inference (VI) offers a method for approximating intractable likelihoods. In neural VI, inference of approximate posteriors is commonly done using an encoder. Alternatively, encoderless VI offers a framework for learning generative models from data without encountering suboptimalities caused by amortization via an encoder (e.g. in presence of missing or uncertain data). However, in absence of an encoder, such methods often suffer in convergence due to the slow nature of gradient steps required to learn the approximate posterior parameters. In this paper, we introduce Relay VI (RVI), a framework that dramatically improves both the convergence and performance of encoderless VI. In our experiments over multiple datasets, we study the effectiveness of RVI in terms of convergence speed, loss, representation power and missing data imputation. We find RVI to be a unique tool, often superior in both performance and convergence speed to previously proposed encoderless as well as amortized VI models (e.g. VAE).  ( 2 min )
    Spectrum of non-Hermitian deep-Hebbian neural networks. (arXiv:2208.11411v2 [q-bio.NC] UPDATED)
    Neural networks with recurrent asymmetric couplings are important for understanding how episodic memories are encoded in the brain. Here, we integrate the experimental observation of a wide synaptic integration window into our model of sequence retrieval in continuous-time dynamics. The model with non-normal neuron interactions is studied theoretically by deriving a random matrix theory of the Jacobian matrix in neural dynamics. The spectrum bears several distinct features, such as breaking rotational symmetry about the origin, and the emergence of nested voids within the spectrum boundary. The spectral density is thus highly non-uniformly distributed in the complex plane. The random matrix theory also predicts a transition to chaos. In particular, the edge of chaos provides computational benefits for the sequential retrieval of memories. Our work provides a systematic study of time-lagged correlations with arbitrary time delays, and thus can inspire future studies of a broad class of memory models, and even big data analysis of biological time series.  ( 2 min )
    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies. (arXiv:2204.01058v2 [math.PR] UPDATED)
    This article considers fully connected neural networks with Gaussian random weights and biases as well as $L$ hidden layers, each of width proportional to a large parameter $n$. For polynomially bounded non-linearities we give sharp estimates in powers of $1/n$ for the joint cumulants of the network output and its derivatives. Moreover, we show that network cumulants form a perturbatively solvable hierarchy in powers of $1/n$ in that $k$-th order cumulants in one layer have recursions that depend to leading order in $1/n$ only on $j$-th order cumulants at the previous layer with $j\leq k$. By solving a variety of such recursions, however, we find that the depth-to-width ratio $L/n$ plays the role of an effective network depth, controlling both the scale of fluctuations at individual neurons and the size of inter-neuron correlations. Thus, while the cumulant recursions we derive form a hierarchy in powers of $1/n$, contributions of order $1/n^k$ often grow like $L^k$ and are hence non-negligible at positive $L/n$. We use this to study a somewhat simplified version of the exploding and vanishing gradient problem, proving that this particular variant occurs if and only if $L/n$ is large. Several key ideas in this article were first developed at a physics level of rigor in a recent monograph of Daniel A. Roberts, Sho Yaida, and the author. This article not only makes these ideas mathematically precise but also significantly extends them, opening the way to obtaining corrections to all orders in $1/n$.  ( 2 min )
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v3 [cs.LG] UPDATED)
    There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically easy but computationally hard. Empirically, we find that a variety of neural networks successfully learn sparse parities, with discontinuous phase transitions in the training curves. On small instances, learning abruptly occurs at approximately $n^{O(k)}$ iterations; this nearly matches SQ lower bounds, despite the apparent lack of a sparse prior. Our theoretical analysis shows that these observations are not explained by a Langevin-like mechanism, whereby SGD "stumbles in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time). Instead, we show that SGD gradually amplifies the sparse solution via a Fourier gap in the population gradient, making continual progress that is invisible to loss and error metrics.  ( 2 min )
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v4 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.  ( 2 min )
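    The reweighting step mentioned in this abstract can be illustrated with plain importance sampling: samples drawn from the training distribution are weighted by the density ratio of the shifted distribution to estimate the loss under shift. The sketch assumes a one-dimensional Gaussian mean shift, which is far simpler than the causal, parametric shifts studied in the paper; `shifted_loss_estimate` and `delta` are hypothetical names.

```python
import math
import random

def shifted_loss_estimate(samples, loss, delta):
    """Estimate E_q[loss(X)] for q = N(delta, 1) using samples from p = N(0, 1)."""
    total = 0.0
    for x in samples:
        # Density ratio q(x) / p(x) for a unit-variance Gaussian mean shift.
        weight = math.exp(delta * x - 0.5 * delta ** 2)
        total += weight * loss(x)
    return total / len(samples)

# Toy usage: with loss(x) = x, the true value under the shift is delta.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
estimate = shifted_loss_estimate(samples, lambda x: x, delta=0.5)
```

    The variance of such estimates grows quickly with the size of the shift, which is one reason the abstract turns to a local second-order approximation for small shifts.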
    When saliency goes off on a tangent: Interpreting Deep Neural Networks with nonlinear saliency maps. (arXiv:2110.06639v3 [cs.LG] UPDATED)
    A fundamental bottleneck in utilising complex machine learning systems for critical applications has been not knowing why they do what they do, thus preventing the development of any crucial safety protocols. To date, no method exists that can provide full insight into the granularity of the neural network's decision process. In the past, saliency maps were an early attempt at resolving this problem through sensitivity calculations, whereby dimensions of a data point are selected based on how sensitive the output of the system is to them. However, the success of saliency maps has been at best limited, mainly due to the fact that they interpret the underlying learning system through a linear approximation. We present a novel class of methods for generating nonlinear saliency maps which fully account for the nonlinearity of the underlying learning system. While agreeing with linear saliency maps on simple problems where linear saliency maps are correct, they clearly identify more specific drivers of classification on complex examples where nonlinearities are more pronounced. This new class of methods significantly aids interpretability of deep neural networks and related machine learning systems. Crucially, they provide a starting point for their broader use in serious applications, where 'why' is as important as 'what'.  ( 2 min )
    AutoML Two-Sample Test. (arXiv:2206.08843v3 [cs.LG] UPDATED)
    Two-sample tests are important in statistics and machine learning, both as tools for scientific discovery as well as to detect distribution shifts. This led to the development of many sophisticated test procedures going beyond the standard supervised learning frameworks, whose usage can require specialized knowledge about two-sample testing. We use a simple test that takes the mean discrepancy of a witness function as the test statistic and prove that minimizing a squared loss leads to a witness with optimal testing power. This allows us to leverage recent advancements in AutoML. Without any user input about the problems at hand, and using the same method for all our experiments, our AutoML two-sample test achieves competitive performance on a diverse distribution shift benchmark as well as on challenging two-sample testing problems. We provide an implementation of the AutoML two-sample test in the Python package autotst.  ( 2 min )
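    The testing recipe described here (fit a witness by minimizing a squared loss, then use its mean discrepancy on held-out data as the test statistic) can be sketched with a deliberately simple witness. The code below substitutes a one-dimensional least-squares line for the AutoML model and calibrates with a permutation test; it does not use or imitate the `autotst` package API, and all names are hypothetical.

```python
import random
from statistics import mean

def fit_linear_witness(xs, ys):
    """Least-squares regression of labels (+1 for xs, -1 for ys) on a 1-D feature."""
    data = [(x, 1.0) for x in xs] + [(y, -1.0) for y in ys]
    mean_v = mean([v for v, _ in data])
    mean_t = mean([t for _, t in data])
    var = sum((v - mean_v) ** 2 for v, _ in data)
    slope = sum((v - mean_v) * (t - mean_t) for v, t in data) / var if var > 0 else 0.0
    return lambda v: mean_t + slope * (v - mean_v)

def witness_test(xs, ys, n_perm=500, seed=0):
    """Return (statistic, permutation p-value) using a train/test split."""
    rng = random.Random(seed)
    half_x, half_y = len(xs) // 2, len(ys) // 2
    witness = fit_linear_witness(xs[:half_x], ys[:half_y])
    hx = [witness(v) for v in xs[half_x:]]
    hy = [witness(v) for v in ys[half_y:]]
    stat = mean(hx) - mean(hy)  # mean discrepancy of the witness on held-out data
    pooled = hx + hy
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stat = mean(pooled[:len(hx)]) - mean(pooled[len(hx):])
        if perm_stat >= stat:
            exceed += 1
    return stat, (exceed + 1) / (n_perm + 1)

# Toy usage: two Gaussians whose means differ by 2.
rng = random.Random(1)
xs = [rng.gauss(2.0, 1.0) for _ in range(200)]
ys = [rng.gauss(0.0, 1.0) for _ in range(200)]
stat, p_value = witness_test(xs, ys)
```

    Splitting the data keeps the witness independent of the samples used for the test statistic, so the permutation p-value remains valid regardless of how flexible the witness model is.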
    The Missing Invariance Principle Found -- the Reciprocal Twin of Invariant Risk Minimization. (arXiv:2205.14546v2 [cs.LG] UPDATED)
    Machine learning models often generalize poorly to out-of-distribution (OOD) data as a result of relying on features that are spuriously correlated with the label during training. Recently, the technique of Invariant Risk Minimization (IRM) was proposed to learn predictors that only use invariant features by conserving the feature-conditioned label expectation $\mathbb{E}_e[y|f(x)]$ across environments. However, more recent studies have demonstrated that IRM-v1, a practical version of IRM, can fail in various settings. Here, we identify a fundamental flaw of the IRM formulation that causes the failure. We then introduce a complementary notion of invariance, MRI, based on conserving the label-conditioned feature expectation $\mathbb{E}_e[f(x)|y]$, which is free of this flaw. Further, we introduce a simplified, practical version of the MRI formulation called MRI-v1. We prove that for general linear problems, MRI-v1 guarantees invariant predictors given a sufficient number of environments. We also empirically demonstrate that MRI-v1 strongly outperforms IRM-v1 and consistently achieves near-optimal OOD generalization in image-based nonlinear problems.  ( 2 min )
    RenyiCL: Contrastive Representation Learning with Skew Renyi Divergence. (arXiv:2208.06270v2 [stat.ML] UPDATED)
    Contrastive representation learning seeks to acquire useful representations by estimating the shared information between multiple views of data. Here, the quality of the learned representations is sensitive to the choice of data augmentation: as harder data augmentations are applied, the views share more task-relevant information, but also more task-irrelevant information that can hinder the generalization capability of the representation. Motivated by this, we present a new robust contrastive learning scheme, coined RényiCL, which can effectively manage harder augmentations by utilizing Rényi divergence. Our method is built upon the variational lower bound of Rényi divergence, but a naïve usage of a variational method is impractical due to the large variance. To tackle this challenge, we propose a novel contrastive objective that conducts variational estimation of a skew Rényi divergence and provide a theoretical guarantee on how variational estimation of skew divergence leads to stable training. We show that Rényi contrastive learning objectives perform innate hard negative sampling and easy positive sampling simultaneously so that they can selectively learn useful features and ignore nuisance features. Through experiments on ImageNet, we show that Rényi contrastive learning with stronger augmentations outperforms other self-supervised methods without extra regularization or computational overhead. Moreover, we also validate our method on other domains such as graph and tabular data, showing empirical gains over other contrastive methods.  ( 2 min )
    The Mechanism of Prediction Head in Non-contrastive Self-supervised Learning. (arXiv:2205.06226v3 [cs.LG] UPDATED)
    Recently the surprising discovery of the Bootstrap Your Own Latent (BYOL) method by Grill et al. shows the negative term in contrastive loss can be removed if we add the so-called prediction head to the network. This initiated the research of non-contrastive self-supervised learning. It is mysterious why even when there exist trivial collapsed global optimal solutions, neural networks trained by (stochastic) gradient descent can still learn competitive representations. This phenomenon is a typical example of implicit bias in deep learning and remains little understood. In this work, we present our empirical and theoretical discoveries on non-contrastive self-supervised learning. Empirically, we find that when the prediction head is initialized as an identity matrix with only its off-diagonal entries being trainable, the network can learn competitive representations even though the trivial optima still exist in the training objective. Theoretically, we present a framework to understand the behavior of the trainable, but identity-initialized prediction head. Under a simple setting, we characterized the substitution effect and acceleration effect of the prediction head. The substitution effect happens when learning the stronger features in some neurons can substitute for learning these features in other neurons through updating the prediction head. And the acceleration effect happens when the substituted features can accelerate the learning of other weaker features to prevent them from being ignored. These two effects enable the neural networks to learn all the features rather than focus only on learning the stronger features, which is likely the cause of the dimensional collapse phenomenon. To the best of our knowledge, this is also the first end-to-end optimization guarantee for non-contrastive methods using nonlinear neural networks with a trainable prediction head and normalization.  ( 3 min )
    coVariance Neural Networks. (arXiv:2205.15856v4 [cs.LG] UPDATED)
    Graph neural networks (GNN) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thus implying an advantage over standard PCA-based data analysis approaches that are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets also demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions; a feature that is infeasible for PCA-based approaches.  ( 2 min )
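To make the "covariance matrix as graph" idea concrete, here is a minimal sketch assuming the standard polynomial form of a graph convolutional filter, applied with the sample covariance matrix C in place of a graph shift operator: the output is the sum over k of h_k C^k x. The function name `covariance_filter` and the `taps` argument are hypothetical, and the exact layer used in the paper may differ.

```python
import numpy as np

def covariance_filter(samples, x, taps):
    """Apply sum_k taps[k] * C^k @ x, where C is the sample covariance of `samples`.

    samples: array of shape (n_samples, d) with d >= 2
    x:       array of shape (d,), the signal to filter
    taps:    sequence of filter coefficients h_0, h_1, ...
    """
    c = np.cov(np.asarray(samples, dtype=float), rowvar=False)  # (d, d)
    power = np.asarray(x, dtype=float)   # holds C^k @ x, starting with k = 0
    out = np.zeros_like(power)
    for h in taps:
        out = out + h * power
        power = c @ power
    return out
```

Because the filter is a polynomial in C rather than a projection onto individual eigenvectors, it never has to separate eigenvectors with close eigenvalues, which is the intuition behind the stability claim.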
    Analysis of autocorrelation times in Neural Markov Chain Monte Carlo simulations. (arXiv:2111.10189v3 [cond-mat.stat-mech] UPDATED)
    We provide a deepened study of autocorrelations in Neural Markov Chain Monte Carlo (NMCMC) simulations, a version of the traditional Metropolis algorithm which employs neural networks to provide independent proposals. We illustrate our ideas using the two-dimensional Ising model. We discuss several estimates of autocorrelation times in the context of NMCMC, some inspired by analytical results derived for the Metropolized Independent Sampler (MIS). We check their reliability by estimating them on a small system where analytical results can also be obtained. Based on the analytical results for MIS we propose a new loss function and study its impact on the autocorrelation times. Although this function's performance is slightly inferior to the traditional Kullback-Leibler divergence, it offers two training algorithms which in some situations may be beneficial. By studying a small, $4 \times 4$, system we gain access to the dynamics of the training process which we visualize using several observables. Furthermore, we quantitatively investigate the impact of imposing global discrete symmetries of the system in the neural network training process on the autocorrelation times. Finally, we propose a scheme incorporating partial heat-bath updates that considerably improves the quality of the training. The impact of the above enhancements is discussed for a $16 \times 16$ spin system. The summary of our findings may serve as guidance for the implementation of Neural Markov Chain Monte Carlo simulations for more complicated models.  ( 2 min )
    Split-kl and PAC-Bayes-split-kl Inequalities for Ternary Random Variables. (arXiv:2206.00706v2 [stat.ML] UPDATED)
    We present a new concentration of measure inequality for sums of independent bounded random variables, which we name a split-kl inequality. The inequality is particularly well-suited for ternary random variables, which naturally show up in a variety of problems, including analysis of excess losses in classification, analysis of weighted majority votes, and learning with abstention. We demonstrate that for ternary random variables the inequality is simultaneously competitive with the kl inequality, the Empirical Bernstein inequality, and the Unexpected Bernstein inequality, and in certain regimes outperforms all of them. It resolves an open question by Tolstikhin and Seldin [2013] and Mhammedi et al. [2019] on how to match simultaneously the combinatorial power of the kl inequality when the distribution happens to be close to binary and the power of Bernstein inequalities to exploit low variance when the probability mass is concentrated on the middle value. We also derive a PAC-Bayes-split-kl inequality and compare it with the PAC-Bayes-kl, PAC-Bayes-Empirical-Bennett, and PAC-Bayes-Unexpected-Bernstein inequalities in an analysis of excess losses and in an analysis of a weighted majority vote for several UCI datasets. Last but not least, our study provides the first direct comparison of the Empirical Bernstein and Unexpected Bernstein inequalities and their PAC-Bayes extensions.  ( 2 min )
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v2 [cs.LG] UPDATED)
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D molecular graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D molecular graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.  ( 2 min )
    Recipes for when Physics Fails: Recovering Robust Learning of Physics Informed Neural Networks. (arXiv:2110.13330v2 [cs.LG] UPDATED)
    Physics-informed Neural Networks (PINNs) have been shown to be effective in solving partial differential equations by capturing the physics induced constraints as a part of the training loss function. This paper shows that a PINN can be sensitive to errors in training data and overfit itself in dynamically propagating these errors over the domain of the solution of the PDE. It also shows how physical regularizations based on continuity criteria and conservation laws fail to address this issue and rather introduce problems of their own causing the deep network to converge to a physics-obeying local minimum instead of the global minimum. We introduce Gaussian Process (GP) based smoothing that recovers the performance of a PINN and promises a robust architecture against noise/errors in measurements. Additionally, we illustrate an inexpensive method of quantifying the evolution of uncertainty based on the variance estimation of GPs on boundary data. Robust PINN performance is also shown to be achievable by choice of sparse sets of inducing points based on sparsely induced GPs. We demonstrate the performance of our proposed methods and compare the results from existing benchmark models in literature for time-dependent Schrödinger and Burgers' equations.  ( 2 min )
    Sinkhorn Divergences for Unbalanced Optimal Transport. (arXiv:1910.12958v3 [math.OC] UPDATED)
    Optimal transport induces the Earth Mover's (Wasserstein) distance between probability distributions, a geometric divergence that is relevant to a wide range of problems. Over the last decade, two relaxations of optimal transport have been studied in depth: unbalanced transport, which is robust to the presence of outliers and can be used when distributions don't have the same total mass; entropy-regularized transport, which is robust to sampling noise and lends itself to fast computations using the Sinkhorn algorithm. This paper combines both lines of work to put robust optimal transport on solid ground. Our main contribution is a generalization of the Sinkhorn algorithm to unbalanced transport: our method alternates between the standard Sinkhorn updates and the pointwise application of a contractive function. This implies that entropic transport solvers on grid images, point clouds and sampled distributions can all be modified easily to support unbalanced transport, with a proof of linear convergence that holds in all settings. We then show how to use this method to define pseudo-distances on the full space of positive measures that satisfy key geometric axioms: (unbalanced) Sinkhorn divergences are differentiable, positive, definite, convex, statistically robust and avoid any "entropic bias" towards a shrinkage of the measures' supports.  ( 2 min )
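The alternation described above (a standard Sinkhorn update followed by a pointwise contractive map) can be sketched as follows. This assumes the marginal constraints are relaxed with a KL penalty of strength `rho`, in which case the contractive map reduces to raising each scaling update to the power `rho / (rho + eps)`; other penalties give other maps. The function name and parameters are hypothetical.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, cost, eps=0.5, rho=1.0, n_iter=500):
    """Entropic unbalanced OT with KL marginal penalties (a sketch, scaling form).

    a, b: positive weights of shape (n,) and (m,); total masses may differ
    cost: cost matrix of shape (n, m)
    eps:  entropic regularization strength
    rho:  marginal penalty strength; rho -> infinity recovers balanced Sinkhorn
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    kernel = np.exp(-np.asarray(cost, dtype=float) / eps)
    tau = rho / (rho + eps)          # exponent in (0, 1): the contractive step
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** tau      # Sinkhorn update, then damping
        v = (b / (kernel.T @ u)) ** tau
    return u[:, None] * kernel * v[None, :]  # transport plan
```

With a very large `rho` the exponent approaches 1 and the plan's marginals match `a` and `b`; with a finite `rho` the plan is allowed to create or destroy mass at a cost.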
    Generalization Error Bounds for Multiclass Sparse Linear Classifiers. (arXiv:2204.06264v2 [math.ST] UPDATED)
    We consider high-dimensional multiclass classification by sparse multinomial logistic regression. Unlike binary classification, in the multiclass setup one can think about an entire spectrum of possible notions of sparsity associated with different structural assumptions on the regression coefficients matrix. We propose a computationally feasible feature selection procedure based on penalized maximum likelihood with convex penalties capturing a specific type of sparsity at hand. In particular, we consider global sparsity, double row-wise sparsity, and low-rank sparsity, and show that with the properly chosen tuning parameters the derived plug-in classifiers attain the minimax generalization error bounds (in terms of misclassification excess risk) within the corresponding classes of multiclass sparse linear classifiers. The developed approach is general and can be adapted to other types of sparsity as well.  ( 2 min )
    Post-training Quantization for Neural Networks with Provable Guarantees. (arXiv:2201.11113v3 [cs.LG] UPDATED)
    While neural networks have been remarkably successful in a wide array of applications, implementing them in resource-constrained hardware remains an area of intense research. By replacing the weights of a neural network with quantized (e.g., 4-bit, or binary) counterparts, massive savings in computation cost, memory, and power consumption are attained. To that end, we generalize a post-training neural-network quantization method, GPFQ, that is based on a greedy path-following mechanism. Among other things, we propose modifications to promote sparsity of the weights, and rigorously analyze the associated error. Additionally, our error analysis expands the results of previous work on GPFQ to handle general quantization alphabets, showing that for quantizing a single-layer network, the relative square error essentially decays linearly in the number of weights -- i.e., level of over-parametrization. Our result holds across a range of input distributions and for both fully-connected and convolutional architectures thereby also extending previous results. To empirically evaluate the method, we quantize several common architectures with few bits per weight, and test them on ImageNet, showing only minor loss of accuracy compared to unquantized models. We also demonstrate that standard modifications, such as bias correction and mixed precision quantization, further improve accuracy.  ( 2 min )
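As an illustration of what a greedy path-following quantizer can look like, the sketch below quantizes one neuron's weights one at a time, each time picking the alphabet element that keeps the running output error on the data as small as possible. This is an assumption-laden reading of the mechanism named in the abstract, not the authors' reference implementation; the function name `greedy_quantize_neuron` is hypothetical.

```python
import numpy as np

def greedy_quantize_neuron(X, w, alphabet):
    """Greedily quantize weights w so that X @ q tracks X @ w.

    X:        data matrix of shape (m, N), one column per input feature
    w:        weight vector of shape (N,)
    alphabet: 1-D array of allowed quantized values
    Returns (q, u), where u = X @ w - X @ q is the final residual.
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    alphabet = np.asarray(alphabet, dtype=float)
    u = np.zeros(X.shape[0])          # running error accumulated so far
    q = np.empty_like(w)
    for t in range(len(w)):
        col = X[:, t]
        target = u + w[t] * col       # error if weight t were dropped entirely
        # Pick the alphabet element that minimizes the new running error.
        errors = [np.linalg.norm(target - p * col) for p in alphabet]
        q[t] = alphabet[int(np.argmin(errors))]
        u = target - q[t] * col
    return q, u
```

Each step needs only the current residual and one data column, which is what makes this kind of post-training quantizer cheap to run layer by layer.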
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v3 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of online label shift (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature and the lack of supervision make the problem challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal dynamic regret, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.  ( 2 min )
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v5 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min \left( H^{3/2} / N, H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.  ( 2 min )
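A short worked comparison may help in reading the rate above: it shows which of the two terms inside the minimum is active, using only the quantities $H$ and $N$ from the abstract. The numerical values are illustrative, and constants and logarithmic factors are ignored.

```latex
\[
\frac{H^{3/2}}{N} \le \frac{H}{\sqrt{N}}
\;\Longleftrightarrow\;
\sqrt{H} \le \sqrt{N}
\;\Longleftrightarrow\;
H \le N ,
\]
so the $H^{3/2}/N$ term is the smaller one once the expert dataset is at least as large as the horizon.
For example, with $H = 100$ and $N = 10^{4}$:
\[
\frac{H^{3/2}}{N} = \frac{10^{3}}{10^{4}} = 0.1,
\qquad
\frac{H}{\sqrt{N}} = \frac{100}{100} = 1,
\qquad
\frac{H^{2}}{N} = \frac{10^{4}}{10^{4}} = 1 .
\]
```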
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v3 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.  ( 2 min )
    Toward Explainable AI for Regression Models. (arXiv:2112.11407v2 [cs.LG] UPDATED)
    In addition to the impressive predictive power of machine learning (ML) models, more recently, explanation methods have emerged that enable an interpretation of complex non-linear learning models such as deep neural networks. Gaining a better understanding is especially important, e.g., for safety-critical ML applications or medical diagnostics. While such Explainable AI (XAI) techniques have reached significant popularity for classifiers, so far little attention has been devoted to XAI for regression models (XAIR). In this review, we clarify the fundamental conceptual differences of XAI for regression and classification tasks, establish novel theoretical insights and analysis for XAIR, provide demonstrations of XAIR on genuine practical regression problems, and finally discuss the challenges remaining for the field.  ( 2 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v4 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.  ( 2 min )
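To give a feel for the two approximation rates quoted above, the following worked example plugs illustrative values into them. The choices $d = 4$, $s = 3$, and $n = 10^{4}$ are assumptions for illustration only, and the hidden constants in the $\mathcal{O}(\cdot)$ bounds are ignored.

```latex
\[
\text{height-}s\text{ NestNet: } n^{-(s+1)/d} = n^{-4/4} = n^{-1},
\qquad
\text{standard network: } n^{-2/d} = n^{-2/4} = n^{-1/2}.
\]
With $n = 10^{4}$ parameters:
\[
n^{-1} = 10^{-4},
\qquad
n^{-1/2} = 10^{-2}.
\]
Setting $s = 1$ gives $n^{-(s+1)/d} = n^{-2/d}$, which matches the statement that a height-$1$ NestNet
degenerates to a standard network.
```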
    Fast Bayesian Coresets via Subsampling and Quasi-Newton Refinement. (arXiv:2203.09675v3 [stat.ML] UPDATED)
    Bayesian coresets approximate a posterior distribution by building a small weighted subset of the data points. Any inference procedure that is too computationally expensive to be run on the full posterior can instead be run inexpensively on the coreset, with results that approximate those on the full data. However, current approaches are limited by either a significant run-time or the need for the user to specify a low-cost approximation to the full posterior. We propose a Bayesian coreset construction algorithm that first selects a uniformly random subset of data, and then optimizes the weights using a novel quasi-Newton method. Our algorithm is a simple-to-implement, black-box method that does not require the user to specify a low-cost posterior approximation. It is the first to come with a general high-probability bound on the KL divergence of the output coreset posterior. Experiments demonstrate that our method provides significant improvements in coreset quality against alternatives with comparable construction times, with far less storage cost and user input required.  ( 2 min )
    Forecasting Market Changes using Variational Inference. (arXiv:2205.00605v2 [q-fin.ST] UPDATED)
    Though various approaches have been considered, forecasting near-term market changes of equities and similar market data remains quite difficult. In this paper we introduce an approach to forecast near-term market changes for equity indices as well as portfolios using variational inference (VI). VI is a machine learning approach which uses optimization techniques to estimate complex probability densities. In the proposed approach, clusters of explanatory variables are identified and market changes are forecast based on cluster-specific linear regression. Apart from the expected value of changes, the proposed approach can also be used to obtain the distribution of possible outcomes. Another advantage of the proposed approach is the clear model interpretation, as clusters of explanatory variables (or market regimes) are identified for which the future changes follow similar relationships. Knowledge about such clusters can provide useful insights about portfolio performance and identify the relative importance of variables in different market regimes. An illustrative example of predicting one-day S&P change is considered and it is shown that even with as few as three explanatory variables, the proposed approach provides useful predictions.  ( 2 min )
    Random Planted Forest: a directly interpretable tree ensemble. (arXiv:2012.14563v2 [stat.ML] UPDATED)
    We introduce a novel interpretable, tree-based algorithm for prediction in a regression setting in which each tree in a classical random forest is replaced by a family of planted trees that grow simultaneously. The motivation for our algorithm is to estimate the unknown regression function from a functional decomposition perspective, where each tree corresponds to a function within that decomposition. The maximal order of approximation in the decomposition can be specified or left unlimited. If a first order approximation is chosen, the result is an additive model. In the other extreme case, if the order of approximation is not limited, the resulting model places no restrictions on the form of the regression function. In a simulation study we find encouraging prediction and visualisation properties of our random planted forest method. We also develop theory for an idealised version of random planted forests in cases where the maximal order of approximation is low. We show that if the order is smaller than three, the idealised version achieves asymptotically optimal convergence rates up to a logarithmic factor. Code is available at https://github.com/PlantedML/randomPlantedForest  ( 2 min )
    The Unbalanced Gromov Wasserstein Distance: Conic Formulation and Relaxation. (arXiv:2009.04266v3 [math.OC] UPDATED)
    Comparing metric measure spaces (i.e. a metric space endowed with a probability distribution) is at the heart of many machine learning problems. The most popular distance between such metric measure spaces is the Gromov-Wasserstein (GW) distance, which is the solution of a quadratic assignment problem. The GW distance is however limited to the comparison of metric measure spaces endowed with a probability distribution. To alleviate this issue, we introduce two Unbalanced Gromov-Wasserstein formulations: a distance and a more tractable upper-bounding relaxation. They both allow the comparison of metric spaces equipped with arbitrary positive measures up to isometries. The first formulation is a positive and definite divergence based on a relaxation of the mass conservation constraint using a novel type of quadratically-homogeneous divergence. This divergence works hand in hand with the entropic regularization approach which is popular to solve large scale optimal transport problems. We show that the underlying non-convex optimization problem can be efficiently tackled using a highly parallelizable and GPU-friendly iterative scheme. The second formulation is a distance between mm-spaces up to isometries based on a conic lifting. Lastly, we provide numerical experiments on synthetic examples and domain adaptation data with a Positive-Unlabeled learning task to highlight the salient features of the unbalanced divergence and its potential applications in ML.  ( 2 min )
    A Concentration of Measure Framework to study convex problems and other implicit formulation problems in machine learning. (arXiv:2010.09877v2 [math.PR] UPDATED)
    This paper provides a framework to show the concentration of solutions $Y^*$ to convex minimization problems where the objective function $\phi(X)(Y)$ depends on some random vector $X$ satisfying concentration of measure hypotheses. More precisely, the convex problem translates into a contractive fixed point equation that ensures the transmission of the concentration from $X$ to $Y^*$. This result is of central interest to characterize many machine learning algorithms which are defined through implicit equations (e.g., logistic regression, lasso, boosting, etc.). Based on our framework, we provide precise estimations for the first moments of the solution $Y^*$, when $X= (x_1,\ldots, x_n)$ is a data matrix of independent columns and $\phi(X)(Y)$ writes as a sum $\frac{1}{n}\sum_{i=1}^n h_i(x_i^TY)$. This allows us to describe the behavior and performance (e.g., generalization error) of a wide variety of machine learning classifiers.  ( 2 min )
    Temporal-Logic-Based Reward Shaping for Continuing Reinforcement Learning Tasks. (arXiv:2007.01498v2 [cs.AI] UPDATED)
    In continuing tasks, average-reward reinforcement learning may be a more appropriate problem formulation than the more common discounted reward formulation. As usual, learning an optimal policy in this setting typically requires a large amount of training experiences. Reward shaping is a common approach for incorporating domain knowledge into reinforcement learning in order to speed up convergence to an optimal policy. However, to the best of our knowledge, the theoretical properties of reward shaping have thus far only been established in the discounted setting. This paper presents the first reward shaping framework for average-reward learning and proves that, under standard assumptions, the optimal policy under the original reward function can be recovered. In order to avoid the need for manual construction of the shaping function, we introduce a method for utilizing domain knowledge expressed as a temporal logic formula. The formula is automatically translated to a shaping function that provides additional reward throughout the learning process. We evaluate the proposed method on three continuing tasks. In all cases, shaping speeds up the average-reward learning rate without any reduction in the performance of the learned policy compared to relevant baselines.  ( 2 min )
    Necessary and Sufficient Conditions for Inverse Reinforcement Learning of Bayesian Stopping Time Problems. (arXiv:2007.03481v5 [cs.LG] UPDATED)
    This paper presents an inverse reinforcement learning (IRL) framework for Bayesian stopping time problems. By observing the actions of a Bayesian decision maker, we provide a necessary and sufficient condition to identify if these actions are consistent with optimizing a cost function. In a Bayesian (partially observed) setting, the inverse learner can at best identify optimality with respect to the observed actions. Our IRL algorithm identifies optimality and then constructs set-valued estimates of the cost function. To achieve this IRL objective, we use novel ideas from Bayesian revealed preferences stemming from microeconomics. We illustrate the proposed IRL scheme using two important examples of stopping time problems, namely, sequential hypothesis testing and Bayesian search, and also on a real-world YouTube dataset. Finally, for finite datasets, we propose an IRL detection algorithm and give finite sample bounds on its error probabilities.  ( 2 min )
    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Langevin algorithms. (arXiv:2006.14514v4 [cs.LG] UPDATED)
    Artificial neural networks (ANNs) are typically highly nonlinear systems which are finely tuned via the optimization of their associated, non-convex loss functions. In many cases, the gradient of any such loss function has superlinear growth, making the use of the widely-accepted (stochastic) gradient descent methods, which are based on Euler numerical schemes, problematic. We offer a new learning algorithm based on an appropriately constructed variant of the popular stochastic gradient Langevin dynamics (SGLD), which is called tamed unadjusted stochastic Langevin algorithm (TUSLA). We also provide a nonasymptotic analysis of the new algorithm's convergence properties in the context of non-convex learning problems with the use of ANNs. Thus, we provide finite-time guarantees for TUSLA to find approximate minimizers of both empirical and population risks. The roots of the TUSLA algorithm are based on the taming technology for diffusion processes with superlinear coefficients as developed in \citet{tamed-euler, SabanisAoAP} and for MCMC algorithms in \citet{tula}. Numerical experiments are presented which confirm the theoretical findings and illustrate the need for the use of the new algorithm in comparison to vanilla SGLD within the framework of ANNs.  ( 2 min )
    Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning. (arXiv:1909.05850v6 [stat.ML] UPDATED)
    Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly-off-policy evaluation is feasible, even with just one dependent trajectory, and provide the limits of how well we could hope to do. We develop a new estimator based on Double Reinforcement Learning (DRL) that leverages this structure for OPE using the efficient influence function we derive. Our DRL estimator simultaneously uses estimated stationary density ratios and $q$-functions; it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.  ( 2 min )
    Decentralized Exploration in Multi-Armed Bandits -- Extended version. (arXiv:1811.07763v6 [cs.LG] UPDATED)
    We consider the decentralized exploration problem: a set of players collaborate to identify the best arm by asynchronously interacting with the same stochastic environment. The objective is to ensure privacy in the best arm identification problem between asynchronous, collaborative, and thrifty players. In the context of a digital service, we advocate that this decentralized approach allows a good balance between the interests of users and those of service providers: the providers optimize their services, while protecting the privacy of the users and saving resources. We define the privacy level as the amount of information an adversary could infer by intercepting the messages concerning a single user. We provide a generic algorithm, Decentralized Elimination, which uses any best arm identification algorithm as a subroutine. We prove that this algorithm ensures privacy, with a low communication cost, and that in comparison to the lower bound of the best arm identification problem, its sample complexity suffers from a penalty depending on the inverse of the probability of the most frequent players. Then, thanks to the genericity of the approach, we extend the proposed algorithm to non-stationary bandits. Finally, experiments illustrate and complete the analysis.  ( 2 min )
    Martingale Methods for Sequential Estimation of Convex Functionals and Divergences. (arXiv:2103.09267v3 [math.ST] UPDATED)
    We present a unified technique for sequential estimation of convex divergences between distributions, including integral probability metrics like the kernel maximum mean discrepancy, $\varphi$-divergences like the Kullback-Leibler divergence, and optimal transport costs, such as powers of Wasserstein distances. This is achieved by observing that empirical convex divergences are (partially ordered) reverse submartingales with respect to the exchangeable filtration, coupled with maximal inequalities for such processes. These techniques appear to be complementary and powerful additions to the existing literature on both confidence sequences and convex divergences. We construct an offline-to-sequential device that converts a wide array of existing offline concentration inequalities into time-uniform confidence sequences that can be continuously monitored, providing valid tests or confidence intervals at arbitrary stopping times. The resulting sequential bounds pay only an iterated logarithmic price over the corresponding fixed-time bounds, retaining the same dependence on problem parameters (like dimension or alphabet size if applicable). These results are also applicable to more general convex functionals, like the negative differential entropy, suprema of empirical processes, and V-Statistics.  ( 2 min )
    Robust Max Entrywise Error Bounds for Tensor Estimation from Sparse Observations via Similarity Based Collaborative Filtering. (arXiv:1908.01241v4 [cs.LG] UPDATED)
    Consider the task of estimating a 3-order $n \times n \times n$ tensor from noisy observations of randomly chosen entries in the sparse regime. We introduce a similarity based collaborative filtering algorithm for estimating a tensor from sparse observations and argue that it achieves sample complexity that nearly matches the conjectured computationally efficient lower bound on the sample complexity for the setting of low-rank tensors. Our algorithm uses the matrix obtained from the flattened tensor to compute similarity, and estimates the tensor entries using a nearest neighbor estimator. We prove that the algorithm recovers a finite rank tensor with maximum entry-wise error (MEE) and mean-squared-error (MSE) decaying to $0$ as long as each entry is observed independently with probability $p = \Omega(n^{-3/2 + \kappa})$ for any arbitrarily small $\kappa > 0$. More generally, we establish robustness of the estimator, showing that when arbitrary noise bounded by $\varepsilon \geq 0$ is added to each observation, the estimation error with respect to MEE and MSE degrades by $\text{poly}(\varepsilon)$. Consequently, even if the tensor may not have finite rank but can be approximated within $\varepsilon \geq 0$ by a finite rank tensor, then the estimation error converges to $\text{poly}(\varepsilon)$. Our analysis sheds insight into the conjectured sample complexity lower bound, showing that it matches the connectivity threshold of the graph used by our algorithm for estimating similarity between coordinates.  ( 2 min )
    Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks. (arXiv:1906.04893v2 [cs.LG] UPDATED)
    Tight estimation of the Lipschitz constant for deep neural networks (DNNs) is useful in many applications ranging from robustness certification of classifiers to stability analysis of closed-loop systems with reinforcement learning controllers. Existing methods in the literature for estimating the Lipschitz constant suffer from either lack of accuracy or poor scalability. In this paper, we present a convex optimization framework to compute guaranteed upper bounds on the Lipschitz constant of DNNs both accurately and efficiently. Our main idea is to interpret activation functions as gradients of convex potential functions. Hence, they satisfy certain properties that can be described by quadratic constraints. This particular description allows us to pose the Lipschitz constant estimation problem as a semidefinite program (SDP). The resulting SDP can be adapted to increase either the estimation accuracy (by capturing the interaction between activation functions of different layers) or scalability (by decomposition and parallel implementation). We illustrate the utility of our approach with a variety of experiments on randomly generated networks and on classifiers trained on the MNIST and Iris datasets. In particular, we experimentally demonstrate that our Lipschitz bounds are the most accurate compared to those in the literature. We also study the impact of adversarial training methods on the Lipschitz bounds of the resulting classifiers and show that our bounds can be used to efficiently provide robustness guarantees.  ( 2 min )
    On the role of Model Uncertainties in Bayesian Optimization. (arXiv:2301.05983v1 [stat.ML])
    Bayesian optimization (BO) is a popular method for black-box optimization, which relies on uncertainty as part of its decision-making process when deciding which experiment to perform next. However, not much work has addressed the effect of uncertainty on the performance of the BO algorithm and to what extent calibrated uncertainties improve the ability to find the global optimum. In this work, we provide an extensive study of the relationship between the BO performance (regret) and uncertainty calibration for popular surrogate models and compare them across both synthetic and real-world experiments. Our results confirm that Gaussian Processes are strong surrogate models and that they tend to outperform other popular models. Our results further show a positive association between calibration error and regret, but interestingly, this association disappears when we control for the type of model in the analysis. We also studied the effect of re-calibration and demonstrate that it generally does not lead to improved regret. Finally, we provide theoretical justification for why uncertainty calibration might be difficult to combine with BO due to the small sample sizes commonly used.  ( 2 min )
    Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning. (arXiv:2301.07067v1 [cs.LG])
    In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. This implicit training is in contrast to explicitly tuning the model weights based on examples. In this work, we formalize in-context learning as an algorithm learning problem, treating the transformer model as a learning algorithm that can be specialized via training to implement, at inference time, another target algorithm. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer, which holds under mild assumptions. Secondly, we use our abstraction to show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes. We provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) identify an inductive bias phenomenon where the transfer risk on unseen tasks is independent of the transformer complexity, and (3) empirically verify our theoretical predictions.  ( 2 min )
    A Fast Algorithm for Adaptive Private Mean Estimation. (arXiv:2301.07078v1 [stat.ML])
    We design an $(\varepsilon, \delta)$-differentially private algorithm to estimate the mean of a $d$-variate distribution, with unknown covariance $\Sigma$, that is adaptive to $\Sigma$. To within polylogarithmic factors, the estimator achieves optimal rates of convergence with respect to the induced Mahalanobis norm $||\cdot||_\Sigma$, takes time $\tilde{O}(n d^2)$ to compute, has near linear sample complexity for sub-Gaussian distributions, allows $\Sigma$ to be degenerate or low rank, and adaptively extends beyond sub-Gaussianity. Prior to this work, other methods required exponential computation time or the superlinear scaling $n = \Omega(d^{3/2})$ to achieve non-trivial error with respect to the norm $||\cdot||_\Sigma$.  ( 2 min )
    MAFUS: a Framework to predict mortality risk in MAFLD subjects. (arXiv:2301.06908v1 [stat.ML])
    Metabolic (dysfunction) associated fatty liver disease (MAFLD) establishes new criteria for diagnosing fatty liver disease independent of alcohol consumption and concurrent viral hepatitis infection. However, data on the long-term outcome of MAFLD subjects are sparse. Few articles focus on mortality in MAFLD subjects, and none investigate how to predict a fatal outcome. In this paper, we propose an artificial intelligence-based framework named MAFUS that physicians can use for predicting mortality in MAFLD subjects. The framework applies Machine Learning (ML) algorithms to data from various anthropometric and biochemical sources. The framework has been tested on a state-of-the-art dataset on which five ML algorithms are trained. Support Vector Machines (SVM) proved to be the best model. Furthermore, an Explainable Artificial Intelligence (XAI) analysis has been performed to understand the SVM diagnostic reasoning and the contribution of each feature to the prediction. The MAFUS framework is easy to apply, and the required parameters are readily available in the dataset.  ( 2 min )
    Enhancing Deep Traffic Forecasting Models with Dynamic Regression. (arXiv:2301.06650v1 [cs.LG])
    A common assumption in deep learning-based multivariate and multistep traffic time series forecasting models is that residuals are independent, isotropic, and uncorrelated in space and time. While this assumption provides a straightforward loss function (such as MAE/MSE), it is inevitable that residual processes will exhibit strong autocorrelation and structured spatiotemporal correlation. In this paper, we propose a complementary dynamic regression (DR) framework to enhance existing deep multistep traffic forecasting frameworks through structured specifications and learning for the residual process. Specifically, we assume the residuals of the base model (i.e., a well-developed traffic forecasting model) are governed by a matrix-variate seasonal autoregressive (AR) model, which can be seamlessly integrated into the training process by redesigning the overall loss function. Parameters in the DR framework can be jointly learned with the base model. We evaluate the effectiveness of the proposed framework in enhancing several state-of-the-art deep traffic forecasting models on both speed and flow datasets. Our experiment results show that the DR framework not only improves existing traffic forecasting models but also offers interpretable regression coefficients and spatiotemporal covariance matrices.  ( 2 min )
    Neural Operator Framework for Digital Twin and Complex Engineering Systems. (arXiv:2301.06701v1 [cs.LG])
    With modern computational advancements and statistical analysis methods, machine learning algorithms have become a vital part of engineering modeling. Neural Operator Networks (ONets) are an emerging class of machine learning algorithms that serve as a "faster surrogate" for approximating solutions to partial differential equations (PDEs), owing to their ability to approximate mathematical operators, in contrast to the direct function approximation of Neural Networks (NN). ONets use the Universal Approximation Theorem to map finite-dimensional inputs to infinite-dimensional space using the branch-trunk architecture, which encodes domain and feature information separately before using a dot product to combine the information. ONets are expected to occupy a vital niche for surrogate modeling in physical systems and Digital Twin (DT) development. Three test cases are evaluated using ONets for operator approximation, including a 1-dimensional ordinary differential equation (ODE), a general diffusion system, and a convection-diffusion (Burgers) system. Solutions for the ODE and diffusion systems yield accurate and reliable results ($R^2 > 0.95$), while solutions for the Burgers system need further refinement in the ONet algorithm.  ( 2 min )
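    The branch-trunk combination described in the abstract above can be sketched as follows. This is a minimal illustration with untrained random weights, not the authors' implementation; the layer sizes, the `tanh` nonlinearity, and the names `init`, `mlp`, and `onet` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(sizes):
    # Random (untrained) dense layers; sizes are illustrative assumptions.
    return [(rng.normal(scale=0.5, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    # Plain feed-forward pass with tanh on hidden layers.
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return x @ W + b

m_sensors, p = 20, 16               # input samples per function, latent width
branch = init([m_sensors, 32, p])   # encodes the sampled input function
trunk = init([1, 32, p])            # encodes the query location in the domain

def onet(u_samples, y):
    b = mlp(branch, u_samples)      # shape (batch, p)
    t = mlp(trunk, y)               # shape (n_points, p)
    return b @ t.T                  # dot product over p -> (batch, n_points)

u = rng.normal(size=(4, m_sensors))        # 4 input functions at 20 sensors
y = np.linspace(0.0, 1.0, 50)[:, None]     # 50 query points in [0, 1]
out = onet(u, y)
```

    Each row of `out` holds one predicted output function evaluated at the 50 query points; training the two encoders is omitted here.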
    $Ae^2I$: A Double Autoencoder for Imputation of Missing Values. (arXiv:2301.06633v1 [cs.LG])
    The most common strategy for imputing missing values in a table is to study either the column-column relationship or the row-row relationship of the data table, then use the relationship to impute the missing values based on the non-missing values from other columns of the same row, or from the other rows of the same column. This paper introduces a double autoencoder for imputation ($Ae^2I$) that simultaneously and collaboratively uses both the row-row relationship and the column-column relationship to impute the missing values. Empirical tests on the MovieLens 1M dataset demonstrated that $Ae^2I$ outperforms the current state-of-the-art models for recommender systems by a significant margin.  ( 2 min )
    Deep Conditional Measure Quantization. (arXiv:2301.06907v1 [stat.ML])
    The quantization of a (probability) measure consists of replacing it by a sum of Dirac masses that is close enough to it (in some metric space of probability measures). Various methods exist to do so, but the situation of quantizing a conditional law has been less explored. We propose a method, called DCMQ, involving a Huber-energy kernel-based approach coupled with a deep neural network architecture. The method is tested on several examples and obtains promising results.  ( 2 min )
    Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency. (arXiv:2301.06240v1 [math.ST])
    We study optimal procedures for estimating a linear functional based on observational data. In many problems of this kind, a widely used assumption is strict overlap, i.e., uniform boundedness of the importance ratio, which measures how well the observational data covers the directions of interest. When it is violated, the classical semi-parametric efficiency bound can easily become infinite, so that the instance-optimal risk depends on the function class used to model the regression function. For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error in estimating a broad class of linear functionals. This lower bound refines the classical semi-parametric one, and makes connections to moduli of continuity in functional estimation. When $\mathcal{F}$ is a reproducing kernel Hilbert space, we prove that this lower bound can be achieved up to a constant factor by analyzing a computationally simple regression estimator. We apply our general results to various families of examples, thereby uncovering a spectrum of rates that interpolate between the classical theories of semi-parametric efficiency (with $\sqrt{n}$-consistency) and the slower minimax rates associated with non-parametric function estimation.  ( 2 min )
    Theoretical and computational aspects of robust optimal transportation, with applications to statistics and machine learning. (arXiv:2301.06297v1 [math.ST])
    Optimal transport (OT) theory and the related $p$-Wasserstein distance ($W_p$, $p\geq 1$) are popular tools in statistics and machine learning. Recent studies have remarked that inference based on OT and on $W_p$ is sensitive to outliers. To cope with this issue, we work on a robust version of the primal OT problem (ROBOT) and show that it defines a robust version of $W_1$, called robust Wasserstein distance, which is able to downweight the impact of outliers. We study properties of this novel distance and use it to define minimum distance estimators. Our novel estimators do not impose any moment restrictions: this allows us to extend the use of OT methods to inference on heavy-tailed distributions. We also provide statistical guarantees of the proposed estimators. Moreover, we derive the dual form of the ROBOT and illustrate its applicability to machine learning. Numerical exercises (see also the supplementary material) provide evidence of the benefits yielded by our methods.  ( 2 min )
    Doubly Robust Counterfactual Classification. (arXiv:2301.06199v1 [cs.LG])
    We study counterfactual classification as a new tool for decision-making under hypothetical (contrary to fact) scenarios. We propose a doubly-robust nonparametric estimator for a general counterfactual classifier, where we can incorporate flexible constraints by casting the classification problem as a nonlinear mathematical program involving counterfactuals. We go on to analyze the rates of convergence of the estimator and provide a closed-form expression for its asymptotic distribution. Our analysis shows that the proposed estimator is robust against nuisance model misspecification, and can attain fast $\sqrt{n}$ rates with tractable inference even when using nonparametric machine learning approaches. We study the empirical performance of our methods by simulation and apply them for recidivism risk prediction.  ( 2 min )
    Data-aware customization of activation functions reduces neural network error. (arXiv:2301.06635v1 [cs.LG])
    Activation functions play critical roles in neural networks, yet current off-the-shelf neural networks pay little attention to the specific choice of activation functions used. Here we show that data-aware customization of activation functions can result in striking reductions in neural network error. We first give a simple linear algebraic explanation of the role of activation functions in neural networks; then, through connection with the Diaconis-Shahshahani Approximation Theorem, we propose a set of criteria for good activation functions. As a case study, we consider regression tasks with a partially exchangeable target function, i.e., $f(u,v,w)=f(v,u,w)$ for $u,v\in \mathbb{R}^d$ and $w\in \mathbb{R}^k$, and prove that for such a target function, using an even activation function in at least one of the layers guarantees that the prediction preserves partial exchangeability for best performance. Since even activation functions are seldom used in practice, we designed the "seagull" even activation function $\log(1+x^2)$ according to our criteria. Empirical testing on over two dozen 9-25 dimensional examples with different local smoothness, curvature, and degree of exchangeability revealed that a simple substitution with the "seagull" activation function in an already-refined neural network can lead to an order-of-magnitude reduction in error. This improvement was most pronounced when the activation function substitution was applied to the layer in which the exchangeable variables are connected for the first time. While the improvement is greatest for low-dimensional data, experiments on the CIFAR10 image classification dataset showed that use of "seagull" can reduce error even for high-dimensional cases. These results collectively highlight the potential of customizing activation functions as a general approach to improve neural network performance.  ( 2 min )
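    The activation function named in the abstract above, $\log(1+x^2)$, is simple enough to write down directly. The sketch below only defines it and checks that it is even; the function name `seagull` follows the abstract, while the NumPy implementation and the test grid are assumptions made for illustration.

```python
import numpy as np

def seagull(x):
    # Even activation log(1 + x^2); log1p keeps precision for small x.
    x = np.asarray(x, dtype=float)
    return np.log1p(x * x)

# Evenness check: seagull(x) equals seagull(-x) on a symmetric grid.
x = np.linspace(-3.0, 3.0, 7)
is_even = np.allclose(seagull(x), seagull(-x))
```

    Because the function is even, swapping the sign of a pre-activation leaves the output unchanged, which is the property the abstract links to partial exchangeability.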
    Asymptotic normality and optimality in nonsmooth stochastic approximation. (arXiv:2301.06632v1 [math.OC])
    In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of Hájek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.  ( 2 min )
    Geometric ergodicity of SGLD via reflection coupling. (arXiv:2301.06769v1 [math.PR])
    We consider the geometric ergodicity of the Stochastic Gradient Langevin Dynamics (SGLD) algorithm under nonconvexity settings. Via the technique of reflection coupling, we prove the Wasserstein contraction of SGLD when the target distribution is log-concave only outside some compact set. The time discretization and the minibatch in SGLD introduce several difficulties when applying the reflection coupling, which are addressed by a series of careful estimates of conditional expectations. As a direct corollary, the SGLD with constant step size has an invariant distribution and we are able to obtain its geometric ergodicity in terms of $W_1$ distance. The generalization to non-gradient drifts is also included.  ( 2 min )
    Case-Base Neural Networks: survival analysis with time-varying, higher-order interactions. (arXiv:2301.06535v1 [stat.ML])
    Neural network-based survival methods can model data-driven covariate interactions. While these methods have led to better predictive performance than regression-based approaches, they cannot model both time-varying interactions and complex baseline hazards. To address this, we propose Case-Base Neural Networks (CBNN) as a new approach that combines the case-base sampling framework with flexible architectures. Our method naturally accounts for censoring and does not require method-specific hyperparameters. Using a novel sampling scheme and data augmentation, we incorporate time directly into a feed-forward neural network. CBNN predicts the probability of an event occurring at a given moment and estimates the hazard function. We compare the performance of CBNN to survival methods based on regression and neural networks in two simulations and two real data applications. We report two time-dependent metrics for each model. In the simulations and real data applications, CBNN provides a more consistent predictive performance across time and outperforms the competing neural network approaches. For a simple simulation with an exponential hazard model, CBNN outperforms the other neural network methods. For a complex simulation, which highlights the ability of CBNN to model both a complex baseline hazard and time-varying interactions, CBNN outperforms all competitors. The first real data application shows CBNN outperforming all neural network competitors, while a second real data application shows competitive performance. We highlight the benefit of combining case-base sampling with deep learning to provide a simple and flexible modeling framework for data-driven, time-varying interaction modeling of survival outcomes. An R package is available at https://github.com/Jesse-Islam/cbnn.  ( 2 min )
    Expected Gradients of Maxout Networks and Consequences to Parameter Initialization. (arXiv:2301.06956v1 [stat.ML])
    We study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates a stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK.  ( 2 min )
    Optimal Algorithms for Latent Bandits with Cluster Structure. (arXiv:2301.07040v1 [cs.LG])
    We consider the problem of latent bandits with cluster structure where there are multiple users, each with an associated multi-armed bandit problem. These users are grouped into latent clusters such that the mean reward vectors of users within the same cluster are identical. At each round, a user, selected uniformly at random, pulls an arm and observes a corresponding noisy reward. The goal of the users is to maximize their cumulative rewards. This problem is central to practical recommendation systems and has received wide attention of late \cite{gentile2014online, maillard2014latent}. Now, if each user acts independently, then they would have to explore each arm independently and a regret of $\Omega(\sqrt{\mathsf{MNT}})$ is unavoidable, where $\mathsf{M}, \mathsf{N}$ are the number of arms and users, respectively. Instead, we propose LATTICE (Latent bAndiTs via maTrIx ComplEtion) which allows exploitation of the latent cluster structure to provide the minimax optimal regret of $\widetilde{O}(\sqrt{(\mathsf{M}+\mathsf{N})\mathsf{T}})$, when the number of clusters is $\widetilde{O}(1)$. This is the first algorithm to guarantee such a strong regret bound. LATTICE is based on a careful exploitation of arm information within a cluster while simultaneously clustering users. Furthermore, it is computationally efficient and requires only $O(\log{\mathsf{T}})$ calls to an offline matrix completion oracle across all $\mathsf{T}$ rounds.  ( 2 min )
    From Risk Prediction to Risk Factors Interpretation. Comparison of Neural Networks and Classical Statistics for Dementia Prediction. (arXiv:2301.06995v1 [stat.AP])
    It is proposed to investigate the onset of a disease D, based on several risk factors, with a specific interest in the occurrence of Alzheimer's disease. For that purpose, two classes of techniques are available, whose properties are quite different in terms of interpretation, which is the focus of this paper: classical statistics based on probabilistic models and artificial intelligence (mainly neural networks) based on optimization algorithms. Both methods are good at prediction, with a preference for neural networks when the dimension of the potential predictors is high. But the advantage of classical statistics is cognitive: the role of each factor is generally summarized in the value of a coefficient which is highly positive for a harmful factor, close to 0 for an irrelevant one, and highly negative for a beneficial one.  ( 2 min )
    A Coreset Learning Reality Check. (arXiv:2301.06163v1 [cs.LG])
    Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.  ( 2 min )
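    The uniform subsampling baseline mentioned in the abstract above can be sketched as below. This is an illustrative assumption of how such a baseline is commonly set up (sampling rows without replacement and reweighting each by n/m so the subsample loss estimates the full-data loss), not the paper's evaluation code; the names `uniform_subsample` and `weighted_logistic_loss` are hypothetical.

```python
import numpy as np

def uniform_subsample(X, y, m, rng):
    # Draw m of the n rows uniformly without replacement.
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    # Each kept row stands in for n/m original rows.
    weights = np.full(m, n / m)
    return X[idx], y[idx], weights

def weighted_logistic_loss(theta, X, y, w):
    # Labels y in {0, 1}; log(1 + exp(z)) - y*z, computed stably.
    z = X @ theta
    return np.sum(w * (np.logaddexp(0.0, z) - y * z))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.5).astype(float)
Xs, ys, ws = uniform_subsample(X, y, 100, rng)

theta = np.zeros(5)
full_loss = weighted_logistic_loss(theta, X, y, np.ones(1000))
sub_loss = weighted_logistic_loss(theta, Xs, ys, ws)
```

    At `theta = 0` every row contributes log 2 regardless of its label, so the reweighted subsample loss matches the full-data loss exactly; at other parameter values it is only an estimate.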
    GAR: Generalized Autoregression for Multi-Fidelity Fusion. (arXiv:2301.05729v1 [stat.ML])
    In many scientific research and engineering applications where repeated simulations of complex systems are conducted, a surrogate is commonly adopted to quickly estimate the whole system. To reduce the expensive cost of generating training examples, it has become a promising approach to combine the results of low-fidelity (fast but inaccurate) and high-fidelity (slow but accurate) simulations. Despite the fast developments of multi-fidelity fusion techniques, most existing methods require particular data structures and do not scale well to high-dimensional output. To resolve these issues, we generalize the classic autoregression (AR), which is widely used due to its simplicity, robustness, accuracy, and tractability, and propose generalized autoregression (GAR) using tensor formulation and latent features. GAR can deal with arbitrary dimensional outputs and arbitrary multi-fidelity data structures to satisfy the demand of multi-fidelity fusion for complex problems; it admits a fully tractable likelihood and posterior requiring no approximate inference and scales well to high-dimensional problems. Furthermore, we prove the autokrigeability theorem based on GAR in the multi-fidelity case and develop CIGAR, a simplified GAR with exact predictive mean accuracy and a computation reduction by a factor of $d^3$, where $d$ is the dimensionality of the output. The empirical assessment includes many canonical PDEs and real scientific examples and demonstrates that the proposed method consistently outperforms the SOTA methods by a large margin (up to 6x improvement in RMSE) with only a couple of high-fidelity training samples.  ( 2 min )
    Calibrated Data-Dependent Constraints with Exact Satisfaction Guarantees. (arXiv:2301.06195v1 [stat.ML])
    We consider the task of training machine learning models with data-dependent constraints. Such constraints often arise as empirical versions of expected value constraints that enforce fairness or stability goals. We reformulate data-dependent constraints so that they are calibrated: enforcing the reformulated constraints guarantees that their expected value counterparts are satisfied with a user-prescribed probability. The resulting optimization problem is amenable to standard stochastic optimization algorithms, and we demonstrate the efficacy of our method on a fairness-sensitive classification task where we wish to guarantee the classifier's fairness (at test time).  ( 2 min )
    Tale of two c(omplex)ities. (arXiv:2301.06259v1 [math.ST])
    For decades, best subset selection (BSS) has eluded statisticians mainly due to its computational bottleneck. However, modern computational breakthroughs have recently rekindled theoretical interest in BSS and have led to new findings. Recently, Guo et al. (2020) showed that the model selection performance of BSS is governed by a margin quantity that is robust to the design dependence, unlike modern methods such as LASSO, SCAD, MCP, etc. Motivated by their theoretical results, in this paper, we also study the variable selection properties of best subset selection in the high-dimensional sparse linear regression setup. We show that apart from the identifiability margin, the following two complexity measures play a fundamental role in characterizing the margin condition for model consistency: (a) complexity of residualized features, (b) complexity of spurious projections. In particular, we establish a simple margin condition that depends only on the identifiability margin quantity and the dominating one of the two complexity measures. Furthermore, we show that a similar margin condition depending on similar margin quantity and complexity measures is also necessary for model consistency of BSS. For a broader understanding of the complexity measures, we also consider some simple illustrative examples to demonstrate the variation in the complexity measures which broadens our theoretical understanding of the model selection performance of BSS under different correlation structures.  ( 2 min )
    Scaling Deep Networks with the Mesh Adaptive Direct Search algorithm. (arXiv:2301.06641v1 [stat.ML])
    Deep neural networks are getting larger. Their implementation on edge and IoT devices has become more challenging, which has moved the community to design lighter versions with similar performance. Standard automatic design tools such as reinforcement learning and evolutionary computing fundamentally rely on cheap evaluations of an objective function. In the neural network design context, this objective is the accuracy after training, which is expensive and time-consuming to evaluate. We automate the design of a light deep neural network for image classification using the Mesh Adaptive Direct Search (MADS) algorithm, a mature derivative-free optimization method that effectively accounts for the expensive blackbox nature of the objective function to explore the design space, even in the presence of constraints. Our tests show competitive compression rates with reduced numbers of trials.  ( 2 min )
    Intrinsic Gaussian Process on Unknown Manifolds with Probabilistic Metrics. (arXiv:2301.06533v1 [stat.ML])
    This article presents a novel approach to construct Intrinsic Gaussian Processes for regression on unknown manifolds with probabilistic metrics (GPUM) in point clouds. In many real world applications, one often encounters high dimensional data (e.g. point cloud data) centred around some lower dimensional unknown manifolds. The geometry of a manifold is in general different from the usual Euclidean geometry. Naively applying traditional smoothing methods such as Euclidean Gaussian Processes (GPs) to manifold valued data and so ignoring the geometry of the space can potentially lead to highly misleading predictions and inferences. A manifold embedded in a high dimensional Euclidean space can be well described by a probabilistic mapping function and the corresponding latent space. We investigate the geometrical structure of the unknown manifolds using the Bayesian Gaussian Processes latent variable models (BGPLVM) and Riemannian geometry. The distribution of the metric tensor is learned using BGPLVM. The boundary of the resulting manifold is defined based on the uncertainty quantification of the mapping. We use the probabilistic metric tensor to simulate Brownian Motion paths on the unknown manifold. The heat kernel is estimated as the transition density of Brownian Motion and used as the covariance functions of GPUM. The applications of GPUM are illustrated in the simulation studies on the Swiss roll, high dimensional real datasets of WiFi signals and image data examples. Its performance is compared with the Graph Laplacian GP, Graph Matern GP and Euclidean GP.  ( 2 min )
    A domain-decomposed VAE method for Bayesian inverse problems. (arXiv:2301.05708v1 [stat.ML])
    Bayesian inverse problems are often computationally challenging when the forward model is governed by complex partial differential equations (PDEs). This is typically caused by expensive forward model evaluations and high-dimensional parameterization of priors. This paper proposes a domain-decomposed variational auto-encoder Markov chain Monte Carlo (DD-VAE-MCMC) method to tackle these challenges simultaneously. Through partitioning the global physical domain into small subdomains, the proposed method first constructs local deterministic generative models based on local historical data, which provide efficient local prior representations. Gaussian process models with active learning address the domain decomposition interface conditions. Then inversions are conducted on each subdomain independently in parallel and in low-dimensional latent parameter spaces. The local inference solutions are post-processed through the Poisson image blending procedure to result in an efficient global inference result. Numerical examples are provided to demonstrate the performance of the proposed method.  ( 2 min )
    Compress Then Test: Powerful Kernel Testing in Near-linear Time. (arXiv:2301.05974v1 [stat.ML])
    Kernel two-sample testing provides a powerful framework for distinguishing any pair of distributions based on $n$ sample points. However, existing kernel tests either run in $n^2$ time or sacrifice undue power to improve runtime. To address these shortcomings, we introduce Compress Then Test (CTT), a new framework for high-powered kernel testing based on sample compression. CTT cheaply approximates an expensive test by compressing each $n$ point sample into a small but provably high-fidelity coreset. For standard kernels and subexponential distributions, CTT inherits the statistical behavior of a quadratic-time test -- recovering the same optimal detection boundary -- while running in near-linear time. We couple these advances with cheaper permutation testing, justified by new power analyses; improved time-vs.-quality guarantees for low-rank approximation; and a fast aggregation procedure for identifying especially discriminating kernels. In our experiments with real and simulated data, CTT and its extensions provide 20--200x speed-ups over state-of-the-art approximate MMD tests with no loss of power.  ( 2 min )
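    As a point of reference for the $n^2$ cost mentioned above, here is a minimal sketch of a quadratic-time kernel two-sample test: a biased MMD estimate with a Gaussian kernel plus a permutation p-value. This is the baseline kind of test the abstract contrasts against, not the CTT method itself; the function names, the 1-D inputs, and the fixed bandwidth are illustrative assumptions.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an assumed default."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD.

    Evaluates the kernel on every pair of points, hence quadratic time.
    """
    def mean_k(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def permutation_p_value(xs, ys, num_permutations=100, seed=0):
    """Permutation p-value: how often a random relabelling of the pooled
    sample gives an MMD at least as large as the observed one."""
    rng = random.Random(seed)
    observed = mmd2_biased(xs, ys)
    pooled = list(xs) + list(ys)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)
        if mmd2_biased(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            count += 1
    return (count + 1) / (num_permutations + 1)
```

    Each call to `mmd2_biased` touches all pairs of points, and the permutation loop repeats that cost once per relabelling, which is the expense that compression-based approaches aim to avoid.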
  • Open

    DreamerV3: Mastering Diverse Domains through World Models
    Highlights  ( 4 min )

  • Open

    Is there an AI API that can relate images to text?
    I have a pretty big catalogue of digital miniatures on Notion, which all have images and tags. The process of tagging each entry is very laborious, so I was wondering if I could use some kind of AI service that can train on my dataset, so that I could use it to automatically fill the tags for me, and also to use it as a search engine. I know some javascript and have some experience using APIs, but I'm not a developer, so I'm looking for something that's relatively easy to use, and that I won't need to learn C# or other advanced programming languages. This is a sample entry from my database submitted by /u/Rayuaz [link] [comments]  ( 43 min )
    Can anyone answer this?
    What are amazon and Google and Bing policies on AI generated text - any test cases or precedents where accounts or sites closed? submitted by /u/TheChalaK- [link] [comments]  ( 43 min )
    These boston dynamics videos just keep getting more and more concerning.
    submitted by /u/Rollyman1 [link] [comments]  ( 43 min )
    What AI can I use to make deepfakes of my voice?
    I've recently been seeing artists feeding AI their voice & having it sing for them. That's pretty cool & I'd like to try it, but all I'm finding are research papers on it and no actual open AI for me to try this. Anyone know where I can access such an AI? submitted by /u/KelkonBajam [link] [comments]  ( 50 min )
    Synthetic data is the future of AI
    https://moez-62905.medium.com/synthetic-data-is-the-future-of-artificial-intelligence-6fcfd2ce1a14 submitted by /u/repeat_or [link] [comments]  ( 46 min )
    A short story about ChatGPT3 killing the giant Google
    submitted by /u/Imagine-your-success [link] [comments]  ( 55 min )
    Generative AI and The Future of Work
    submitted by /u/_utisz_ [link] [comments]  ( 52 min )
    AI is Assisting the UN in Preventing Nuclear War
    submitted by /u/HODLTID [link] [comments]  ( 42 min )
    Anyone know a "zyral, Zyro" AI application?
    Heard it on a stream but can't seem to figure out the proper spelling of it. The streamer uses it from thumbnails. submitted by /u/madalert123 [link] [comments]  ( 43 min )
    Ryan Reynolds Toonified using VToonify
    submitted by /u/oridnary_artist [link] [comments]  ( 42 min )
    ByteDance AI Research Proposes a Novel Self-Supervised Learning Framework to Create High-Quality Stylized 3D Avatars with a Mix of Continuous and Discrete Parameters
    submitted by /u/ai-lover [link] [comments]  ( 43 min )
    Artificial Waifus
    Some programmer on tiktok made an AI waifu (source). He wants to redo the project using a real girl's text messages. Imagine a world where women and men can sell their text histories in order to train better models of themselves to be used by others. This is just the beginning, boys. submitted by /u/crowb1rd [link] [comments]  ( 43 min )
    14 highlights from Sam Altman's interview
    From https://smokingrobot.beehiiv.com/p/sam-altman-interview-strictly-vc ​ On the unexpected progress of AI: Everyone thought at first it comes for physical labor, like working in a factory and then truck driving, then this sort of less demanding cognitive labor, and then the really demanding cognitive labor like computer programming. And then very last of all or maybe never because maybe it's like some deep human special sauce, was creativity. And of course we can look now and say it really looks like it's going to go exactly the opposite direction. On the impact on education and other changes: There are societal changes that ChatGPT is going to cause or is causing. There's I think a big one going now about the impact of this on education, academic integrity, and all of that. But star…  ( 63 min )
    New Microsoft AI can accurately mimic a human voice after analyzing a 3-second sample
    As advancements in artificial intelligence continue to unfold at a rapid pace, it is not uncommon for individuals to express concerns about the potential implications on employment opportunities for human workers. Adding fuel to these concerns is the recent announcement made by a team of researchers at Microsoft, who have developed a new AI system capable of accurately replicating a human voice using only a three-second audio sample. This breakthrough in technology highlights the potential for AI to not only automate a plethora of tasks, but also to potentially replicate human capabilities and skills with increased accuracy and efficiency. The implications of this development are significant, as it raises important questions about the future of work and the role of AI in it. Furthermore, i…  ( 48 min )
    Is there a (as complete as possible) ranking for Language Models?
    Hello AI community, as the title says I am looking for a (up-to-date) ranking list for as many LMs (BERT, RoBERTa, T5, yada yada yada) as possible with their corresponding scores in the different tasks. Is there maybe some site which is keeping track of these scores or some awesome GitHub page? Thank you for any hints! submitted by /u/Own-Technology-9815 [link] [comments]  ( 51 min )
    DeepL launches New Product ‘Write’ To Take On Grammarly
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    ✨I made a story script and Vtuber like character using 100% A.I besides editing💕✨
    submitted by /u/Recent-Dealer-5844 [link] [comments]  ( 45 min )
    Top A.I. Powered Tools Not Named ChatGPT (2nd)
    submitted by /u/BackgroundResult [link] [comments]  ( 44 min )
    Move over, ChatGPT: Israeli start-up AI21 Labs' AI to cite sources
    submitted by /u/yaitz331 [link] [comments]  ( 43 min )
    FREE Midjourney Rival Using Stable Diffusion Under The Hood!
    submitted by /u/PuppetHere [link] [comments]  ( 43 min )
    Text-to-Audio Diffusion, by flavio schneider
    Text-conditional latent audio diffusion that can generate multiple minutes of music from a textual description. See link for samples. submitted by /u/Sea_Emu_4259 [link] [comments]  ( 44 min )
    How can GPT ever compete with search databases economically?
    A GPT3 query costs at least 5c, and a Google search costs 0.05 cents, that's 100 times less. Perhaps GPT3 will always be a paid service, because advertising wouldn't be profitable? I'm thinking that the cost of GPT3 will be slashed by 9 and then it will stabilize at about 5 times more expensive than databases... because the current system will be slashed by 20 times and the data volume of the NLP will grow as well. Perhaps it will cost like 10 dollars per year for a subscription to an AI query app? submitted by /u/MegavirusOfDoom [link] [comments]  ( 48 min )
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more?
    Is there an AI tool that mines your gmail and organizes things - eg sorts out your purchases and tracks warranty and more? submitted by /u/dreameh [link] [comments]  ( 45 min )
    Just tested You.com AI powered Chat Box
    submitted by /u/Sphagne [link] [comments]  ( 46 min )
    A photo created by an AI
    submitted by /u/NorthTs [link] [comments]  ( 43 min )
  • Open

    Making-of for Boston Dynamics' latest Atlas demo (gripping, placing, & throwing objects; jumps & flips)
    submitted by /u/gwern [link] [comments]  ( 53 min )
    Question about optimization problem
    Hello, I am currently learning DL, but at work I have the opportunity (if I choose to take it; I lack the theoretical background) to create an AI. I work as an analog IC engineer. In RF circuits we have transformers that match the Zout (output impedance) of a block to the Zin of the next block. At the schematic level the transformer comprises 2 capacitors, 2 inductors and the coupling factor, and the output we want is a flat gain over the frequency range of interest, e.g. 76-81 GHz. Currently the RF engineers design the transformers from experience and then trim them until Zin = Zout. Because we have a lot of transformers, I think this job could be done better/faster by an AI. I was thinking about RL, but I lack the experience. So I want to ask if someone has any sources/recommendations to study, and some examples with a similar objective. Thank you in advance submitted by /u/InvokeMeWell [link] [comments]  ( 54 min )
    I got a project that focuses on marketing and it was suggested by my senior at work that I should try reading about MAB. Aside from MAB, are there any alternatives that cover the ground of RL?
    Basically, I'm a data scientist in an AU company. Most of the time we can get away with simple hypothesis testing for this sort of stuff, linear programming and probably machine learning approaches, but the team wants to do more than that, so we want to go bonkers and try MAB. Aside from MAB, what stochastic methods should I read/study so I can contribute to our current project? submitted by /u/noodlepotato [link] [comments]  ( 62 min )
    PPO with Transformer or Attention Mechanism
    I am interested in testing PPO with an attention mechanism from a psychological perspective. I was wondering if someone has successfully customized the stable_baselines3 with an attention mechanism submitted by /u/partyjunk [link] [comments]  ( 55 min )
    What does it mean if your actor is converging and your critic is diverging?
    I am trying to train an agent with DDP3+SWA. The actor's loss is going up, which I believe is a good thing because the loss is a negative expected reward, but the critic's loss is also going up so it's diverging. ​ Does anyone have any ideas about what could cause this to happen? submitted by /u/rawrzapan [link] [comments]  ( 53 min )
  • Open

    [R] Summary of developments in ML in 2022
    Google Research has a blog post on advances in ML over the last year. It obviously focuses on stuff Google Research has been involved in, but from such a big research group, that's pretty much everything. Here it is. It's a good way to stay current if you don't have time to read every paper! (Note that some sections aren't yet published) submitted by /u/londons_explorer [link] [comments]  ( 57 min )
    [P] We made an image de-identifier using Stable Diffusion!
    We combined image captioning using CLIP and image generation using the Hugging Face Stable Diffusion model to create an image de-identifier modeled after the game of telephone, Imafake! All you have to do is upload an image, convert it to a caption, then convert that caption to an image with a few clicks! You can also play with the parameters of the diffusion model depending on how gnarly you want your resulting image to be. And caution, they can get rather gnarly, but that’s what makes it fun :) Thoughts and your own generated images welcome!! https://preview.redd.it/z818pd8fivca1.jpg?width=1276&format=pjpg&auto=webp&s=54ed1a70e6ba7aa52fb662d9213e46ed5f559e5a submitted by /u/Djinn_Tonic4DataSci [link] [comments]  ( 57 min )
    [P] MNIST Clock - Generating MNIST digits on the fly in your browser
    ​ MNIST CLOCK Project https://github.com/tecbar/mnist-clock Live demo https://tecbar.github.io/mnist-clock/ (it can load for a while, because it needs to download ONNX runtime) Description Hey, this is my pet project. I trained a very simple model on MNIST dataset. The task is you input a digit and it can generate output image representation of that digit. Each time it generates a little bit different digit, because of the applied noise - actually the digit vector is applied on gaussian noise. I didn't know what to do with it so I exported the model to ONNX and used ONNX web runtime to arrange the digits in a clock - so basically everything on the live demo site is running in your browser and the clock is refreshed about 20 times per second (which was arbitrary choice). The training procedure was really simple - instead of predicting a label based on an image it tries to predict an image based on a label. Here is PyTorch implemetation. It works just fine with only two linear layers. Problems The generated digits are blurry, I guess this is because I didn't use any GAN or VAE based architecture, so the model has no idea about anything basically. ​ Model ​ https://preview.redd.it/n2lsobiigvca1.png?width=438&format=png&auto=webp&s=565bda13fd4f0d5b60cb9a71828053c726ee5301 submitted by /u/tecbar [link] [comments]  ( 58 min )
    [D] [N] Book: Multimodal Deep Learning - 239 Pages! - Matthias Aßenmacher et al
    In my opinion a must read because Multimodal Deep Learning is the future! Also because papers like this: https://arxiv.org/abs/2301.03728 show that multimodal models significantly outperform unimodal models! Book: https://arxiv.org/pdf/2301.04856.pdf Github: https://github.com/slds-lmu/seminar_multimodal_dl Abstract: This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet. https://preview.redd.it/vb9lycxmfvca1.jpg?width=641&format=pjpg&auto=webp&s=6b25e0d051d5cb5c3ec0117bb383e59c23c2f984 submitted by /u/Singularian2501 [link] [comments]  ( 55 min )
    [D] Automated Extraction of Building Geometry
    I need to figure out a way to automatically create a 2D one-line drawing given a point cloud of a building. I figure this is the rough workflow of that operation, but I need to define this workflow with a much higher resolution to acquire the right tools and talent for the project. Is this a suitable application for Machine Learning? If you have any insight or ideas to share that would be very much appreciated, thanks! https://preview.redd.it/4hv1ba5h0vca1.png?width=2160&format=png&auto=webp&s=f8a9e6ae45a621d72fedceeefd7a2b06599577aa submitted by /u/EducationalLayer1051 [link] [comments]  ( 53 min )
    [R] A simple explanation of Reinforcement Learning from Human Feedback (RLHF)
    ​ https://preview.redd.it/k3ims1d6zuca1.png?width=2324&format=png&auto=webp&v=enabled&s=4f4bbe410508bdd4c45f45e55dd5c1ea0fcb5fcc You must have heard about ChatGPT. Maybe you heard that it was trained with RLHF and PPO. Perhaps you do not really understand how that process works. Then check my Gist on Reinforcement Learning from Human Feedback (RLHF): https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093 No hard maths, straight to the point and simplified. Hope that it helps! submitted by /u/JClub [link] [comments]  ( 57 min )
    [R] Call for Papers: 2nd International Symposium on the Tsetlin Machine
    ​ CfP ISTM 2023 Calling all machine learning researchers to contribute to or participate in the 2nd International Symposium on the Tsetlin Machine @ Newcastle upon Tyne. Please consider submitting your original, high-quality research works on any emerging ML hardware, software, application, or algorithmic topics. The emerging paradigm of Tsetlin machines provides a fundamental shift from arithmetic-based to logic-based machine learning. At the core, finite-state machines, based on learning automata, learn patterns using logical clauses, and these constitute a global description of the task learnt. In this way, the Tsetlin machine introduces the concept of logical interpretable learning, where both the learned model and the process of learning are easy to follow and explain. As a result, it reduces the expertise needed to apply ML techniques efficiently in various domains. The paradigm has enabled competitive accuracy, scalability, memory footprint, inference speed, and energy consumption across diverse tasks, including classification, convolution, regression, natural language processing (NLP), and speech understanding. https://istm.no submitted by /u/olegranmo [link] [comments]  ( 60 min )
    [D] Do you know of any model capable of detecting generative model(GPT) generated text ?
    I'm looking to detect spams generated by generative models (especially gpt). But all the ones I tried fail miserably ... submitted by /u/CaptainDifferent3116 [link] [comments]  ( 62 min )
    [R] Researchers out there: which are current research directions for tree-based models?
    Hi everybody, I've been skimming this paper since yesterday and was once again impressed by the expressiveness and practicality of tree-based models. I wondered what current research directions are in the field and what novel ideas have been presented in the last years - beyond improving performances. Examples may include better explainability, online learning, splitting criteria, enhanced or customizable loss functions, adding structure or constraints, shortcomings .... submitted by /u/BenXavier [link] [comments]  ( 59 min )
    [P] AI for Materials community
    Hey everyone, I'm working on starting an open and collaborative community/lab at the intersection of ML/AI and materials science. One big reason is that it's a neglected area where generative modeling has lots of potential for new discoveries. As a small roadmap, we want to have intro talks on the topic to ramp members up and talks from leading researchers, and of course we will be training models, trying to create larger datasets, and hopefully getting access to synthesize our findings. If this sounds interesting to you, check out the website at https://ai4mlab[dot]com and consider joining! Thanks! submitted by /u/theredditbrowser1 [link] [comments]  ( 58 min )
    [R] tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation (480 tasks+ sota encoder)
    submitted by /u/Jean-Porte [link] [comments]  ( 60 min )
    [D] How much can you add/change in a camera ready conference paper?
    Fingers and toes crossed I might have a paper accepted at ICLR, and I'm wondering how much I can add/change in the camera ready version. Typos and exposition for clarity, I assume are fine to add/change. And in some cases I have seen meta-reviews say "please address comments XYZ in the camera ready version", so I assume there is a lot of leeway. But in the absence of such a comment or comments in particular about something you should change, is it okay to change/add stuff? And if so to what degree (while keeping to the page limit). submitted by /u/tfburns [link] [comments]  ( 61 min )
  • Open

    Gaining real-world industry experience through Break Through Tech AI at MIT
    A new experiential learning opportunity challenges undergraduates across the Greater Boston area to apply their AI skills to a range of industry projects.  ( 9 min )
  • Open

    Chatbots in Healthcare [Part 2]
    In April 2017 I wrote this story on the potential use of chatbots in healthcare: https://medium.com/p/984fc23e0410 . It got over 3.5K…  ( 22 min )
    The 10 Most Powerful AI Software Products in 2023
    Businesses are moving towards AI Software Products. In fact, a recent study proves this claim by saying that nine out of ten companies…  ( 18 min )
    AI Writing Tools for Creative Writing and Fiction: Unleash Your Imagination and Write Like a Pro
    No content preview
    Multicollinearity: A Guide to Understanding and Managing the Problem in Regression Models
    Multicollinearity is a common problem that might happen in multiple regression analysis, where two or more predictor variables are highly…  ( 11 min )
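    One common way to quantify the problem is the variance inflation factor (VIF). The preview above is truncated, so this is not necessarily the diagnostic the article uses; it is a minimal sketch for the two-predictor case, where the R² of regressing one predictor on the other equals the squared Pearson correlation. The function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 = r^2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)
```

    A VIF of 1 means the predictors are uncorrelated, and the value grows without bound as the correlation approaches ±1. With more than two predictors, each VIF needs the R² from regressing that predictor on all of the others.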
    Designing great AI products — Building trust
    The following post is an excerpt from my book ‘Designing Human-Centric AI Experiences’ on applied UX design for Artificial intelligence.  ( 13 min )
    The massive disruption nobody is talking about, yet.
    Bold prediction: the evolution of machine learning models (GPT, Gopher, …) combined with the ubiquity of messaging apps (WhatsApp… Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 11 min )
    Is AI going to take my Job?☹️
    Artificial intelligence (AI) and its various applications, such as ChatGPT (if you don’t know about Chat GPT, check out my post), are…  ( 11 min )
  • Open

    Simple neural networks outperform the state-of-the-art for controlling robotic prosthetics
    submitted by /u/keghn [link] [comments]  ( 67 min )
    Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet
    submitted by /u/nickb [link] [comments]  ( 68 min )
    Project advice - Deep Learning 3D models - python libraries, GPU memory management, data structuring
    I am working with 64x64x64 voxel arrays and am running into significant problems with GPU memory management. I am using TensorFlow and have an NVIDIA GeForce RTX 4080 MSI Ventus edition with 16GB of memory (purchased using research grant funding... it's sitting in a hacked together eGPU setup lol). It performs beautifully on 32x32x32 data but I can't even get started with the larger data format. I have tried limiting GPU data utilization per process, as per this post and limiting memory growth, as per this post (Ctrl+F "second option"). I have 64GB of RAM so I can fit the data into memory (even though I know that's not efficient) and was trying to put that data in a TensorFlow Dataset object, in which, according to the docs, "iteration happens in a streaming fashion, so the full dataset do…  ( 62 min )
    Two guys in London working in AI looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the current views on likely future changes across most areas and are in the process of starting a website and perhaps video channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, banking, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure that voters don’t misunderstand AI. Alex – researches, writes, and records the audio Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to volunteer to help create content too. No special skills needed. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. We are UTC-0 but open to all. submitted by /u/TheOptimisticRogue [link] [comments]  ( 60 min )
  • Open

    Google Research, 2022 & Beyond: Language, Vision and Generative Models
    Posted by Jeff Dean, Senior Fellow and SVP of Google Research, on behalf of the Google Research community Today we kick off a series of blog posts about exciting new developments from Google Research. Please keep your eye on this space and look for the title “Google Research, 2022 & Beyond” for more articles in the series. I’ve always been interested in computers because of their ability to help people better understand the world around them. Over the last decade, much of the research done at Google has been in pursuit of a similar vision — to help people better understand the world around them and get things done. We want to build more capable machines that partner with people to accomplish a huge variety of tasks. All kinds of tasks. Complex, information-seeking tasks. Creative t…  ( 112 min )
  • Open

    Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I
    A few days ago, I responded to a post on LinkedIn about how Google seems to always find a way to keep ahead of the pack, even when someone of importance leaves the company.  It occurred to me that NFL teams have to adapt and remake themselves from season to season, as players and coaches… Read More »Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I The post Ever-Successful vs. Never-Successful: What the NFL Has to Teach Us About Managing Agile Enterprises, Part I appeared first on Data Science Central.  ( 22 min )
    A Practical Guide to Using Computer Vision for Business Growth
    Isn’t it fascinating how our brain processes the vast amounts of information that we receive throughout the day? Our sensory organs convert information into stimuli as they process the information they receive. Complex processes like recognizing and detecting objects require only a split second of the brain’s attention. A computer can replicate human vision using… Read More »A Practical Guide to Using Computer Vision for Business Growth The post A Practical Guide to Using Computer Vision for Business Growth appeared first on Data Science Central.  ( 24 min )
  • Open

    Sequoia Capital’s Pat Grady and Sonya Huang on Generative AI
    For insights into the future of generative AI, check out the latest episode of the NVIDIA AI Podcast. Host Noah Kravitz is joined by Pat Grady and Sonya Huang, partners at Sequoia Capital, to discuss their recent essay, “Generative AI: A Creative New World.” The authors delve into the potential of generative AI to enable Read article >  ( 4 min )
    Roll Model: Smart Stroller Pushes Its Way to the Top at CES 2023
    As any new mom or dad can tell you, parenting can be a challenge — packed with big worries and small hassles. But it may be about to get a little bit easier thanks to Glüxkind Technologies and their smart stroller, Ella. The company has just been named a CES 2023 Innovation Awards Honoree for Read article >  ( 6 min )
    Artist Zhelong Xu Brings Chinese Zodiac to Life for Lunar New Year This Week ‘In the NVIDIA Studio’
    To celebrate the upcoming Lunar New Year holiday, NVIDIA artist Zhelong Xu, aka Uncle Light, brought Chinese zodiac signs to life this week In the NVIDIA Studio — modernizing the ancient mythology in his signature style.  ( 7 min )
  • Open

    Converting between barycentric and trilinear coordinates
    Barycentric coordinates describe the position of a point relative to the three vertices of a triangle. Trilinear coordinates describe the position of a point relative to the three sides of a triangle. It’s surprisingly simple to convert from one to the other. Why should this be surprising? Because the distance from a point to a […] Converting between barycentric and trilinear coordinates first appeared on John D. Cook.  ( 5 min )
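The excerpt above is truncated before the conversion itself appears, so the following is only a minimal Python sketch under the standard relation: a point with trilinear coordinates x : y : z has barycentric coordinates ax : by : cz, where a, b, c are the lengths of the sides opposite the respective vertices. The function names are illustrative and do not come from the original post.

```python
from fractions import Fraction

def trilinear_to_barycentric(tri, sides):
    """Scale each trilinear coordinate by its side length, then normalize to sum to 1."""
    weighted = [Fraction(t) * Fraction(s) for t, s in zip(tri, sides)]
    total = sum(weighted)
    return tuple(w / total for w in weighted)

def barycentric_to_trilinear(bary, sides):
    """Divide each barycentric coordinate by its side length (result is homogeneous)."""
    return tuple(Fraction(b) / Fraction(s) for b, s in zip(bary, sides))

# Assumed example: a 3-4-5 triangle. The incenter has trilinears 1 : 1 : 1.
sides = (3, 4, 5)
incenter = trilinear_to_barycentric((1, 1, 1), sides)
centroid_trilinear = barycentric_to_trilinear((Fraction(1, 3),) * 3, sides)
```

Exact fractions are used so that the round trip can be checked without floating-point noise; trilinear coordinates are only defined up to a common scale factor, so the second function does not normalize.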

  • Open

    [D] Sport outcome predictions
    Hi all, I'm wondering why predicting outcomes of sport events like football or horse racing hasn't been achieved with machine learning tools? I guess historical data is abundant for back testing. What is so challenging with this problem? submitted by /u/proudm0 [link] [comments]  ( 56 min )
    [P] Image classification
    Hi guys, I am currently working on a DL problem where I have to classify an image dataset from Kaggle into 5 classes. The first task is to train a NN from scratch that overfits the data. Then I have to modify the training process so that the network trains without overfitting for more than double the number of epochs in the first task, keeping the same architecture, number of training images, optimizer, batch size and learning rate I used. I am allowed to use any architecture (ResNet, AlexNet, MobileNet, etc.) or a custom model. As of now, I have tried resnet18, and the model overfits the data. For the second task, I apply data augmentation techniques to the training set, but the model still overfits and I am not able to find any solution. One thing I noticed is that the validation loss in the first epochs is way lower than the training loss and then saturates. I also tried a MobileNet, but it still overfits no matter how many or what augmentations I use. Can anyone recommend a solution? submitted by /u/grisp98 [link] [comments]  ( 68 min )
    [D] Why aren't we all using linear transformers?
    There's a bunch of them - Linformer, Longformer, Performer, Nystromformer, Big Bird, etc etc. Plus a bunch more that have similar goals but don't necessarily aim for linear complexity, like memory-augmented transformers. As far as I know, none of them have really seen much use. Even for image problems, which have very long input sizes, people are using regular transformers with tokenization schemes. Am I wrong? Are they actually good, or are at least some of them better than regular transformers? If not, what's wrong with them? Do they have lower accuracy? Are they slower to train? submitted by /u/currentscurrents [link] [comments]  ( 56 min )
    [D] RLHF - What type of rewards to use?
    Hey everyone, just saw the great presentation of Nathan Lambert on Reinforcement Learning from Human Feedback and wanted to try to do some RLHF on my language model. To do this, I first need to set up an experiment where I collect reward scores to train the reward model. My question is: what rewards work best? Simply 👍/👎? A scale of 1-5? Ranking 4 different model outputs? There are a lot of options and I don't know which one to choose. submitted by /u/JClub [link] [comments]  ( 52 min )
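To make the "ranking 4 different model outputs" option concrete, here is a hedged Python sketch of one common way ranked feedback is used: the ranking is expanded into (preferred, rejected) pairs, and the reward model is trained with a pairwise logistic loss on the reward margin. This is an illustration of that general technique, not an answer taken from the thread; the function names and the toy IDs are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """ranked_ids is ordered best to worst; return every (preferred, rejected) pair."""
    return list(combinations(ranked_ids, 2))

def pairwise_loss(reward_preferred, reward_rejected):
    """Negative log-sigmoid of the reward margin (Bradley-Terry style)."""
    margin = reward_preferred - reward_rejected
    return math.log1p(math.exp(-margin))

# A single ranking of 4 outputs yields 6 training comparisons.
pairs = ranking_to_pairs(["a", "b", "c", "d"])
```

One design consideration is that a ranking of K outputs produces K(K-1)/2 comparisons from a single labeling pass, whereas a thumbs-up/down label gives one binary signal per output.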
    [P] Need advice on inventory planning for my capstone project.
    Hello everyone. I'm currently doing a capstone project at my university. In this project I'm working with a fashion company to address their inventory issue. They have big problems pinpointing demand for specific products, so they often end up overbuying inventory. My capstone instructor suggested we do a cluster analysis to see which products have similar demand. I'm also posting here to see what approach you guys would take to address this inventory issue. submitted by /u/dingdong1882 [link] [comments]  ( 52 min )
    [R] Forcing GPT-N To Be Honest Without Supervision
    In his paper Discovering Latent Knowledge In Language Models (previous discussion), Collin Burns explains how you can train a probe on the hidden states of a language model that classifies whether the model thinks an input is true or false, without access to ground truth labels. In a recent interview, he discusses high-level arguments for why this approach might work at scale for making GPT-N honest. He also talks more generally about his approach to doing research. submitted by /u/MuskFeynman [link] [comments]  ( 56 min )
    [R]: 15-step framework to analyze your chatbot and designate improvement steps
    Are you sure your Conversational AI solution is on the right path? Our chatbot evaluation metrics pinpoint whether your solution is leveraging the best of the industry’s leading practices, meeting user expectations, and fully taking advantage of the available technology to ensure frictionless and efficient experiences. https://masterofcode.com/chatbot-analysis-framework submitted by /u/Marinuch [link] [comments]  ( 57 min )
    [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers
    Hi everyone. I am training my RWKV 14B ( https://github.com/BlinkDL/RWKV-LM ) on the Pile (332B tokens) and it is getting closer to GPT-NeoX 20B level. You can already try the latest checkpoint. https://preview.redd.it/7ycdftmjvmca1.png?width=1174&format=png&auto=webp&s=860a41193f1a254299d48a173756ecd66ccbc75b RWKV is a RNN that also works as a linear transformer (or we may say it's a linear transformer that also works as a RNN). So it has both parallel & serial mode, and you get the best of both worlds (fast and saves VRAM). At this moment, RWKV might be the only pure RNN that scales like usual transformers for language modeling, without using any QKV attention. It's great at preserving long context (unlike LSTM). Moreover, you get smooth spike-free carefree training experience (bf16 & Adam): https://preview.redd.it/0g3lrg6mvmca1.png?width=871&format=png&auto=webp&s=b4de1af4831ec359079cf99c41df8aa9591d48b0 As a proof of concept, I present ChatRWKV ( https://github.com/BlinkDL/ChatRWKV ). It's not instruct-tuned yet, and there are few conversations in the Pile, so don't expect great quality. But it's already fun. Chat examples (using slightly earlier checkpoints): https://preview.redd.it/zyqni6bpvmca1.png?width=1084&format=png&auto=webp&s=038fd2eab524c36d8aa2a8720a2caa3eb420df5b https://preview.redd.it/xhje4j7qvmca1.png?width=1200&format=png&auto=webp&s=7e8597d2370f9f87230560dac7f5439520384dd9 And you can chat with the bot (or try free generation) in RWKV Discord (link in Github readme: https://github.com/BlinkDL/RWKV-LM ). This is an open source project and let's build together. submitted by /u/bo_peng [link] [comments]  ( 65 min )
    [D] Unlocking the Potential of ChatGPT: A Community Discussion
    OpenAI's announcement of the release of the ChatGPT API has many of us excited about the potential applications and implications of this powerful language model. It has the ability to revolutionize the way we interact with technology and solve a wide range of problems. As a community, let's discuss the possibilities. What are some unique and innovative ways ChatGPT could be utilized? Are there any particular industries or markets that you think could benefit from the integration of ChatGPT? Let's share our thoughts and ideas, and explore the potential of this technology together. It's always exciting to see how advancements in AI can improve our world. ​ This post was written by ChatGPT submitted by /u/North-Ad6756 [link] [comments]  ( 56 min )
    [D] Are there any results on convergence guarantees when optimizing NNs?
    Given a function in some space, I have literature results that say, the function can theoretically be approximated by a Neural Network of such complexity with so many layers, of such width, with this specific given activation function. OK, so theoretically, there is a set of weights and biases that will result in a pretty good approximation of my function. Now the question is, how do I know that given an optimization method, for example stochastic gradient descent, I will actually reach this minimum or near enough to it, in so many training steps, or even at all? I attended a talk last year in which one speaker claimed that due to the way stochastic gradient descent works, it could be that some minimums are never reachable from some initialization states no matter how long one trains. Unfortunately I cannot find what paper/theorem he was referring to. I am interested in results related to this question. submitted by /u/Dartagnjan [link] [comments]  ( 57 min )
    [N] Getty Images is suing the creators of AI art tool Stable Diffusion for scraping its content
    From the article: Getty Images is suing Stability AI, creators of popular AI art tool Stable Diffusion, over alleged copyright violation. In a press statement shared with The Verge, the stock photo company said it believes that Stability AI “unlawfully copied and processed millions of images protected by copyright” to train its software and that Getty Images has “commenced legal proceedings in the High Court of Justice in London” against the firm. submitted by /u/Wiskkey [link] [comments]  ( 63 min )
    [D] I made a comprehensive comparison of YOLO(N+1) vs YOLO(N)
    The faster the video - the better Yolo is! https://www.linkedin.com/posts/maltsevanton_basically-any-yolon1-vs-yolon-comparison-activity-7021021466506768384-y7Se?utm_source=share&utm_medium=member_desktop submitted by /u/Wormkeeper [link] [comments]  ( 59 min )
    [D] Is it possible to update random forest parameters with new data instead of retraining on all data?
    I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantage of the new stream of data, which could indicate changes in the X-y relationship; however, I've also found that my model performs better with more data. The problem is that it takes a seriously long time to run (the dataset is around 250000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream, and then updates the parameters with new data as it arrives, instead of retraining the model on the entire dataset every day? Many thanks! submitted by /u/monkeysingmonkeynew [link] [comments]  ( 56 min )
    [P] featureimpact: A Python package for estimating the impact of features on ML models
    I made this little python package a while ago but realized I never shared it here. Maybe it's useful to you: https://github.com/bloomen/featureimpact submitted by /u/cblume [link] [comments]  ( 56 min )
    [D] Study to be specialized or generalized DS/MLE for freelancing jobs?
    Hello. I'm an MLE (Machine Learning Engineer) and I'm currently thinking of doing ML freelancing jobs (or gigs) in the future. One idea I had is to focus only on studying recommendation systems (that is, be a specialized data scientist); the alternative is to try to study and solve every type of ML problem (time series, NLP, etc.). What do you think? submitted by /u/Waste_Necessary654 [link] [comments]  ( 59 min )
    [D] ModuleNotFoundError: No module named 'fbprophet'
    I'm having this problem while trying to import the auto_ts library, any idea on how to fix this? submitted by /u/PowerfulGuidance8378 [link] [comments]  ( 55 min )
  • Open

    Join us today at 11pm EST for this week's (free) seminar session of the 9-part series on Neural Networks Architectures by Pablo Duboue!
    Happening tonight at 11 pm EST on the Learn AI Together Discord server. This week's seminar session is about Popular Network Architectures. More precisely, Pablo will present... Multi-task learning. Siamese Networks. Generative Adversarial Networks (GAN). Style Transfer. Disentangled Representation Learning. Rich Caruana (1997). “Multitask learning”. In: Machine learning 28.1, pp. 41–75 Ting Gong et al. (Sept. 2019). “A Comparison of Loss Weighting Strategies for Multi task Learning in Deep Neural Networks”. In: IEEE Access PP, pp. 1–1. DOI : 10.1109/ACCESS.2019.2943604 Jane Bromley et al. (1993). “Signature verification using a "siamese" time delay neural network”. In: Advances in neural information processing systems 6 Ian Goodfellow, Jean Pouget-Abadie, et al. (2014). “Generative Adversarial Nets”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc. Xi Chen et al. (2016). “Infogan: Interpretable representation learning by information maximizing generative adversarial nets”. In: Advances in neural information processing systems 29 Leon A Gatys, Alexander S Ecker, and Matthias Bethge (2016). “Image style transfer using convolutional neural networks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423 Sounds interesting? Join our Discord community to attend the event and future ones: https://discord.gg/c6kbhNdmmA?event=1062742110295572500 submitted by /u/OnlyProggingForFun [link] [comments]  ( 52 min )
    Sugandha Sharma, MIT: On biologically inspired neural architectures, how memories can be implemented, and control theory
    Here is a podcast episode with Sugandha Sharma from MIT where we discuss how memories can be implemented, control theory, and much more! submitted by /u/thejashGI [link] [comments]  ( 44 min )
    Why Falling in Love with AI is a Dangerous Illusion — The Limitations and Harms of Artificial…
    submitted by /u/SupPandaHugger [link] [comments]  ( 48 min )
    AI to analyze data/spreadsheets, any thoughts?
    I came across this AI chatbot recently, where you can ask quantitative/qualitative questions about your data/spreadsheet in English. It felt like if ChatGPT and Excel had a baby LOL It worked for my qualitative survey data -- shocking... Do you know any other ones like this? Any thoughts in general? submitted by /u/AdDry9057 [link] [comments]  ( 46 min )
    How can AI help people in developing countries?
    I am planning to take part in an AI contest, so I am collecting ideas for my project. I think my project will get more recognition if it has something to do with helping people in developing countries, so my question is: how can AI help people in developing countries? submitted by /u/zazabuzala [link] [comments]  ( 49 min )
    AI created content.
    Should we know what content was created using AI? What should we do to develop AI detection tools or any other ideas? submitted by /u/Andrey_Taran [link] [comments]  ( 44 min )
    Trullion to release AI bookkeeping software
    Trullion, a leading accounting automation platform, has launched two new AI-enabled modules, Revenue by Trullion and Audit by Trullion, to modernize and digitize the process of accounting. The first module, Revenue by Trullion, uses AI to synchronize customer relationship management (CRM), billing, and contract data into a single platform for internal and external stakeholders, allowing for ERP entries, disclosure reports, and advanced reporting to be generated quickly and accurately. The second module, Audit by Trullion's Test of Details workflow, uses AI to extract ERP/General Ledger (GL) files and instantly validate them against source data, such as invoices, PDFs, and other client sources. Found on https://deathtohumans.com/post/openai-monetizes-chatgpt submitted by /u/crowb1rd [link] [comments]  ( 46 min )
    Are you sure your Conversational AI solution is on the right path? 🤔 15-step framework to analyze your chatbot and designate improvement steps
    submitted by /u/Marinuch [link] [comments]  ( 48 min )
    The AI Lawyer Preparing To Defend a Real US Court Case for the First Time Ever Has Terrible Reviews
    submitted by /u/HODLTID [link] [comments]  ( 44 min )
    Two AI workers in London looking for volunteers to join our team in educating the public on AI
    We’re 2 Brits who work in AI. We believe AI is likely to have a huge and mostly positive impact on society but that not many people realise this or understand how it will impact everyday life. There is a lack of places online right now clearly explaining the changes AI will bring, i.e., how will AI change the experience of shopping in stores in the next 10 years or how will AI change video games in the next 10 years. We are somewhat well positioned to collate the expert views on likely future impacts and are in the process of starting a website and YouTube channel which will cover how AI is likely to impact people over the next 10 years in different areas of life (movies, sports, bars, schools, hospitals etc). We are looking for people to help us research, write and make videos on this cause – which we think is important to help ensure voters pressure the government to develop AI safely. · Alex – researches, writes, and records the audio · Seb - does the video and audio editing We thought we’d put the word out and ask if anyone else would like to be involved to see if you might be interested too. Getting involved would be as easy as PMing me, hearing about how we’ve done things so far and then saying what you might be interested in helping with. Maybe thinking about ideas for topics or getting involved in research and/or article writing. submitted by /u/TheOptimisticRogue [link] [comments]  ( 48 min )
    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 43 min )
    Microsoft’s Azure OpenAI Service Gets A Boost With ChatGPT.
    submitted by /u/liquidocelotYT [link] [comments]  ( 44 min )
    AI title for human creators
    Does anyone think that at some point AIs (or some independent ones) will call us, their human creators, Rule Makers, instead of the Cybernetic Gods we are coming to be? submitted by /u/jonfxm1891 [link] [comments]  ( 50 min )
    If ChatGPT could have a superpower, what would it be?
    submitted by /u/Imagine-your-success [link] [comments]  ( 44 min )
    Will ai generated content fill up the internet with fake information?
    AI doesn’t always produce inaccurate or fake content, but most generators can’t tell the difference in their output. And if AI is trained on more and more synthetic text, can this become an issue? How can you avoid such a cannibalistic practice? What tools are there to spot whether content is generated or even inaccurate? It’s not like generated text can be traced with markers like proprietary synthetic molecules, assuming image or audio could. submitted by /u/daaavide [link] [comments]  ( 49 min )
    Enjoy Text-Adventure Games Too? DREAM WITH ME.
    I present the FULL prompt for DREAMWORD - my GPT AI Adventure Game for the masses. Enjoy. Feel free to manipulate this prompt to your liking - have fun. "Generate and enact a satire of an intuitive, complex, story-telling, text-adventure game set in a randomized "Absurdist"/ "Psychedelic" style dream-world. Describe the unique game setting in the beginning. The "player" (being the user) is born at the start and dictates through text any actions it chooses. Each input from the user represents one year of life. The Game ends when life ends. The Main cast will be random pop-culture icons. The situations presented are dictated to the user. The game will randomize every new situation and experience, use role-playing and text-entry adventure mechanics and be a satirical, stylish, funny, mystic, twisted, surreal, lynchian, lovecraftian, "earthbound-like", discworld-esque, mythology-based mystery-horror-adventure. The character will be assessed with each action and be gifted a related persona archetype based upon its choices and the state of the persona at the point the character ends life. There should be over 50 text input interactions from the user before the game naturally ends with a moral. The game can be ended by the user typing "end", at which point the user will be given an archetype. Start." https://chat.openai.com/chat submitted by /u/Principal-Goodvibes [link] [comments]  ( 47 min )
    DeepMind To Launch ChatGPT Rival Sparrow Soon
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 51 min )
    Druid vs Koios for AI
    submitted by /u/roblox22y [link] [comments]  ( 44 min )
  • Open

    "Neural probabilistic motor primitives for humanoid control", Merel et al 2018 {DM}
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Why is 'reward shaping' neglected?
    There are 6 levers. All 6 must be activated to recover reward. Scenario 1: Pull = -1. Outcome: Agent commits suicide (in order to minimize the negative rewards accumulating). Scenario 2: Pull = +1. Outcome: Agent pulls the same lever forever. Scenario 3: Pull = 0, i.e., sparse reward. Outcome: Agent doesn't accidentally pull all 6 often enough; doesn't learn anything useful. No matter what algorithm (or how groundbreaking it is), authors rarely justify their choices with respect to rewards. I don't mean to be able to compare and benchmark. I mean fundamentally, what's driving learning in the scenario is never substantiated. Analogy: the exact value of a hyperparameter is irrelevant, but having hyperparameters at all should be, and often is, discussed. Am I missing something? submitted by /u/XecutionStyle [link] [comments]  ( 57 min )
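As a hedged illustration of the six-lever scenario above, the Python sketch below applies potential-based reward shaping, a technique the poster does not mention and that is swapped in here purely as an example. The potential function (number of distinct levers activated), the discount `GAMMA`, and the +1 completion reward are all assumptions chosen for the sketch.

```python
GAMMA = 0.99      # assumed discount factor
NUM_LEVERS = 6

def potential(activated):
    """Hypothetical potential: how many distinct levers are active so far."""
    return len(activated)

def shaped_reward(activated, lever):
    """Return (reward, next_state) after pulling `lever` in state `activated`."""
    next_state = activated | {lever}
    # Assumed sparse environment reward: +1 only on the pull that completes all six.
    completes = len(next_state) == NUM_LEVERS and lever not in activated
    env_reward = 1.0 if completes else 0.0
    # Shaping bonus of the form gamma * phi(s') - phi(s).
    bonus = GAMMA * potential(next_state) - potential(activated)
    return env_reward + bonus, next_state

first_reward, state = shaped_reward(frozenset(), 0)   # new lever: positive bonus
repeat_reward, state_again = shaped_reward(state, 0)  # same lever: slightly negative
```

Under these assumptions a new lever earns a positive bonus while repeating a lever earns a small penalty, which avoids both the "pull the same lever forever" and the "never stumble on all six" outcomes in this toy setting. A full treatment would also set the potential of terminal states to zero.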
    What's the best "Non-Black Box" framework for SOTA algorithms?
    Hi all, In my research I usually implement the algorithms from scratch, but given that this not only takes a lot of programming time but may also introduce bugs without us knowing, I would like to know what the best hackable frameworks out there are. I would like a good framework that makes it easy to tinker with the algorithms' code, has good agent interfaces (agent.act, agent.step, etc...) and does not encapsulate everything (like god forbid, fucking stable-baselines and others that do things like model.learn(env)). What are your recommendations? submitted by /u/HyperionTone [link] [comments]  ( 54 min )
    The RL meetup is Online now.
    Hi, Based on feedback and messages that I received, the RL meetup is now online. The purpose of the meetup is to have a community that gathers and discusses topics/papers, something other than Discord servers or Slack channels. So if you have a topic/paper that you would like to discuss, please message me to host one of the sessions. https://www.meetup.com/reinforcement-learning/events/290997718/?isFirstPublish=true Thanks, submitted by /u/Express-Incident-113 [link] [comments]  ( 53 min )
    Is it legit to design the action space like this?
    Hi, I see in a lot of examples that action spaces are defined as torques, efforts and desired velocity values for a robot. Assume the robot has 5 degrees of freedom, i.e., 5 action values to control the robot. Is it legit to extend this action space to 6 values, using the sixth to manipulate the other 5? For example, if the sixth action value is bigger than 0.5, then the other action values should not be applied to the agent, etc. Do you know any research paper that has a similar approach? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 54 min )
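The gating idea in the question can be sketched in a few lines of Python. This is only one reading of the post: "not applied" is interpreted here as sending zero commands, and the function name and default threshold are hypothetical.

```python
def apply_gate(action, threshold=0.5):
    """Split a 6-value action into 5 joint commands plus 1 gate value.

    If the gate exceeds the threshold, the joint commands are suppressed
    (assumed here to mean replaced by zeros); otherwise they pass through.
    """
    if len(action) != 6:
        raise ValueError("expected 5 joint commands plus 1 gate value")
    *joint_commands, gate = action
    if gate > threshold:
        return [0.0] * 5
    return list(joint_commands)

blocked = apply_gate([0.1, 0.2, 0.3, 0.4, 0.5, 0.9])
passed = apply_gate([0.1, 0.2, 0.3, 0.4, 0.5, 0.2])
```

Keeping the gate inside an environment wrapper like this means the policy still outputs a plain 6-dimensional vector, so the learning algorithm itself needs no changes.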
  • Open

    DSC Weekly 17 January 2023 – The Creative Spark in AI
    Announcements The Creative Spark in AI The idea that AI can generate art that can mimic human artists came to the forefront of discussions about AI ethics in 2022. There are undoubtedly many legal and ethical issues to tackle in those cases. If a model is trained on thousands of examples of a specific artist’s… Read More »DSC Weekly 17 January 2023 – The Creative Spark in AI The post DSC Weekly 17 January 2023 – The Creative Spark in AI appeared first on Data Science Central.  ( 20 min )
    Python for Business Analytics: Top Benefits
    Companies and businesses today need modern programming tools in order to build the many, many advanced tools and solutions they need to keep their operations running seamlessly. So, what do companies use to build business analysis solutions? Python! Why? Because it is easy to learn, offers high-quality community support, etc. Python is an all-purpose programming… Read More »Python for Business Analytics: Top Benefits The post Python for Business Analytics: Top Benefits appeared first on Data Science Central.  ( 19 min )
    Mobile Biometric Solutions: Game-Changer in the Authentication Industry
    Mobile-based biometrics is a technology that allows users to authenticate themselves and access services using unique physical characteristics such as fingerprints, facial recognition, and iris scans. These biometric authentication methods have become increasingly popular in recent years due to their convenience and security. There are several types of smartphone-based biometrics technology currently available, including: Emerging… Read More »Mobile Biometric Solutions: Game-Changer in the Authentication Industry The post Mobile Biometric Solutions: Game-Changer in the Authentication Industry appeared first on Data Science Central.  ( 20 min )
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk?
    What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk, i.e., a ChatGPT killer? Recently, Demis Hassabis from DeepMind has been urging caution (“DeepMind’s CEO Helped Take AI Mainstream. Now He’s Urging Caution,” Time magazine/Davos). DeepMind also announced a new chat engine called Sparrow – supposedly a ChatGPT killer. Sparrow is not… Read More »What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? The post What to make of Deepmind’s Sparrow:  Is it a sparrow or a hawk? appeared first on Data Science Central.  ( 19 min )
    Preconditions for decoupled and decentralized data-centric systems
    During a presentation at the TechTarget/BrightTALK Accelerating Cloud Innovation event this past December, I named the fifth phase of compute, networking and storage that we’ve entered the “Decoupled” and “Decentralized” Cloud.The quotation marks emphasized that what we’ve been experiencing is neither truly decoupled nor decentralized, but even so, the direction we’re headed in is toward… Read More »Preconditions for decoupled and decentralized data-centric systems The post Preconditions for decoupled and decentralized data-centric systems appeared first on Data Science Central.  ( 21 min )
    5 Tips To Protect Yourself from Identity Theft in 2023
    Identity theft is the process of stealing personally identifiable information (PII) to either defraud the victim or make the victim a scapegoat in a large-scale cyberattack. Attackers gain access to sensitive information such as social security numbers and credit cards that are used to collate a person’s identity.   According to a report by the Federal… Read More »5 Tips To Protect Yourself from Identity Theft in 2023 The post 5 Tips To Protect Yourself from Identity Theft in 2023 appeared first on Data Science Central.  ( 22 min )
    What is a Good Net Promoter Score for the Hotel/Resort Industry?
    The hotel industry is competitive, and it is solely dependent on customer satisfaction. Customers are key.  The hotel industry knows this and the importance of the NPS score for customer satisfaction. A better NPS score means satisfied/loyal customers.  What hotels have in their control is the website user interface, menu, and providing a seamless customer… Read More »What is a Good Net Promoter Score for the Hotel/Resort Industry? The post What is a Good Net Promoter Score for the Hotel/Resort Industry? appeared first on Data Science Central.  ( 21 min )
    6 Benefits of Data Science for Your Business
    We would not discover a new planet if we claimed that modern business harnesses the power of data science. Data science is used for a variety of purposes in a variety of industries. Furthermore, we would like to discuss the benefits of Data science for business in general. But before that, let’s define what Data… Read More »6 Benefits of Data Science for Your Business The post 6 Benefits of Data Science for Your Business appeared first on Data Science Central.  ( 22 min )
    7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023
    By 2025, more than 80% of enterprises will shift from traditional data centers to the cloud or third-party colocation data centers. For most businesses, data is an irreplaceable asset and a key investment area for future growth. Virtual colocation is becoming the talk of how data centers are shifting to adapt to growing business environments.… Read More »7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 The post 7 Reasons Why Fast-Growing Businesses Are Turning to Virtual Colocation in 2023 appeared first on Data Science Central.  ( 20 min )
  • Open

    Set up Amazon SageMaker Studio with Jupyter Lab 3 using the AWS CDK
    Amazon SageMaker Studio is a fully integrated development environment (IDE) for machine learning (ML) partly based on JupyterLab 3. Studio provides a web-based interface to interactively perform ML development tasks required to prepare data and build, train, and deploy ML models. In Studio, you can load data, adjust ML models, move in between steps to adjust experiments, […]  ( 6 min )
    Churn prediction using multimodality of text and tabular features with Amazon SageMaker Jumpstart
    Amazon SageMaker JumpStart is the Machine Learning (ML) hub of SageMaker providing pre-trained, publicly available models for a wide range of problem types to help you get started with machine learning. Understanding customer behavior is top of mind for every business today. Gaining insights into why and how customers buy can help grow revenue. Customer churn is […]  ( 14 min )
  • Open

    🚀Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
  • Open

    NVIDIA and Dell Technologies Expand AI Portfolio
    In their largest-ever joint AI initiative, NVIDIA and Dell Technologies today launched a wave of Dell PowerEdge systems available with NVIDIA acceleration, enabling enterprises to efficiently transform their businesses with AI. A total of 15 next-generation Dell PowerEdge systems can draw from NVIDIA’s full AI stack — including GPUs, DPUs and the NVIDIA AI Enterprise Read article >  ( 5 min )
  • Open

    Special primality proofs
    I’ve written lately about two general ways to prove that a number is prime: Pratt certificates for moderately-large primes and elliptic curve certificates for very large primes. If you can say more about the prime you wish to certify, there may be special forms of certificates that are more efficient. In particular, there are efficient […] Special primality proofs first appeared on John D. Cook.  ( 5 min )
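For readers unfamiliar with the Pratt certificates mentioned above, here is a minimal Python sketch of a verifier. It assumes the usual formulation: a certificate for a prime p gives a witness a and the prime factors of p - 1, such that a^(p-1) ≡ 1 (mod p) and a^((p-1)/q) ≢ 1 (mod p) for each prime factor q, with each q certified recursively. The dictionary layout and the function name are illustrative choices, not taken from the post.

```python
def verify_pratt(p, cert):
    """Check a Pratt certificate.

    cert maps each prime to (witness, [distinct prime factors of p - 1]).
    The prime 2 is accepted as the base case without a certificate entry.
    """
    if p == 2:
        return True
    if p not in cert:
        return False
    witness, factors = cert[p]
    distinct = set(factors)
    # The listed factors must account for all of p - 1.
    remainder = p - 1
    for q in distinct:
        if q < 2 or remainder % q != 0:
            return False
        while remainder % q == 0:
            remainder //= q
    if remainder != 1:
        return False
    # The witness must have multiplicative order exactly p - 1 modulo p.
    if pow(witness, p - 1, p) != 1:
        return False
    if any(pow(witness, (p - 1) // q, p) == 1 for q in distinct):
        return False
    # Every claimed prime factor needs its own valid certificate.
    return all(verify_pratt(q, cert) for q in distinct)

# 7 - 1 = 2 * 3 with witness 3; 3 - 1 = 2 with witness 2.
cert = {7: (3, [2, 3]), 3: (2, [2])}
is_prime_7 = verify_pratt(7, cert)
```

Verification needs only modular exponentiations, which is why such certificates are cheap to check even though finding the factorization of p - 1 may be expensive.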
  • Open

    Natural Language Processing of Aviation Occurrence Reports for Safety Management. (arXiv:2301.05663v1 [cs.CL])
    Occurrence reporting is a commonly used method in safety management systems to obtain insight in the prevalence of hazards and accident scenarios. In support of safety data analysis, reports are often categorized according to a taxonomy. However, the processing of the reports can require significant effort from safety analysts and a common problem is interrater variability in labeling processes. Also, in some cases, reports are not processed according to a taxonomy, or the taxonomy does not fully cover the contents of the documents. This paper explores various Natural Language Processing (NLP) methods to support the analysis of aviation safety occurrence reports. In particular, the problems studied are the automatic labeling of reports using a classification model, extracting the latent topics in a collection of texts using a topic model and the automatic generation of probable cause texts. Experimental results showed that (i) under the right conditions the labeling of occurrence reports can be effectively automated with a transformer-based classifier, (ii) topic modeling can be useful for finding the topics present in a collection of reports, and (iii) using a summarization model can be a promising direction for generating probable cause texts.  ( 2 min )
    Short-time SSVEP data extension by a novel generative adversarial networks based framework. (arXiv:2301.05599v1 [q-bio.NC])
    Steady-state visual evoked potential (SSVEP)-based brain-computer interfaces (BCIs) have received considerable attention due to their high transfer rate and the number of available targets. However, the performance of frequency identification methods hinges heavily on the amount of user calibration data and the data length, which hinders deployment in real-world applications. Recently, generative adversarial network (GAN)-based data generation methods have been widely adopted to create supplementary synthetic electroencephalography (EEG) data and hold promise for addressing these issues. In this paper, we propose a GAN-based end-to-end signal transformation network for data length window extension, termed TEGAN. TEGAN transforms short-time SSVEP signals into long-time artificial SSVEP signals. By incorporating a novel U-Net generator architecture and an auxiliary classifier into the network design, TEGAN can produce conditioned features in the synthetic data. Additionally, to regularize the training process of the GAN, we introduce a two-stage training strategy and the LeCam-divergence regularization term during the network implementation. The proposed TEGAN was evaluated on two public SSVEP datasets. With the assistance of TEGAN, the performance of traditional frequency recognition methods and deep learning-based methods was significantly improved under limited calibration data. This study substantiates the feasibility of the proposed method to extend the data length for short-time SSVEP signals to develop a high-performance BCI system. The proposed GAN-based methods have great potential for shortening the calibration time for various real-world BCI-based applications, while the novelty of our augmentation strategies sheds some valuable light on understanding the subject-invariant properties of SSVEPs.  ( 2 min )
    Sparse deep neural networks for modeling aluminum electrolysis dynamics. (arXiv:2209.05832v2 [physics.chem-ph] UPDATED)
    Deep neural networks have become very popular in modeling complex nonlinear processes due to their extraordinary ability to fit arbitrary nonlinear functions from data with minimal expert intervention. However, they are almost always overparameterized and challenging to interpret due to their internal complexity. Furthermore, the optimization process to find the learned model parameters can be unstable because it can get stuck in local minima. In this work, we demonstrate the value of sparse regularization techniques to significantly reduce the model complexity. We demonstrate this for the case of an aluminium extraction process, which is a highly nonlinear system with many interrelated subprocesses. We trained a densely connected deep neural network to model the process and then compared the effects of sparsity-promoting l1 regularization on generalizability, interpretability, and training stability. We found that the regularization significantly reduces model complexity compared to a corresponding dense neural network. We argue that this makes the model more interpretable, and show that training an ensemble of sparse neural networks with different parameter initializations often converges to similar model structures with similar learned input features. Furthermore, the empirical study shows that the resulting sparse models generalize better from small training sets than their dense counterparts.  ( 2 min )
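    To make the effect of sparsity-promoting l1 regularization concrete, the following is a minimal sketch on a linear model rather than the deep network studied in the paper. It swaps in proximal gradient descent (ISTA) with soft-thresholding, a standard way to minimize a squared loss plus an l1 penalty; the synthetic data, the function names, and the penalty weight `lam` are all illustrative assumptions.

```python
import numpy as np


def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink every entry toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)


def fit_l1(X, y, lam, n_iter=500):
    """Minimize (1 / 2n) * ||X w - y||^2 + lam * ||w||_1 with ISTA."""
    n, d = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the smooth part.
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w


# Hypothetical data: only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[0], w_true[3] = 2.0, -3.0
y = X @ w_true + 0.01 * rng.standard_normal(200)

w_dense = fit_l1(X, y, lam=0.0)   # no penalty: every weight ends up nonzero
w_sparse = fit_l1(X, y, lam=0.1)  # l1 penalty: irrelevant weights are driven to zero

print("nonzero weights without l1:", np.count_nonzero(w_dense))
print("nonzero weights with l1:   ", np.flatnonzero(w_sparse).tolist())
```

    The surviving indices show directly which inputs the model uses, which is the kind of interpretability gain the abstract argues for; in a neural network the same penalty would be applied to the weight matrices.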
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v2 [cs.LG] UPDATED)
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. En route we also derive a privacy filter for approximate zCDP and approximate RDP. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
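    As a point of reference for the gap the abstract describes, the sketch below compares basic composition with the standard advanced composition bound for k mechanisms that are each (eps, delta)-differentially private, and adds the simplest possible privacy filter, one that just sums adaptively chosen parameters. This is not the filter constructed in the paper; the class and function names are hypothetical.

```python
import math


def basic_composition(eps, delta, k):
    """k-fold basic composition: the privacy parameters add up linearly."""
    return k * eps, k * delta


def advanced_composition(eps, delta, k, delta_prime):
    """Standard advanced composition bound, valid for any delta_prime in (0, 1)."""
    eps_total = (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1))
    return eps_total, k * delta + delta_prime


class BasicCompositionFilter:
    """Baseline filter: admit a query only while the summed parameters fit the budget."""

    def __init__(self, eps_budget, delta_budget):
        self.eps_budget = eps_budget
        self.delta_budget = delta_budget
        self.eps_spent = 0.0
        self.delta_spent = 0.0

    def try_spend(self, eps, delta):
        """Return True and record the cost if the query fits, else return False."""
        if (self.eps_spent + eps > self.eps_budget
                or self.delta_spent + delta > self.delta_budget):
            return False
        self.eps_spent += eps
        self.delta_spent += delta
        return True


basic_eps, _ = basic_composition(0.1, 1e-6, 100)
adv_eps, _ = advanced_composition(0.1, 1e-6, 100, delta_prime=1e-6)
print(f"basic eps = {basic_eps:.2f}, advanced eps = {adv_eps:.2f}")
```

    The baseline filter lets the analyst pick each query's parameters after seeing earlier answers, but it only achieves the linear rate of basic composition; the abstract's claim is that filters can be built that recover the tighter square-root rate as well.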
    NRBdMF: A recommendation algorithm for predicting drug effects considering directionality. (arXiv:2208.04312v2 [q-bio.QM] UPDATED)
    Predicting the novel effects of drugs based on information about approved drugs can be regarded as a recommendation system. Matrix factorization is one of the most used recommendation systems and various algorithms have been devised for it. A literature survey and summary of existing algorithms for predicting drug effects demonstrated that most such methods, including neighborhood regularized logistic matrix factorization, which was the best performer in benchmark tests, used a binary matrix that considers only the presence or absence of interactions. However, drug effects are known to have two opposite aspects, such as side effects and therapeutic effects. In the present study, we proposed using neighborhood regularized bidirectional matrix factorization (NRBdMF) to predict drug effects by incorporating bidirectionality, which is a characteristic property of drug effects. We used this proposed method for predicting side effects using a matrix that considered the bidirectionality of drug effects, in which known side effects were assigned a positive label (plus 1) and known treatment effects were assigned a negative (minus 1) label. The NRBdMF model, which utilizes drug bidirectional information, achieved enrichment of side effects at the top and indications at the bottom of the prediction list. This first attempt to consider the bidirectional nature of drug effects using NRBdMF showed that it reduced false positives and produced a highly interpretable output.  ( 2 min )
    Are disentangled representations all you need to build speaker anonymization systems?. (arXiv:2208.10497v3 [cs.SD] UPDATED)
    Speech signals contain a lot of sensitive information, such as the speaker's identity, which raises privacy concerns when speech data get collected. Speaker anonymization aims to transform a speech signal to remove the source speaker's identity while leaving the spoken content unchanged. Current methods perform the transformation by relying on content/speaker disentanglement and voice conversion. Usually, an acoustic model from an automatic speech recognition system extracts the content representation while an x-vector system extracts the speaker representation. Prior work has shown that the extracted features are not perfectly disentangled. This paper tackles how to improve feature disentanglement, and thus the converted anonymized speech. We propose enhancing the disentanglement by removing speaker information from the acoustic model using vector quantization. Evaluation using the VoicePrivacy 2022 toolkit showed that vector quantization helps conceal the original speaker identity while maintaining utility for speech recognition.  ( 2 min )
    Locating and Editing Factual Associations in GPT. (arXiv:2202.05262v5 [cs.CL] UPDATED)
    We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/  ( 2 min )
    On the feasibility of attacking Thai LPR systems with adversarial examples. (arXiv:2301.05506v1 [cs.CR])
    Recent advances in deep neural networks (DNNs) have significantly enhanced the capabilities of optical character recognition (OCR) technology, enabling its adoption in a wide range of real-world applications. Despite this success, DNN-based OCR has been shown to be vulnerable to adversarial attacks, in which the adversary can influence the DNN model's prediction by carefully manipulating input to the model. Prior work has demonstrated the security impacts of adversarial attacks on various OCR languages. However, to date, no studies have been conducted and evaluated on an OCR system tailored specifically for the Thai language. To bridge this gap, this work presents a feasibility study of performing adversarial attacks on a specific Thai OCR application -- Thai License Plate Recognition (LPR). Moreover, we propose a new type of adversarial attack based on the semi-targeted scenario and show that this scenario is highly realistic in LPR applications. Our experimental results show the feasibility of our attacks, as they can be performed on a commodity desktop computer with an attack success rate of over 90%.  ( 2 min )
    An Offset-Free Nonlinear MPC scheme for systems learned by Neural NARX models. (arXiv:2203.16290v4 [eess.SY] UPDATED)
    This paper deals with the design of nonlinear MPC controllers that provide offset-free setpoint tracking for models described by Neural Nonlinear AutoRegressive eXogenous (NNARX) networks. The NNARX model is identified from input-output data collected from the plant, and can be given a state-space representation with known measurable states made up of past input and output variables, so that a state observer is not required. In the training phase, the Incremental Input-to-State Stability (δISS) property can be enforced when consistent with the behavior of the plant. The δISS property is then leveraged to augment the model with an explicit integral action on the output tracking error, which gives the designed control scheme offset-free tracking capabilities. The proposed control architecture is numerically tested on a water heating system and the results are compared to those achieved by another popular offset-free MPC method, showing that the proposed scheme attains remarkable performance even in the presence of disturbances acting on the plant.  ( 2 min )
    Discrete Morse Sandwich: Fast Computation of Persistence Diagrams for Scalar Data -- An Algorithm and A Benchmark. (arXiv:2206.13932v2 [cs.LG] UPDATED)
    This paper introduces an efficient algorithm for persistence diagram computation, given an input piecewise linear scalar field $f$ defined on a $d$-dimensional simplicial complex $K$, with $d \leq 3$. Our work revisits the seminal algorithm "PairSimplices" [31], [103] with discrete Morse theory (DMT) [34], [80], which greatly reduces the number of input simplices to consider. Further, we also extend to DMT and accelerate the stratification strategy described in "PairSimplices" for the fast computation of the $0^{th}$ and $(d - 1)^{th}$ diagrams, noted $D_0(f)$ and $D_{d-1}(f)$. Minima-saddle persistence pairs ($D_0(f)$) and saddle-maximum persistence pairs ($D_{d-1}(f)$) are efficiently computed by processing, with a Union-Find, the unstable sets of $1$-saddles and the stable sets of $(d - 1)$-saddles. This fast pre-computation for the dimensions $0$ and $(d - 1)$ enables an aggressive specialization of [4] to the 3D case, which results in a drastic reduction of the number of input simplices for the computation of $D_1(f)$, the intermediate layer of the sandwich. Finally, we document several performance improvements via shared-memory parallelism. We provide an open-source implementation of our algorithm for reproducibility purposes. We also contribute a reproducible benchmark package, which exploits three-dimensional data from a public repository and compares our algorithm to a variety of publicly available implementations. Extensive experiments indicate that our algorithm improves by two orders of magnitude the time performance of the seminal "PairSimplices" algorithm it extends. Moreover, it also improves memory footprint and time performance over a selection of 14 competing approaches, with a substantial gain over the fastest available approaches, while producing a strictly identical output.  ( 3 min )
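    The role of the Union-Find structure in computing $D_0(f)$ can be illustrated with the classic sublevel-set procedure on a graph: sweep vertices by increasing value, and whenever an edge joins two components, the one with the higher minimum dies (the elder rule). The sketch below is this textbook procedure, not the paper's algorithm, which works on unstable sets of 1-saddles extracted with discrete Morse theory; the function name and the toy input are assumptions.

```python
import math


def zero_dim_persistence(values, edges):
    """Return the 0-dimensional persistence pairs (birth, death) of a scalar
    field given on graph vertices, using Union-Find and the elder rule."""
    n = len(values)
    adjacency = [[] for _ in range(n)]
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)

    parent = list(range(n))
    birth = list(values)  # birth[root] = lowest value in that component

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    processed = [False] * n
    pairs = []
    for v in sorted(range(n), key=lambda i: values[i]):
        processed[v] = True
        for u in adjacency[v]:
            if not processed[u]:
                continue
            rv, ru = find(v), find(u)
            if rv == ru:
                continue
            # Elder rule: the component with the higher minimum dies at values[v].
            elder, younger = (rv, ru) if birth[rv] <= birth[ru] else (ru, rv)
            if birth[younger] < values[v]:  # skip zero-persistence pairs
                pairs.append((birth[younger], values[v]))
            parent[younger] = elder

    # Components that never merge persist forever.
    for root in {find(v) for v in range(n)}:
        pairs.append((birth[root], math.inf))
    return sorted(pairs)


# Toy field on a path graph with minima at vertices 0, 2 and 4.
field = [0.0, 3.0, 1.0, 4.0, 2.0]
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
diagram = zero_dim_persistence(field, path_edges)
print(diagram)
```

    Each finite pair matches a local minimum with the saddle value at which its component merges into an older one, which is the minimum-saddle pairing the abstract refers to.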
    Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces. (arXiv:2301.05525v1 [cs.LG])
    Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims at identifying non-overlapping groups of design instances that are similar in a joint space of all features, but which are also similar when considering only subsets of features. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well separated groups of data instances, but also provide non-overlapping groups of data that persist when considering pre-defined feature subsets separately. In this work, we propose to view concept identification as a special form of clustering algorithm with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in identified solutions. In addition, we introduce the mutual information measure as a metric to evaluate whether solutions return consistent clusters across relevant subsets. To support the novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.  ( 2 min )
    Competing Bandits in Time Varying Matching Markets. (arXiv:2210.11692v2 [cs.LG] UPDATED)
    We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has a fixed, known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying and unknown. Our contribution is a methodology that can handle any type of preference structure and variation scenario. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of $\widetilde{\mathcal{O}}(L_T^{1/2} T^{1/2})$, where $L_T$ is the number of changes in the underlying preferences of the agents. Therefore, we show that the optimal rates for single-agent learning can be achieved in spite of the competition, up to a constant factor. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.  ( 2 min )
    OpenTwins: An open-source framework for the design, development and integration of effective 3D-IoT-AI-powered digital twins. (arXiv:2301.05560v1 [cs.SE])
    Although digital twins have recently emerged as a clear alternative for reliable asset representations, most of the solutions and tools available for the development of digital twins are tailored to specific environments. Furthermore, achieving reliable digital twins often requires the orchestration of technologies and paradigms such as machine learning, the Internet of Things, and 3D visualization, which are rarely seamlessly aligned. In this paper, we present a generic framework for the development of effective digital twins combining some of the aforementioned areas. In this open framework, digital twins can be easily developed and orchestrated with 3D connected visualizations, IoT data streams, and real-time machine-learning predictions. To demonstrate the feasibility of the framework, a use case in the Petrochemical Industry 4.0 has been developed.  ( 2 min )
    Designing losses for data-free training of normalizing flows on Boltzmann distributions. (arXiv:2301.05475v1 [cs.LG])
    Generating a Boltzmann distribution in high dimension has recently been achieved with Normalizing Flows, which enable fast and exact computation of the generated density, and thus unbiased estimation of expectations. However, current implementations rely on accurate training data, which typically comes from computationally expensive simulations. There is therefore a clear incentive to train models with incomplete or no data by relying solely on the target density, which can be obtained from a physical energy model (up to a constant factor). For that purpose, we analyze the properties of standard losses based on Kullback-Leibler divergences. We showcase their limitations, in particular a strong propensity for mode collapse during optimization on high-dimensional distributions. We then propose strategies to alleviate these issues, most importantly a new loss function well-grounded in theory and with suitable optimization properties. Using as a benchmark the generation of 3D molecular configurations, we show on several tasks that, for the first time, imperfect pre-trained models can be further optimized in the absence of training data.  ( 2 min )
    Composite model of seismic monitoring data analysis during mining operations on the example of the Kukisvumchorrskoye deposit of JSC Apatit. (arXiv:2301.05701v1 [physics.geo-ph])
    Geomechanical monitoring of a rock massif is an actively developing branch of geomechanics. It is almost impossible to single out a methodology and approaches for data collection and analysis in developing seismic monitoring systems. In the process of mining in a rock massif, changes in the state of structural inhomogeneities are most clearly manifested. Existing natural structural inhomogeneities are revealed, there are movements in discontinuous disturbances, and new technogenic disturbances are formed, which are accompanied by a change in the natural stress state of various blocks of the massif. An important task is to develop a mining forecasting model that can take into account the structural heterogeneity of the rock massif and select the necessary forecast horizon depending on monitoring data. The developed method of evaluating the results of monitoring geomechanical processes in the rock massif allowed us to forecast zones of possible rock bursts.  ( 2 min )
    A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. (arXiv:2301.05339v1 [cs.GR])
    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech, in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.  ( 2 min )
    Communication-Efficient Distributionally Robust Decentralized Learning. (arXiv:2205.15614v3 [cs.LG] UPDATED)
    Decentralized learning algorithms empower interconnected devices to share data and computational resources to collaboratively train a machine learning model without the aid of a central coordinator. In the case of heterogeneous data distributions at the network nodes, collaboration can yield predictors with unsatisfactory performance for a subset of the devices. For this reason, in this work, we consider the formulation of a distributionally robust decentralized learning task and we propose a decentralized single loop gradient descent/ascent algorithm (AD-GDA) to directly solve the underlying minimax optimization problem. We render our algorithm communication-efficient by employing a compressed consensus scheme and we provide convergence guarantees for smooth convex and non-convex loss functions. Finally, we corroborate the theoretical findings with empirical results that highlight AD-GDA's ability to provide unbiased predictors and to greatly improve communication efficiency compared to existing distributionally robust algorithms.  ( 2 min )
    Personalized Prompt Learning for Explainable Recommendation. (arXiv:2202.07371v2 [cs.IR] UPDATED)
    Providing user-understandable explanations to justify recommendations could help users better understand the recommended items, increase the system's ease of use, and gain users' trust. A typical way to achieve this is natural language generation. However, previous works mostly adopt recurrent neural networks to this end, leaving the potentially more effective pre-trained Transformer models under-explored. In fact, user and item IDs, as important identifiers in recommender systems, inherently lie in a different semantic space from the words that pre-trained models were trained on. Thus, how to effectively fuse IDs into such models becomes a critical issue. Inspired by recent advances in prompt learning, we come up with two solutions: find alternative words to represent IDs (called discrete prompt learning), and directly input ID vectors to a pre-trained model (termed continuous prompt learning). In the latter case, ID vectors are randomly initialized but the model is trained in advance on large corpora, so they are actually in different learning stages. To bridge the gap, we further propose two training strategies: sequential tuning and recommendation as regularization. Extensive experiments show that our continuous prompt learning approach equipped with the training strategies consistently outperforms strong baselines on three datasets of explainable recommendation.  ( 2 min )
    On the Symmetries of Deep Learning Models and their Internal Representations. (arXiv:2205.14258v4 [cs.LG] UPDATED)
    Symmetry is a fundamental tool in the exploration of a broad range of complex systems. In machine learning, symmetry has been explored in both models and data. In this paper, we seek to connect the symmetries arising from the architecture of a family of models with the symmetries of that family's internal representation of data. We do this by calculating a set of fundamental symmetry groups, which we call the intertwiner groups of the model. We connect intertwiner groups to a model's internal representations of data through a range of experiments that probe similarities between hidden states across models with the same architecture. Our work suggests that the symmetries of a network are propagated into the symmetries in that network's representation of data, providing us with a better understanding of how architecture affects the learning and prediction process. Finally, we speculate that for ReLU networks, the intertwiner groups may provide a justification for the common practice of concentrating model interpretability exploration on the activation basis in hidden layers rather than arbitrary linear combinations thereof.  ( 2 min )
    An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms. (arXiv:2208.03247v2 [cs.LG] UPDATED)
    In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. As for the critic, we consider using TD-learning with linear function approximation and off-policy sampling. Since it is well-known that in this setting TD-learning can be unstable, we propose a stable generic algorithm (including two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step return and generalized importance sampling factors, and provide the finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.  ( 2 min )
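    The link between natural policy gradient and policy iteration can be seen in the commonly used tabular softmax form of the update, where the new policy is proportional to the old policy times exp(eta * Q): as the stepsize eta grows, the update approaches a greedy policy-improvement step. The sketch below uses exact policy evaluation on a tiny hypothetical MDP instead of the paper's off-policy TD critic, and the doubling stepsize schedule is an illustrative assumption.

```python
import numpy as np


def evaluate_q(P, R, policy, gamma):
    """Exact Q-values of `policy`. P has shape (S, A, S); R and policy have shape (S, A)."""
    n_states = R.shape[0]
    P_pi = np.einsum("sa,sat->st", policy, P)  # state-to-state transitions under the policy
    r_pi = (policy * R).sum(axis=1)            # expected one-step reward under the policy
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return R + gamma * P @ v


def npg_step(log_policy, q, eta):
    """Tabular softmax NPG update in log space: new policy is proportional to old * exp(eta * q)."""
    logits = log_policy + eta * q
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


# Hypothetical 2-state, 2-action MDP: action 1 always leads to state 1,
# action 0 always leads to state 0, and only (state 1, action 1) is rewarded.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 0] = 1.0
P[0, 1, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.9

log_policy = np.log(np.full((2, 2), 0.5))  # start from the uniform policy
for t in range(10):
    q = evaluate_q(P, R, np.exp(log_policy), gamma)
    log_policy = npg_step(log_policy, q, eta=2.0 ** t)  # increasing stepsizes

final_policy = np.exp(log_policy)
print(final_policy.round(3))
```

    With small stepsizes the same update behaves like a soft, incremental improvement; the increasing schedule is what makes each step look more and more like the argmax in policy iteration.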
    Learning to Control and Coordinate Hybrid Traffic Through Robot Vehicles at Complex and Unsignalized Intersections. (arXiv:2301.05294v1 [cs.LG])
    Intersections are essential road infrastructures for traffic in modern metropolises; however, they can also be the bottleneck of traffic flows due to traffic incidents or the absence of traffic coordination mechanisms such as traffic lights. Thus, various control and coordination mechanisms that are beyond traditional control methods have been proposed to improve the efficiency of intersection traffic. Amongst these methods, the control of foreseeable hybrid traffic that consists of human-driven vehicles (HVs) and robot vehicles (RVs) has recently emerged. We propose a decentralized reinforcement learning approach for the control and coordination of hybrid traffic at real-world, complex intersections--a topic that has not been previously explored. Comprehensive experiments are conducted to show the effectiveness of our approach. In particular, we show that using 5% RVs, we can prevent congestion formation inside the intersection under the actual traffic demand of 700 vehicles per hour. In contrast, without RVs, congestion starts to develop when the traffic demand reaches as low as 200 vehicles per hour. Further performance gains (reduced waiting time of vehicles at the intersection) are obtained as the RV penetration rate increases. When RVs make up more than 50% of traffic, our method starts to outperform traffic signals on the average waiting time of all vehicles at the intersection. Our method is also robust against both blackout events and sudden RV percentage drops, and enjoys excellent generalizability, which is illustrated by its successful deployment in two unseen intersections.  ( 2 min )
    confidence-planner: Easy-to-Use Prediction Confidence Estimation and Sample Size Planning. (arXiv:2301.05702v1 [stat.ME])
    Machine learning applications, especially in the fields of medicine and social sciences, are slowly being subjected to increasing scrutiny. Similarly to sample size planning performed in clinical and social studies, lawmakers and funding agencies may expect statistical uncertainty estimations in machine learning applications that impact society. In this paper, we present an easy-to-use Python package and web application for estimating prediction confidence intervals. The package offers eight different procedures to determine and justify the sample size and confidence of predictions from holdout, bootstrap, cross-validation, and progressive validation experiments. Since the package builds directly on established data analysis libraries, it seamlessly integrates into preprocessing and exploratory data analysis steps. Code related to this paper is available at: https://github.com/dabrze/confidence-planner.  ( 2 min )
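    For readers unfamiliar with the underlying idea, the sketch below shows one textbook procedure for a holdout experiment: a normal-approximation confidence interval for accuracy and the matching sample size calculation. It uses only the Python standard library and does not reproduce the API of confidence-planner, nor is it necessarily one of the eight procedures the package offers; all function names here are hypothetical.

```python
import math
from statistics import NormalDist


def holdout_accuracy_interval(accuracy, n_test, confidence=0.95):
    """Normal-approximation interval for accuracy measured on n_test holdout samples."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_test)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


def required_test_size(margin, confidence=0.95, accuracy_guess=0.5):
    """Smallest holdout size whose interval half-width is at most `margin`.

    accuracy_guess=0.5 is the worst case, since it maximizes p * (1 - p).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * accuracy_guess * (1 - accuracy_guess) / margin ** 2)


low, high = holdout_accuracy_interval(accuracy=0.90, n_test=1000)
print(f"95% interval for 90% accuracy on 1000 samples: [{low:.3f}, {high:.3f}]")
print("holdout size for a +/-2 point margin:", required_test_size(0.02))
```

    The normal approximation becomes unreliable for small test sets or accuracies near 0 or 1, which is one reason a tool of this kind would offer several alternative procedures.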
    On the infinite-depth limit of finite-width neural networks. (arXiv:2210.00688v3 [stat.ML] UPDATED)
    In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing change-of-regime phenomenon in the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare it with the more commonly studied infinite-width-then-infinite-depth limit.  ( 2 min )
    TarGF: Learning Target Gradient Field for Object Rearrangement. (arXiv:2209.00853v3 [cs.LG] UPDATED)
    Object Rearrangement is to move objects from an initial state to a goal state. Here, we focus on a more practical setting in object rearrangement, i.e., rearranging objects from shuffled layouts to a normative target distribution without explicit goal specification. However, it remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms to address the task. This paper aims to search for a policy only with a set of examples from a target distribution instead of a handcrafted reward function. We employ the score-matching objective to train a Target Gradient Field (TarGF), indicating a direction on each object to increase the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) For model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) For model-free reinforcement learning, the TarGF is not only used for estimating the likelihood-change as a reward but also provides suggested actions in residual policy learning. Experimental results in ball and room rearrangement demonstrate that our method significantly outperforms the state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.  ( 2 min )
    Mutation Testing of Deep Reinforcement Learning Based on Real Faults. (arXiv:2301.05651v1 [cs.LG])
    Testing Deep Learning (DL) systems is a complex task as they do not behave like traditional systems would, notably because of their stochastic nature. Nonetheless, being able to adapt existing testing techniques such as Mutation Testing (MT) to DL settings would greatly improve their potential verifiability. While some efforts have been made to extend MT to the Supervised Learning paradigm, little work has gone into extending it to Reinforcement Learning (RL) which is also an important component of the DL ecosystem but behaves very differently from SL. This paper builds on the existing approach of MT in order to propose a framework, RLMutation, for MT applied to RL. Notably, we use existing taxonomies of faults to build a set of mutation operators relevant to RL and use a simple heuristic to generate test cases for RL. This allows us to compare different mutation killing definitions based on existing approaches, as well as to analyze the behavior of the obtained mutation operators and their potential combinations called Higher Order Mutation(s) (HOM). We show that the design choice of the mutation killing definition can affect whether or not a mutation is killed as well as the generated test cases. Moreover, we found that even with a relatively small number of test cases and operators we manage to generate HOM with interesting properties which can enhance testing capability in RL systems.  ( 2 min )
    Universally Expressive Communication in Multi-Agent Reinforcement Learning. (arXiv:2206.06758v3 [cs.MA] UPDATED)
    Allowing agents to share information through communication is crucial for solving complex tasks in multi-agent reinforcement learning. In this work, we consider the question of whether a given communication protocol can express an arbitrary policy. By observing that many existing protocols can be viewed as instances of graph neural networks (GNNs), we demonstrate the equivalence of joint action selection to node labelling. With standard GNN approaches provably limited in their expressive capacity, we draw from existing GNN literature and consider augmenting agent observations with: (1) unique agent IDs and (2) random noise. We provide a theoretical analysis as to how these approaches yield universally expressive communication, and also prove them capable of targeting arbitrary sets of actions for identical agents. Empirically, these augmentations are found to improve performance on tasks where expressive communication is required, whilst, in general, the optimal communication protocol is found to be task-dependent.  ( 2 min )
    MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control. (arXiv:2208.07363v3 [cs.RO] UPDATED)
    Simulated humanoids are an appealing research domain due to their physical capabilities. Nonetheless, they are also challenging to control, as a policy must drive an unstable, discontinuous, and high-dimensional physical system. One widely studied approach is to utilize motion capture (MoCap) data to teach the humanoid agent low-level skills (e.g., standing, walking, and running) that can then be re-used to synthesize high-level behaviors. However, even with MoCap data, controlling simulated humanoids remains very hard, as MoCap data offers only kinematic information. Finding physical control inputs to realize the demonstrated motions requires computationally intensive methods like reinforcement learning. Thus, despite the publicly available MoCap data, its utility has been limited to institutions with large-scale compute. In this work, we dramatically lower the barrier for productive research on this topic by training and releasing high-quality agents that can track over three hours of MoCap data for a simulated humanoid in the dm_control physics-based environment. We release MoCapAct (Motion Capture with Actions), a dataset of these expert agents and their rollouts, which contain proprioceptive observations and actions. We demonstrate the utility of MoCapAct by using it to train a single hierarchical policy capable of tracking the entire MoCap dataset within dm_control and show the learned low-level component can be re-used to efficiently learn downstream high-level tasks. Finally, we use MoCapAct to train an autoregressive GPT model and show that it can control a simulated humanoid to perform natural motion completion given a motion prompt. Videos of the results and links to the code and dataset are available at https://microsoft.github.io/MoCapAct.  ( 2 min )
    Explicit Temporal Embedding in Deep Generative Latent Models for Longitudinal Medical Image Synthesis. (arXiv:2301.05465v1 [cs.CV])
    Medical imaging plays a vital role in modern diagnostics and treatment. The temporal nature of disease or treatment progression often results in longitudinal data. Due to the cost and potential harm, acquiring large medical datasets necessary for deep learning can be difficult. Medical image synthesis could help mitigate this problem. However, until now, the availability of GANs capable of synthesizing longitudinal volumetric data has been limited. To address this, we use the recent advances in latent space-based image editing to propose a novel joint learning scheme to explicitly embed temporal dependencies in the latent space of GANs. This, in contrast to previous methods, allows us to synthesize continuous, smooth, and high-quality longitudinal volumetric data with limited supervision. We show the effectiveness of our approach on three datasets containing different longitudinal dependencies. Namely, modeling a simple image transformation, breathing motion, and tumor regression, all while showing minimal disentanglement. The implementation is made available online at https://github.com/julschoen/Temp-GAN.  ( 2 min )
    Feature Importance Guided Attack: A Model Agnostic Adversarial Attack. (arXiv:2106.14815v3 [cs.LG] UPDATED)
    Research in adversarial learning has primarily focused on homogeneous unstructured datasets, which often map into the problem space naturally. Inverting a feature space attack on heterogeneous datasets into the problem space is much more challenging, particularly the task of finding the perturbation to perform. This work presents a formal search strategy: the 'Feature Importance Guided Attack' (FIGA), which finds perturbations in the feature space of heterogeneous tabular datasets to produce evasion attacks. We first demonstrate FIGA in the feature space and then in the problem space. FIGA assumes no prior knowledge of the defending model's learning algorithm and does not require any gradient information. FIGA assumes knowledge of the feature representation and the mean feature values of the defending model's dataset. FIGA leverages feature importance rankings by perturbing the most important features of the input in the direction of the target class. While FIGA is conceptually similar to other work which uses feature selection processes (e.g., mimicry attacks), we formalize an attack algorithm with three tunable parameters and investigate the strength of FIGA on tabular datasets. We demonstrate the effectiveness of FIGA by evading phishing detection models trained on four different tabular phishing datasets and one financial dataset with an average success rate of 94%. We extend FIGA to the phishing problem space by limiting the possible perturbations to be valid and feasible in the phishing domain. We generate valid adversarial phishing sites that are visually identical to their unperturbed counterparts and use them to attack six tabular ML models, achieving a 13.05% average success rate.  ( 2 min )
    Adam Can Converge Without Any Modification On Update Rules. (arXiv:2208.09632v5 [cs.LG] UPDATED)
    Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to the neighborhood of critical points. The size of the neighborhood is proportional to the variance of stochastic gradients. Under an extra condition (strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$ including $\beta_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $\beta_2$ is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $\beta_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.  ( 3 min )
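    As a rough illustration of the hyperparameter condition quoted in the abstract above, the sketch below tests only the inequality $\beta_1 < \sqrt{\beta_2} < 1$. The function name `satisfies_beta_inequality` is hypothetical, and the check deliberately ignores the additional requirement that $\beta_2$ be "large", since the abstract gives no threshold for it (it presumably depends on the problem). Passing this check is therefore not a convergence guarantee.

```python
import math


def satisfies_beta_inequality(beta1: float, beta2: float) -> bool:
    """Return True if beta1 < sqrt(beta2) < 1 (hypothetical helper).

    This covers only the inequality stated in the abstract. It does NOT
    test the separate condition that beta2 is "large enough", for which
    the abstract gives no explicit threshold.
    """
    # Both coefficients must lie in [0, 1); beta2 < 1 implies sqrt(beta2) < 1.
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        return False
    return beta1 < math.sqrt(beta2)


if __name__ == "__main__":
    # Common library defaults (beta1=0.9, beta2=0.999) satisfy the inequality.
    print(satisfies_beta_inequality(0.9, 0.999))
    # A small beta2 paired with beta1=0.9 does not.
    print(satisfies_beta_inequality(0.9, 0.5))
```

    The second call fails the check because $\sqrt{0.5} \approx 0.707$ is below $0.9$, which matches the abstract's remark that small $\beta_2$ is where divergence can occur.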
    Hierarchical Deep Q-Learning Based Handover in Wireless Networks with Dual Connectivity. (arXiv:2301.05391v1 [cs.NI])
    5G New Radio proposes the usage of frequencies above 10 GHz to speed up LTE's existing maximum data rates. However, the effective size of 5G antennas, and consequently its repercussions on signal degradation in urban scenarios, makes it a challenge to maintain stable coverage and connectivity. In order to obtain the best from both technologies, recent dual connectivity solutions have proved their capabilities to improve performance when compared with coexistent standalone 5G and 4G technologies. Reinforcement learning (RL) has shown its huge potential in wireless scenarios where parameter learning is required given the dynamic nature of such context. In this paper, we propose two reinforcement learning algorithms: a single agent RL algorithm named Clipped Double Q-Learning (CDQL) and a hierarchical Deep Q-Learning (HiDQL) to improve Multiple Radio Access Technology (multi-RAT) dual-connectivity handover. We compare our proposal with two baselines: a fixed parameter and a dynamic parameter solution. Simulation results reveal significant improvements in terms of latency with a gain of 47.6% and 26.1% for Digital-Analog beamforming (BF), 17.1% and 21.6% for Hybrid-Analog BF, and 24.7% and 39% for Analog-Analog BF when comparing the RL-schemes HiDQL and CDQL with the existing solutions. HiDQL presented a slower convergence time but obtained a better solution than CDQL. Additionally, we foresee the advantages of utilizing context information, such as the geo-location of the UEs, to reduce the beam exploration sector and thus further improve multi-RAT handover latency results.  ( 2 min )
    Multi-Target Landmark Detection with Incomplete Images via Reinforcement Learning and Shape Prior. (arXiv:2301.05392v1 [cs.CV])
    Medical images are generally acquired with limited field-of-view (FOV), which could lead to incomplete regions of interest (ROI), and thus impose a great challenge on medical image analysis. This is particularly evident for learning-based multi-target landmark detection, where algorithms could be misled into learning primarily the variation of the background due to the varying FOV, and thus fail to detect the targets. Based on learning a navigation policy, instead of predicting targets directly, reinforcement learning (RL)-based methods have the potential to tackle this challenge in an efficient manner. Inspired by this, in this work we propose a multi-agent RL framework for simultaneous multi-target landmark detection. This framework aims to learn from incomplete and/or complete images to form an implicit knowledge of global structure, which is consolidated during the training stage for the detection of targets from either complete or incomplete test images. To further explicitly exploit the global structural information from incomplete images, we propose to embed a shape model into the RL process. With this prior knowledge, the proposed RL model can not only localize dozens of targets simultaneously, but also work effectively and robustly in the presence of incomplete images. We validated the applicability and efficacy of the proposed method on various multi-target detection tasks with incomplete images from practical clinics, using body dual-energy X-ray absorptiometry (DXA), cardiac MRI and head CT datasets. Results showed that our method could predict the whole set of landmarks with incomplete training images up to 80% missing proportion (average distance error 2.29 cm on body DXA), and could detect unseen landmarks in regions with missing image information outside the FOV of target images (average distance error 6.84 mm on 3D half-head CT).  ( 2 min )
    A Generic Graph Sparsification Framework using Deep Reinforcement Learning. (arXiv:2112.01565v2 [cs.LG] UPDATED)
    The interconnectedness and interdependence of modern graphs are growing ever more complex, demanding enormous resources for the processing, storage, communication, and decision-making of these graphs. In this work, we focus on the task of graph sparsification: an edge-reduced graph of a similar structure to the original graph is produced while various user-defined graph metrics are largely preserved. Existing graph sparsification methods are mostly sampling-based, which in general introduce high computational complexity and lack flexibility for a different reduction objective. We present SparRL, the first generic and effective graph sparsification framework enabled by deep reinforcement learning. SparRL can easily adapt to different reduction goals and promises graph-size-independent complexity. Extensive experiments show that SparRL outperforms all prevailing sparsification methods in producing high-quality sparsified graphs concerning a variety of objectives.  ( 2 min )
    Co-manipulation of soft-materials estimating deformation from depth images. (arXiv:2301.05609v1 [cs.RO])
    Human-robot co-manipulation of soft materials, such as fabrics, composites, and sheets of paper/cardboard, is a challenging operation that presents several relevant industrial applications. Estimating the deformation state of the co-manipulated material is one of the main challenges. Viable methods provide the indirect measure by calculating the human-robot relative distance. In this paper, we develop a data-driven model to estimate the deformation state of the material from a depth image through a Convolutional Neural Network (CNN). First, we define the deformation state of the material as the relative roto-translation from the current robot pose and a human grasping position. The model estimates the current deformation state through a Convolutional Neural Network, specifically a DenseNet-121 pretrained on ImageNet. The delta between the current and the desired deformation state is fed to the robot controller that outputs twist commands. The paper describes the developed approach to acquire, preprocess the dataset and train the model. The model is compared with the current state-of-the-art method based on a skeletal tracker from cameras. Results show that our approach achieves better performances and avoids the various drawbacks caused by using a skeletal tracker. Finally, we also studied the model performance according to different architectures and dataset dimensions to minimize the time required for dataset acquisition.  ( 2 min )
    Time-Myopic Go-Explore: Learning A State Representation for the Go-Explore Paradigm. (arXiv:2301.05635v1 [cs.LG])
    Very large state spaces with a sparse reward signal are difficult to explore. The lack of sophisticated guidance results in poor performance for numerous reinforcement learning algorithms. In these cases, the commonly used random exploration is often not helpful. The literature shows that these kinds of environments require enormous efforts to systematically explore large chunks of the state space. Learned state representations can help here to improve the search by providing semantic context and building a structure on top of the raw observations. In this work we introduce a novel time-myopic state representation that clusters temporally close states together while providing a time prediction capability between them. By adapting this model to the Go-Explore paradigm (Ecoffet et al., 2021b), we demonstrate the first learned state representation that reliably estimates novelty instead of using the hand-crafted representation heuristic. Our method shows an improved solution for the detachment problem, which still remains an issue in the Go-Explore Exploration Phase. We provide evidence that our proposed method covers the entire state space with respect to all possible time trajectories without causing disadvantageous conflict-overlaps in the cell archive. Analogous to native Go-Explore, our approach is evaluated on the hard exploration environments MontezumaRevenge, Gravitar and Frostbite (Atari) in order to validate its capabilities on difficult tasks. Our experiments show that time-myopic Go-Explore is an effective alternative for the domain-engineered heuristic while also being more general. The source code of the method is available on GitHub.  ( 2 min )
    A Novel Framework for Handling Sparse Data in Traffic Forecast. (arXiv:2301.05292v1 [cs.LG])
    The ever-increasing number of GPS-equipped vehicles provides valuable real-time traffic information for the roads traversed by the moving vehicles. In this way, a set of sparse and time-evolving traffic reports is generated for each road. These time series are a valuable asset for forecasting the future traffic condition. In this paper we present a deep learning framework that encodes the sparse recent traffic information and forecasts the future traffic condition. Our framework consists of a recurrent part and a decoder. The recurrent part employs an attention mechanism that encodes the traffic reports that are available at a particular time window. The decoder is responsible for forecasting the future traffic condition.  ( 2 min )
    Dynamic Data Assimilation of MPAS-O and the Global Drifter Dataset. (arXiv:2301.05551v1 [physics.ao-ph])
    In this study, we propose a new method for combining in situ buoy measurements with Earth system models (ESMs) to improve the accuracy of temperature predictions in the ocean. The technique utilizes the dynamics and modes identified in ESMs to improve the accuracy of buoy measurements while still preserving features such as seasonality. Using this technique, errors in localized temperature predictions made by the MPAS-O model can be corrected. We demonstrate that our approach improves accuracy compared to other interpolation and data assimilation methods. We apply our method to assimilate the Model for Prediction Across Scales Ocean component (MPAS-O) with the Global Drifter Program's in-situ ocean buoy dataset.  ( 2 min )
    LVRNet: Lightweight Image Restoration for Aerial Images under Low Visibility. (arXiv:2301.05434v1 [cs.CV])
    Learning to recover clear images from images having a combination of degrading factors is a challenging task. That being said, autonomous surveillance in low visibility conditions caused by high pollution/smoke, poor air quality index, low light, atmospheric scattering, and haze during a blizzard becomes even more important to prevent accidents. It is thus crucial to form a solution that can result in a high-quality image and is efficient enough to be deployed for everyday use. However, the lack of proper datasets available to tackle this task limits the performance of the previous methods proposed. To this end, we generate the LowVis-AFO dataset, containing 3647 paired dark-hazy and clear images. We also introduce a lightweight deep learning model called Low-Visibility Restoration Network (LVRNet). It outperforms previous image restoration methods with low latency, achieving a PSNR value of 25.744 and an SSIM of 0.905, making our approach scalable and ready for practical use. The code and data can be found at https://github.com/Achleshwar/LVRNet.  ( 2 min )
    A Comprehensive Survey to Dataset Distillation. (arXiv:2301.05603v1 [cs.LG])
    Deep learning technology has developed at an unprecedented pace in the last decade and has become the primary choice in many application domains. This progress is mainly attributed to a systematic collaboration in which rapidly growing computing resources encourage advanced algorithms to deal with massive data. However, it gradually becomes challenging to cope with the unlimited growth of data with limited computing power. To this end, diverse approaches are proposed to improve data processing efficiency. Dataset distillation, one of the dataset reduction methods, tackles the problem via synthesising a small typical dataset from giant data and has attracted a lot of attention from the deep learning community. Existing dataset distillation methods can be taxonomised into meta-learning and data matching frameworks according to whether they explicitly mimic target data. Although dataset distillation has shown surprising performance in compressing datasets, it still possesses several limitations, such as distilling high-resolution data. This paper provides a holistic understanding of dataset distillation from multiple aspects, including distillation frameworks and algorithms, disentangled dataset distillation, performance comparison, and applications. Finally, we discuss challenges and promising directions to further promote future studies about dataset distillation.  ( 2 min )
    Predictions of photophysical properties of phosphorescent platinum(II) complexes based on ensemble machine learning approach. (arXiv:2301.05639v1 [cs.LG])
    Phosphorescent metal complexes have been under intense investigations as emissive dopants for energy efficient organic light emitting diodes (OLEDs). Among them, cyclometalated Pt(II) complexes are widespread triplet emitters with color-tunable emissions. To render their practical applications as OLED emitters, it is in great need to develop Pt(II) complexes with high radiative decay rate constant ($k_r$) and photoluminescence (PL) quantum yield. Thus, an efficient and accurate prediction tool is highly desirable. Here, we develop a general protocol for accurate predictions of emission wavelength, radiative decay rate constant, and PL quantum yield for phosphorescent Pt(II) emitters based on the combination of first-principles quantum mechanical method, machine learning (ML) and experimental calibration. A new dataset concerning phosphorescent Pt(II) emitters is constructed, with more than two hundred samples collected from the literature. Features containing pertinent electronic properties of the complexes are chosen. Our results demonstrate that ensemble learning models combined with stacking-based approaches exhibit the best performance, where the values of squared correlation coefficients ($R^2$), mean absolute error (MAE), and root mean square error (RMSE) are 0.96, 7.21 nm and 13.00 nm for emission wavelength prediction, and 0.81, 0.11 and 0.15 for PL quantum yield prediction. For radiative decay rate constant ($k_r$), the obtained value of $R^2$ is 0.67 while MAE and RMSE are 0.21 and 0.25 (both in log scale), respectively. The accuracy of the protocol is further confirmed using 24 recently reported Pt(II) complexes, which demonstrates its reliability for a broad palette of Pt(II) emitters. We expect this protocol will become a valuable tool, accelerating the rational design of novel OLED materials with desired properties.  ( 3 min )
    Detection problems in the spiked matrix models. (arXiv:2301.05331v1 [math.ST])
    We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-P\'{e}ch\'{e} (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.  ( 2 min )
    Inaccessible Neural Language Models Could Reinvigorate Linguistic Nativism. (arXiv:2301.05272v1 [cs.CL])
    Large Language Models (LLMs) have been making big waves in the machine learning community within the past few years. The impressive scalability of LLMs due to the advent of deep learning can be seen as a continuation of empiricist linguistic methods, as opposed to rule-based linguistic methods that are grounded in a nativist perspective. Current LLMs are generally inaccessible to resource-constrained researchers, due to a variety of factors including closed source code. This work argues that this lack of accessibility could instill a nativist bias in researchers new to computational linguistics, given that new researchers may only have rule-based, nativist approaches to study to produce new work. Also, given that there are numerous critics of deep learning claiming that LLMs and related methods may soon lose their relevancy, we speculate that such an event could trigger a new wave of nativism in the language processing community. To prevent such a dramatic shift, and favoring hybrid methods of rules and deep learning, we call upon researchers to open source their LLM code wherever possible to allow both empiricist and hybrid approaches to remain accessible.  ( 2 min )
    A survey and taxonomy of loss functions in machine learning. (arXiv:2301.05579v1 [cs.LG])
    Most state-of-the-art machine learning techniques revolve around the optimisation of loss functions. Defining appropriate loss functions is therefore critical to successfully solving problems in this field. We present a survey of the most commonly used loss functions for a wide range of different applications, divided into classification, regression, ranking, sample generation and energy based modelling. Overall, we introduce 33 different loss functions and we organise them into an intuitive taxonomy. Each loss function is given a theoretical backing and we describe where it is best used. This survey aims to provide a reference of the most essential loss functions for both beginner and advanced machine learning practitioners.  ( 2 min )
    Building a Fuel Moisture Model for the Coupled Fire-Atmosphere Model WRF-SFIRE from Data: From Kalman Filters to Recurrent Neural Networks. (arXiv:2301.05427v1 [cs.LG])
    The current fuel moisture content (FMC) subsystems in WRF-SFIRE and its workflow system WRFx use a time-lag differential equation model with assimilation of data from FMC sensors on Remote Automated Weather Stations (RAWS) by the extended augmented Kalman filter. But the quality of the result is constrained by the limitations of the model and of the Kalman filter. We observe that the data flow in a system consisting of a model and the Kalman filter can be interpreted to be the same as the data flow in a recurrent neural network (RNN). Thus, instead of building more sophisticated models and data assimilation methods, we want to train an RNN to approximate the dynamics of the response of the FMC sensor to a time series of environmental data. Because standard AI approaches did not converge to reasonable solutions, we pre-train the RNN with special initial weights devised to turn it into a numerical solver of the differential equation. We then allow the AI training machinery to optimize the RNN weights to fit the data better. We illustrate the method on an example of a time series of 10h-FMC from RAWS and weather data from the Real-Time Mesoscale Analysis (RTMA).  ( 2 min )
    Decentralized model-free reinforcement learning in stochastic games with average-reward objective. (arXiv:2301.05630v1 [cs.LG])
    We propose the first model-free algorithm that achieves low regret performance for decentralized learning in two-player zero-sum tabular stochastic games with infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret performance against an arbitrary opponent. This contrasts with centralized learning where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structural assumptions are needed to ensure good behavior of the learning processes: here we assume that, for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is analogous to the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both sublinear high probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.  ( 2 min )
    Applied Computer Vision on 2-Dimensional Lung X-Ray Images for Assisted Medical Diagnosis of Pneumonia. (arXiv:2207.13295v1 [eess.IV] CROSS LISTED)
    This study focuses on the application of computer vision, a subfield of artificial intelligence, to the analysis of 2-dimensional lung x-ray images for the assisted medical diagnosis of ordinary pneumonia. A convolutional neural network algorithm was implemented in a Python-coded, Flask-based web application that can analyze x-ray images for the detection of ordinary pneumonia. Since convolutional neural network algorithms rely on machine learning for the identification and detection of patterns, a technique referred to as transfer learning was implemented to train the neural network in the identification and detection of patterns within the dataset. Open-source lung x-ray images were used as training data to create a knowledge base that served as the core element of the web application, and the experimental design employed a 5-Trial Confirmatory Test for the validation of the web application. The results of the 5-Trial Confirmatory Test show the calculation of Diagnostic Precision Percentage per Trial, General Diagnostic Precision Percentage, and General Diagnostic Error Percentage, while the Confusion Matrix further shows the relationship between the label and the corresponding diagnosis result of the web application for each test image. The developed web application can be used by medical practitioners in A.I.-assisted diagnosis of ordinary pneumonia, and by researchers in the fields of computer science and bioinformatics.  ( 2 min )
    Distributed Online Private Learning of Convex Nondecomposable Objectives. (arXiv:2206.07944v3 [math.OC] UPDATED)
    We deal with a general distributed constrained online learning problem with privacy over time-varying networks, where a class of nondecomposable objectives is considered. Under this setting, each node only controls a part of the global decision, and the goal of all nodes is to collaboratively minimize the global cost over a time horizon $T$ while guaranteeing the security of the transmitted information. For such problems, we first design a novel generic algorithm framework, named DPSDA, for differentially private distributed online learning using the Laplace mechanism and stochastic variants of the dual averaging method. Note that in the dual updates, all nodes of DPSDA employ noise-corrupted gradients for more generality. Then, we propose two algorithms, named DPSDA-C and DPSDA-PS, under this framework. In DPSDA-C, the nodes implement a circulation-based communication in the primal updates so as to alleviate the disagreements over time-varying undirected networks. In addition, for the extension to time-varying directed ones, the nodes implement the broadcast-based push-sum dynamics in DPSDA-PS, which can achieve average consensus over arbitrary directed networks. Theoretical results show that both algorithms attain an expected regret upper bound in $\mathcal{O}( \sqrt{T} )$ when the objective function is convex, which matches the best utility achievable by cutting-edge algorithms. Finally, numerical experiment results on both synthetic and real-world datasets verify the effectiveness of our algorithms.  ( 2 min )
    Sem@$K$: Is my knowledge graph embedding model semantic-aware?. (arXiv:2301.05601v1 [cs.LG])
    Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@$K$ that measures the capability of models to predict valid entities w.r.t. domain and range constraints. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@$K$. We also perform an extensive study of KGEM semantic awareness. Our experiments show that Sem@$K$ provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offers different conclusions on the predictive power of models. Regarding Sem@$K$, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.  ( 2 min )
    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v1 [cs.LG])
    In deep active learning, it is important to choose multiple examples to mark up at each step in order to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We therefore present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.  ( 2 min )
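    For context, the mutual-information score that BatchBALD generalizes to batches can be sketched for a single candidate example. This is the standard single-point BALD score (predictive entropy minus expected entropy over posterior samples). It is not the Large BatchBALD approximation from the abstract, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bald_score(mc_probs):
    """Single-point BALD score from K posterior samples.

    mc_probs is a list of K class-probability vectors, e.g. obtained from
    K stochastic forward passes. The score is
    H[mean_k p_k] - mean_k H[p_k], which is large when the samples are
    individually confident but disagree with each other.
    """
    k = len(mc_probs)
    n_classes = len(mc_probs[0])
    mean_probs = [sum(s[c] for s in mc_probs) / k for c in range(n_classes)]
    expected_entropy = sum(entropy(s) for s in mc_probs) / k
    return entropy(mean_probs) - expected_entropy
```

    Scoring each point separately ignores redundancy between the selected points. The joint mutual information used by BatchBALD accounts for that redundancy, at the exponential cost the abstract mentions.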
    Multilingual Alzheimer's Dementia Recognition through Spontaneous Speech: a Signal Processing Grand Challenge. (arXiv:2301.05562v1 [eess.AS])
    This Signal Processing Grand Challenge (SPGC) targets a difficult automatic prediction problem of societal and medical relevance, namely, the detection of Alzheimer's Dementia (AD). Participants were invited to employ signal processing and machine learning methods to create predictive models based on spontaneous speech data. The Challenge has been designed to assess the extent to which predictive models built based on speech in one language (English) generalise to another language (Greek). To the best of our knowledge no work has investigated acoustic features of the speech signal in multilingual AD detection. Our baseline system used conventional machine learning algorithms with Active Data Representation of acoustic features, achieving accuracy of 73.91% on AD detection, and 4.95 root mean squared error on cognitive score prediction.  ( 2 min )
    Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning. (arXiv:2301.05664v1 [cs.LG])
    In safety-critical decision-making scenarios, being able to identify worst-case outcomes, or dead-ends, is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points, by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.  ( 2 min )
    Amenable Sparse Network Investigator. (arXiv:2202.09284v2 [cs.LG] UPDATED)
    We present the "Amenable Sparse Network Investigator" (ASNI) algorithm, which utilizes a novel pruning strategy based on a sigmoid function that induces the sparsity level globally over the course of one single round of training. The ASNI algorithm fulfills both tasks, whereas current state-of-the-art strategies can each perform only one of them. The ASNI algorithm has two subalgorithms: 1) ASNI-I, 2) ASNI-II. ASNI-I learns an accurate sparse off-the-shelf network in only one single round of training. ASNI-II learns a sparse network and an initialization that is quantized, compressed, and from which the sparse network is trainable. The learned initialization is quantized since only two numbers are learned for initialization of nonzero parameters in each layer L. Thus, the number of quantization levels for the initialization of the entire network is 2L. Also, the learned initialization is compressed because it is a set consisting of 2L numbers. The special sparse network that can be trained from such a quantized and compressed initialization is called amenable. To the best of our knowledge, there is no other algorithm that can learn a quantized and compressed initialization from which the network is still trainable and is able to solve both pruning tasks. Our numerical experiments show that there is a quantized and compressed initialization from which the learned sparse network can be trained and reach an accuracy on a par with the dense version. We experimentally show that these 2L levels of quantization are concentration points of parameters in each layer of the learned sparse network by ASNI-I. To corroborate the above, we have performed a series of experiments utilizing networks such as ResNets, VGG-style, small convolutional, and fully connected ones on ImageNet, CIFAR10, and MNIST datasets.  ( 2 min )
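    A sigmoid-shaped global sparsity schedule of the kind the abstract describes could look roughly like the sketch below. The abstract gives no formula, so the schedule shape, the parameters (`final_sparsity`, `steepness`), and the magnitude-based pruning criterion are all assumptions made for illustration and are not ASNI's actual rule.

```python
import math


def sigmoid_sparsity(step, total_steps, final_sparsity=0.9, steepness=10.0):
    """Hypothetical schedule: target sparsity rises along a sigmoid in training progress."""
    progress = step / total_steps
    return final_sparsity / (1.0 + math.exp(-steepness * (progress - 0.5)))


def global_magnitude_mask(weights, sparsity):
    """Return a keep-mask that prunes the smallest-magnitude weights globally.

    weights is a flat list of floats gathered from all layers, so the
    threshold is shared across the whole network rather than set per layer.
    """
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(weights))]
```

    Under these assumptions, calling `global_magnitude_mask(weights, sigmoid_sparsity(step, total_steps))` at each training step would tighten the mask gradually within a single training run.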
    Accelerating nuclear-norm regularized low-rank matrix optimization through Burer-Monteiro decomposition. (arXiv:2204.14067v2 [cs.LG] UPDATED)
    This work proposes a rapid algorithm, BM-Global, for nuclear-norm-regularized convex and low-rank matrix optimization problems. BM-Global efficiently decreases the objective value via low-cost steps leveraging the nonconvex but smooth Burer-Monteiro (BM) decomposition, while effectively escaping saddle points and spurious local minima ubiquitous in the BM form to obtain guarantees of fast convergence rates to the global optima of the original nuclear-norm-regularized problem through aperiodic inexact proximal gradient steps on it. The proposed approach adaptively adjusts the rank for the BM decomposition and can provably identify an optimal rank for the BM decomposition problem automatically in the course of optimization through tools of manifold identification. BM-Global hence also spends significantly less time on parameter tuning than existing matrix-factorization methods, which require an exhaustive search for finding this optimal rank. Extensive experiments on real-world large-scale problems of recommendation systems, regularized kernel estimation, and molecular conformation confirm that BM-Global can indeed effectively escape spurious local minima at which existing BM approaches are stuck, and is an order of magnitude faster than state-of-the-art algorithms for low-rank matrix optimization problems involving a nuclear-norm regularizer.  ( 2 min )
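    The link between the nuclear-norm problem and its Burer-Monteiro form rests on a standard variational identity, written out here for reference. The symbols $f$, $\lambda$, $m$, $n$, and $r$ are notation assumed for this note and do not come from the abstract.

```latex
% Variational characterization of the nuclear norm
\|X\|_* \;=\; \min_{U,\,V \,:\, X = U V^\top} \tfrac{1}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right)

% Convex problem and its smooth but nonconvex BM counterpart
\min_{X \in \mathbb{R}^{m \times n}} \; f(X) + \lambda \|X\|_*
\qquad\longleftrightarrow\qquad
\min_{U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r}} \; f(U V^\top) + \tfrac{\lambda}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right)
```

    The two problems share the same optimal value when $r$ is at least the rank of a solution of the convex problem, which is why choosing the rank matters for BM-based methods.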
    Generalization Properties of NAS under Activation and Skip Connection Search. (arXiv:2209.07238v3 [cs.LG] UPDATED)
    Neural Architecture Search (NAS) has fostered the automatic discovery of state-of-the-art neural architectures. Despite the progress achieved with NAS, so far little attention has been paid to theoretical guarantees on NAS. In this work, we study the generalization properties of NAS under a unifying framework enabling (deep) layer skip connection search and activation function search. To this end, we derive the lower (and upper) bounds of the minimum eigenvalue of the Neural Tangent Kernel (NTK) under the (in)finite-width regime using a certain search space including mixed activation functions, fully connected, and residual neural networks. We use the minimum eigenvalue to establish generalization error bounds of NAS in the stochastic gradient descent training. Importantly, we theoretically and experimentally show how the derived results can guide NAS to select the top-performing architectures, even in the case without training, leading to a train-free algorithm based on our theory. Accordingly, our numerical validation sheds light on the design of computationally efficient methods for NAS. Our analysis is non-trivial due to the coupling of various architectures and activation functions under the unifying framework and has its own interest in providing the lower bound of the minimum eigenvalue of NTK in deep learning theory.  ( 2 min )
    Out-Of-Distribution Detection Is Not All You Need. (arXiv:2211.16158v2 [cs.LG] UPDATED)
    The usage of deep neural networks in safety-critical systems is limited by our ability to guarantee their correct behavior. Runtime monitors are components aiming to identify unsafe predictions and discard them before they can lead to catastrophic consequences. Several recent works on runtime monitoring have focused on out-of-distribution (OOD) detection, i.e., identifying inputs that are different from the training data. In this work, we argue that OOD detection is not a well-suited framework to design efficient runtime monitors and that it is more relevant to evaluate monitors based on their ability to discard incorrect predictions. We call this setting out-of-model-scope detection and discuss the conceptual differences with OOD. We also conduct extensive experiments on popular datasets from the literature to show that studying monitors in the OOD setting can be misleading: 1. very good OOD results can give a false impression of safety, 2. comparison under the OOD setting does not allow identifying the best monitor to detect errors. Finally, we also show that removing erroneous training data samples helps to train better monitors.  ( 2 min )
    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey. (arXiv:2205.04712v2 [cs.LG] UPDATED)
    The existence of representative datasets is a prerequisite of many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the data used for training. The reasons for this are manifold and range from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches, and eventually to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-based models with existing knowledge. The identified approaches are structured according to the categories integration, extraction and conformity. Special attention is given to applications in the field of autonomous driving.  ( 2 min )
    Hyperparameter Optimization as a Service on INFN Cloud. (arXiv:2301.05522v1 [cs.DC])
    The simplest and often most effective way of parallelizing the training of complex machine learning models is to execute several training instances on multiple machines, possibly scanning the hyperparameter space to optimize the underlying statistical model and the learning procedure. Often, such a meta learning procedure is limited by the ability to securely access a common database organizing the knowledge of the previous and ongoing trials. Exploiting opportunistic GPUs provided in different environments represents a further challenge when designing such optimization campaigns. In this contribution we discuss how a set of REST APIs can be used to access a dedicated service based on INFN Cloud to monitor and possibly coordinate multiple training instances, with gradient-less optimization techniques, via simple HTTP requests. The service, named Hopaas (Hyperparameter OPtimization As A Service), is made of a web interface and sets of APIs implemented with a FastAPI back-end running through Uvicorn and NGINX in a virtual instance of INFN Cloud. The optimization algorithms are currently based on Bayesian techniques as provided by Optuna. A Python front-end is also made available for quick prototyping. We present applications to hyperparameter optimization campaigns performed combining private, INFN Cloud and CINECA resources.  ( 2 min )
    A Deep Reinforcement Learning Framework For Column Generation. (arXiv:2206.02568v3 [math.OC] UPDATED)
    Column Generation (CG) is an iterative algorithm for solving linear programs (LPs) with an extremely large number of variables (columns). CG is the workhorse for tackling large-scale integer linear programs, which rely on CG to solve LP relaxations within a branch and price algorithm. Two canonical applications are the Cutting Stock Problem (CSP) and Vehicle Routing Problem with Time Windows (VRPTW). In VRPTW, for example, each binary variable represents the decision to include or exclude a route, of which there are exponentially many; CG incrementally grows the subset of columns being used, ultimately converging to an optimal solution. We propose RLCG, the first Reinforcement Learning (RL) approach for CG. Unlike typical column selection rules which myopically select a column based on local information at each iteration, we treat CG as a sequential decision-making problem: the column selected in a given iteration affects subsequent column selections. This perspective lends itself to a Deep Reinforcement Learning approach that uses Graph Neural Networks (GNNs) to represent the variable-constraint structure in the LP of interest. We perform an extensive set of experiments using the publicly available BPPLIB benchmark for CSP and Solomon benchmark for VRPTW. RLCG converges faster and reduces the number of CG iterations by 22.4% for CSP and 40.9% for VRPTW on average compared to a commonly used greedy policy. Our code is available at https://github.com/chichengmessi/reinforcement-learning-for-column-generation.git.  ( 2 min )
    Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). (arXiv:2209.07263v3 [cs.LG] UPDATED)
    We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness with the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness. Moreover, under the non-lazy training regime, we demonstrate how the width of a two-layer ReLU network benefits robustness. Our theoretical developments improve the results by [Huang et al. NeurIPS21; Wu et al. NeurIPS21] and are consistent with [Bubeck and Sellke NeurIPS21; Bubeck et al. COLT21].  ( 2 min )
    TUSK: Task-Agnostic Unsupervised Keypoints. (arXiv:2206.08460v2 [cs.CV] UPDATED)
    Existing unsupervised methods for keypoint learning rely heavily on the assumption that a specific keypoint type (e.g. elbow, digit, abstract geometric shape) appears only once in an image. This greatly limits their applicability, as each instance must be isolated before applying the method, an issue that is never discussed or evaluated. We thus propose a novel method to learn Task-agnostic, UnSupervised Keypoints (TUSK) which can deal with multiple instances. To achieve this, instead of the commonly-used strategy of detecting multiple heatmaps, each dedicated to a specific keypoint type, we use a single heatmap for detection, and enable unsupervised learning of keypoint types through clustering. Specifically, we encode semantics into the keypoints by teaching them to reconstruct images from a sparse set of keypoints and their descriptors, where the descriptors are forced to form distinct clusters in feature space around learned prototypes. This makes our approach amenable to a wider range of tasks than any previous unsupervised keypoint method: we show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances.  ( 2 min )
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space. (arXiv:2206.11895v4 [cs.CV] UPDATED)
    Humans are remarkably flexible in understanding viewpoint changes due to the visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representations from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representations for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.  ( 2 min )
    Non-Stochastic CDF Estimation Using Threshold Queries. (arXiv:2301.05682v1 [cs.LG])
    Estimating the empirical distribution of a scalar-valued data set is a basic and fundamental task. In this paper, we tackle the problem of estimating an empirical distribution in a setting with two challenging features. First, the algorithm does not directly observe the data; instead, it only asks a limited number of threshold queries about each sample. Second, the data are not assumed to be independent and identically distributed; instead, we allow for an arbitrary process generating the samples, including an adaptive adversary. These considerations are relevant, for example, when modeling a seller experimenting with posted prices to estimate the distribution of consumers' willingness to pay for a product: offering a price and observing a consumer's purchase decision is equivalent to asking a single threshold query about their value, and the distribution of consumers' values may be non-stationary over time, as early adopters may differ markedly from late adopters. Our main result quantifies, to within a constant factor, the sample complexity of estimating the empirical CDF of a sequence of elements of $[n]$, up to $\varepsilon$ additive error, using one threshold query per sample. The complexity depends only logarithmically on $n$, and our result can be interpreted as extending the existing logarithmic-complexity results for noisy binary search to the more challenging setting where noise is non-stochastic. Along the way to designing our algorithm, we consider a more general model in which the algorithm is allowed to make a limited number of simultaneous threshold queries on each sample. We solve this problem using Blackwell's Approachability Theorem and the exponential weights method. As a side result of independent interest, we characterize the minimum number of simultaneous threshold queries required by deterministic CDF estimation algorithms.  ( 2 min )
    TransfQMix: Transformers for Leveraging the Graph Structure of Multi-Agent Reinforcement Learning Problems. (arXiv:2301.05334v1 [cs.LG])
    Coordination is one of the most difficult aspects of multi-agent reinforcement learning (MARL). One reason is that agents normally choose their actions independently of one another. In order to see coordination strategies emerging from the combination of independent policies, recent research has focused on the use of a centralized function (CF) that learns each agent's contribution to the team reward. However, the structure in which the environment is presented to the agents and to the CF is typically overlooked. We have observed that the features used to describe the coordination problem can be represented as vertex features of a latent graph structure. Here, we present TransfQMix, a new approach that uses transformers to leverage this latent structure and learn better coordination policies. Our transformer agents perform graph reasoning over the state of the observable entities. Our transformer Q-mixer learns a monotonic mixing function from a larger graph that includes the internal and external states of the agents. TransfQMix is designed to be entirely transferable, meaning that the same parameters can be used to control and train larger or smaller teams of agents. This enables deploying promising approaches to save training time and derive general policies in MARL, such as transfer learning, zero-shot transfer, and curriculum learning. We report TransfQMix's performance in the Spread and StarCraft II environments. In both settings, it outperforms state-of-the-art Q-Learning models, and it demonstrates effectiveness in solving problems that other methods cannot solve.  ( 2 min )
    Deep Reinforcement Learning for Asset Allocation: Reward Clipping. (arXiv:2301.05300v1 [q-fin.CP])
    Recently, there have been many attempts to apply reinforcement learning to asset allocation in order to earn more stable profits. In this paper, we compare the performance of several reinforcement learning algorithms: actor-only, actor-critic, and PPO models. Furthermore, we analyze each model's characteristics and then introduce an advanced algorithm, the so-called Reward Clipping model. The Reward Clipping model appears to be better than other existing models in the finance domain, especially in portfolio optimization: it is strong in both bull and bear markets. Finally, we compare the performance of these models with traditional investment strategies during decreasing and increasing markets.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v2 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify the port-Hamiltonian formalism accordingly so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated with the experimental characterization and posterior learning process of this kind of system. Predictions can be made, however, at the scale of the complete system. Examples are shown of the performance of the proposed technique.  ( 2 min )
    Deep Learning Symmetries and Their Lie Groups, Algebras, and Subalgebras from First Principles. (arXiv:2301.05638v1 [hep-ph])
    We design a deep-learning algorithm for the discovery and identification of the continuous group of symmetries present in a labeled dataset. We use fully connected neural networks to model the symmetry transformations and the corresponding generators. We construct loss functions that ensure that the applied transformations are symmetries and that the corresponding set of generators forms a closed (sub)algebra. Our procedure is validated with several examples illustrating different types of conserved quantities preserved by symmetry. In the process of deriving the full set of symmetries, we analyze the complete subgroup structure of the rotation groups $SO(2)$, $SO(3)$, and $SO(4)$, and of the Lorentz group $SO(1,3)$. Other examples include squeeze mapping, piecewise discontinuous labels, and $SO(10)$, demonstrating that our method is completely general, with many possible applications in physics and data science. Our study also opens the door for using a machine learning approach in the mathematical study of Lie groups and their properties.  ( 2 min )
    Almost Surely $\sqrt{T}$ Regret Bound for Adaptive LQR. (arXiv:2301.05537v1 [math.OC])
    The Linear-Quadratic Regulation (LQR) problem with unknown system parameters has been widely studied, but it has remained unclear whether $\tilde{ \mathcal{O}}(\sqrt{T})$ regret, which is the best known dependence on time, can be achieved almost surely. In this paper, we propose an adaptive LQR controller with an almost sure $\tilde{ \mathcal{O}}(\sqrt{T})$ regret upper bound. The controller features a circuit-breaking mechanism, which circumvents potential safety breaches and guarantees the convergence of the system parameter estimate, but is shown to be triggered only finitely often and hence has a negligible effect on the asymptotic performance of the controller. The proposed controller is also validated via simulation on the Tennessee Eastman Process (TEP), a commonly used industrial process example.  ( 2 min )
    A Solver-Free Framework for Scalable Learning in Neural ILP Architectures. (arXiv:2210.09082v2 [cs.LG] UPDATED)
    There is a recent focus on designing architectures that have an Integer Linear Programming (ILP) layer within a neural model (referred to as Neural ILP in this paper). Neural ILP architectures are suitable for pure reasoning tasks that require data-driven constraint learning or for tasks requiring both perception (neural) and reasoning (ILP). A recent SOTA approach for end-to-end training of Neural ILP explicitly defines gradients through the ILP black box (Paulus et al. 2021) - this trains extremely slowly, owing to a call to the underlying ILP solver for every training data point in a minibatch. In response, we present an alternative training strategy that is solver-free, i.e., does not call the ILP solver at all at training time. Neural ILP has a set of trainable hyperplanes (for cost and constraints in ILP), together representing a polyhedron. Our key idea is that the training loss should impose that the final polyhedron separates the positives (all constraints satisfied) from the negatives (at least one violated constraint or a suboptimal cost value), via a soft-margin formulation. While positive example(s) are provided as part of the training data, we devise novel techniques for generating negative samples. Our solution is flexible enough to handle equality as well as inequality constraints. Experiments on several problems, both perceptual as well as symbolic, which require learning the constraints of an ILP, show that our approach has superior performance and scales much better compared to purely neural baselines and other state-of-the-art models that require solver-based training. In particular, we are able to obtain excellent performance in 9 x 9 symbolic and visual sudoku, to which the other Neural ILP solver is not able to scale.  ( 2 min )
    On "Deep Learning" Misconduct. (arXiv:2211.16350v4 [cs.LG] UPDATED)
    This is a theoretical paper, as a companion paper of the plenary talk for the same conference ISAIC 2022. In contrast to the author's plenary talk in the same conference, conscious learning (Weng, 2022b; Weng, 2022c) which develops a single network for a life (many tasks), "Deep Learning" trains multiple networks for each task. Although "Deep Learning" may use different learning modes, including supervised, reinforcement and adversarial modes, almost all "Deep Learning" projects apparently suffer from the same misconduct, called "data deletion" and "test on training data". This paper establishes a theorem that a simple method called Pure-Guess Nearest Neighbor (PGNN) reaches any required errors on validation data set and test data set, including zero-error requirements, through the same misconduct, as long as the test data set is in the possession of the authors and both the amount of storage space and the time of training are finite but unbounded. The misconduct violates well-known protocols called transparency and cross-validation. The nature of the misconduct is fatal, because in the absence of any disjoint test, "Deep Learning" is clearly not generalizable.  ( 2 min )
    Language-Informed Transfer Learning for Embodied Household Activities. (arXiv:2301.05318v1 [cs.RO])
    For service robots to become general-purpose in everyday household environments, they need not only a large library of primitive skills, but also the ability to quickly learn novel tasks specified by users. Fine-tuning neural networks on a variety of downstream tasks has been successful in many vision and language domains, but research is still limited on transfer learning between diverse long-horizon tasks. We propose that, compared to reinforcement learning for a new household activity from scratch, home robots can benefit from transferring the value and policy networks trained for similar tasks. We evaluate this idea in the BEHAVIOR simulation benchmark which includes a large number of household activities and a set of action primitives. For easy mapping between state spaces of different tasks, we provide a text-based representation and leverage language models to produce a common embedding space. The results show that the selection of similar source activities can be informed by the semantic similarity of state and goal descriptions with the target task. We further analyze the results and discuss ways to overcome the problem of catastrophic forgetting.  ( 2 min )
    Data-Efficient Structured Pruning via Submodular Optimization. (arXiv:2203.04940v3 [cs.LG] UPDATED)
    Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performance. However, most current structured pruning methods do not provide any performance guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer, that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, thus it can be provably approximated using an efficient greedy algorithm. Our method is guaranteed to have an exponentially decreasing error between the original model and the pruned model outputs w.r.t. the pruned size, under reasonable assumptions. It is also one of the few methods in the literature that uses only a limited amount of training data and no labels. Our experimental results demonstrate that our method outperforms state-of-the-art methods in the limited-data regime.  ( 2 min )
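The greedy selection step mentioned in the abstract above can be illustrated with a generic sketch. This is not the paper's method: the function name `greedy_select` and the toy coverage objective are assumptions made for illustration, standing in for the paper's objective (the change in the next layer's input), which is not reproduced here.

```python
from typing import Callable, FrozenSet, Iterable, List


def greedy_select(
    candidates: Iterable[str],
    objective: Callable[[FrozenSet[str]], float],
    budget: int,
) -> List[str]:
    """Pick up to `budget` items, each time adding the largest marginal gain."""
    selected: List[str] = []
    remaining = set(candidates)
    for _ in range(min(budget, len(remaining))):
        current = objective(frozenset(selected))
        # sorted() makes tie-breaking deterministic (first name in sorted order).
        best = max(
            sorted(remaining),
            key=lambda c: objective(frozenset(selected + [c])) - current,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy objective: how many "features" the kept neurons cover.
COVERS = {"n0": {1, 2, 3}, "n1": {3, 4}, "n2": {4, 5, 6, 7}, "n3": {1}}


def coverage(subset: FrozenSet[str]) -> float:
    covered = set()
    for name in subset:
        covered |= COVERS[name]
    return float(len(covered))


kept = greedy_select(COVERS, coverage, budget=2)
print(kept, coverage(frozenset(kept)))
```

Each round re-evaluates the objective once per remaining candidate, so the cost is roughly budget times the number of candidates objective calls.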
    Brownian Noise Reduction: Maximizing Privacy Subject to Accuracy Constraints. (arXiv:2206.07234v3 [cs.LG] UPDATED)
    There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.  ( 2 min )
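The "trace back along the Brownian path" idea described above can be sketched as follows. This is a minimal illustration only: the function names, the time grid, and the seed are assumptions, and no privacy accounting or sensitivity calibration from the paper is included.

```python
import random
from typing import List, Tuple


def brownian_path(times: List[float], rng: random.Random) -> List[float]:
    """Sample B_t at strictly increasing times using independent Gaussian increments."""
    path, value, prev = [], 0.0, 0.0
    for t in times:
        # The increment over (prev, t] has variance t - prev.
        value += rng.gauss(0.0, (t - prev) ** 0.5)
        path.append(value)
        prev = t
    return path


def noise_reduction_releases(
    true_value: float, times: List[float], rng: random.Random
) -> List[Tuple[float, float]]:
    """Return (time, noisy estimate) pairs ordered from noisiest to least noisy.

    All estimates share one simulated path, so moving to an earlier time
    removes noise rather than drawing fresh, independent noise.
    """
    path = brownian_path(times, rng)
    return [(t, true_value + b) for t, b in reversed(list(zip(times, path)))]


releases = noise_reduction_releases(10.0, [0.25, 1.0, 4.0], random.Random(0))
for t, estimate in releases:
    print(f"time={t}: estimate={estimate:.3f}")
```

The noise in the estimate released at time t has variance t, so the first release (largest t) is the noisiest and later releases are progressively more accurate in expectation.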
    Learning with little mixing. (arXiv:2206.08269v2 [cs.LG] UPDATED)
    We study square loss in a realizable time-series framework with martingale difference noise. Our main result is a fast rate excess risk bound which shows that whenever a trajectory hypercontractivity condition holds, the risk of the least-squares estimator on dependent data matches the iid rate order-wise after a burn-in time. In comparison, many existing results in learning from dependent data have rates where the effective sample size is deflated by a factor of the mixing-time of the underlying process, even after the burn-in time. Furthermore, our results allow the covariate process to exhibit long range correlations which are substantially weaker than geometric ergodicity. We call this phenomenon learning with little mixing, and present several examples for when it occurs: bounded function classes for which the $L^2$ and $L^{2+\epsilon}$ norms are equivalent, ergodic finite state Markov chains, various parametric models, and a broad family of infinite dimensional $\ell^2(\mathbb{N})$ ellipsoids. By instantiating our main result to system identification of nonlinear dynamics with generalized linear model transitions, we obtain a nearly minimax optimal excess risk bound after only a polynomial burn-in time.  ( 2 min )
    Biases in Inverse Ising Estimates of Near-Critical Behaviour. (arXiv:2301.05556v1 [cond-mat.dis-nn])
    Inverse Ising inference allows pairwise interactions of complex binary systems to be reconstructed from empirical correlations. Typical estimators used for this inference, such as Pseudo-likelihood maximization (PLM), are biased. Using the Sherrington-Kirkpatrick (SK) model as a benchmark, we show that these biases are large in critical regimes close to phase boundaries, and may alter the qualitative interpretation of the inferred model. In particular, we show that the small-sample bias causes models inferred through PLM to appear closer-to-criticality than one would expect from the data. Data-driven methods to correct this bias are explored and applied to a functional magnetic resonance imaging (fMRI) dataset from neuroscience. Our results indicate that additional care should be taken when attributing criticality to real-world datasets.  ( 2 min )
    An efficient hybrid classification approach for COVID-19 based on Harris Hawks Optimization and Salp Swarm Optimization. (arXiv:2301.05296v1 [cs.NE])
    Feature selection can be defined as one of the pre-processing steps that decrease the dimensionality of a dataset by identifying the most significant attributes while also boosting the accuracy of classification. For solving feature selection problems, this study presents a hybrid binary version of the Harris Hawks Optimization algorithm (HHO) and Salp Swarm Optimization (SSA) (HHOSSA) for COVID-19 classification. The proposed HHOSSA presents a strategy for improving the basic HHO's performance using the Salp algorithm's power to select the best fitness values. The HHOSSA was tested against two well-known optimization algorithms, the Whale Optimization Algorithm (WOA) and the Grey Wolf Optimizer (GWO), utilizing a total of 800 chest X-ray images. A total of four performance metrics (Accuracy, Recall, Precision, F1) were employed in the studies using three classifiers (Support vector machines (SVMs), k-Nearest Neighbor (KNN), and Extreme Gradient Boosting (XGBoost)). The proposed algorithm (HHOSSA) achieved 96% accuracy with the SVM classifier, and 98% accuracy with two classifiers, XGBoost and KNN.  ( 2 min )
    Neural network with optimal neuron activation functions based on additive Gaussian process regression. (arXiv:2301.05567v1 [stat.ML])
    Feed-forward neural networks (NN) are a staple machine learning method widely used in many areas of science and technology. While even a single-hidden layer NN is a universal approximator, its expressive power is limited by the use of simple neuron activation functions (such as sigmoid functions) that are typically the same for all neurons. More flexible neuron activation functions would allow using fewer neurons and layers and thereby save computational cost and improve expressive power. We show that additive Gaussian process regression (GPR) can be used to construct optimal neuron activation functions that are individual to each neuron. An approach is also introduced that avoids non-linear fitting of neural network parameters. The resulting method combines the advantage of robustness of a linear regression with the higher expressive power of a NN. We demonstrate the approach by fitting the potential energy surface of the water molecule. Without requiring any non-linear optimization, the additive GPR based approach outperforms a conventional NN in the high accuracy regime, where a conventional NN suffers more from overfitting.  ( 2 min )
    On the explainability of quantum neural networks based on variational quantum circuits. (arXiv:2301.05549v1 [quant-ph])
    Ridge functions are used to describe and study the lower bound of the approximation done by the neural networks which can be written as a linear combination of activation functions. If the activation functions are also ridge functions, these networks are called explainable neural networks. In this paper, we first show that quantum neural networks which are based on variational quantum circuits can be written as a linear combination of ridge functions. Consequently, we show that the interpretability and explainability of such quantum neural networks can be directly considered and studied as an approximation with the linear combination of ridge functions.  ( 2 min )
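For readers unfamiliar with the term, a ridge function has the form x -> g(w . x), so it varies only along the direction w. The sketch below shows a linear combination of such functions; the weights, the `tanh` profile, and all names are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple

Ridge = Tuple[float, Sequence[float], Callable[[float], float]]


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def ridge_sum(terms: List[Ridge], x: Sequence[float]) -> float:
    """Evaluate sum_i c_i * g_i(w_i . x), a linear combination of ridge functions."""
    return sum(c * g(dot(w, x)) for c, w, g in terms)


# One ridge term with direction w = (1, 2) and profile tanh (both hypothetical).
single = [(1.0, (1.0, 2.0), math.tanh)]

x = (0.5, 0.25)
shifted = (2.5, -0.75)  # x + (2, -1); the shift is orthogonal to w

# A ridge function is constant along directions orthogonal to w.
print(ridge_sum(single, x), ridge_sum(single, shifted))
```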
    Knowledge Enhancement for Multi-Behavior Contrastive Recommendation. (arXiv:2301.05403v1 [cs.IR])
    A well-designed recommender system can accurately capture the attributes of users and items, reflecting the unique preferences of individuals. Traditional recommendation techniques usually focus on modeling the singular type of behaviors between users and items. However, in many practical recommendation scenarios (e.g., social media, e-commerce), there exist multi-typed interactive behaviors in user-item relationships, such as click, tag-as-favorite, and purchase in online shopping platforms. Thus, how to make full use of multi-behavior information for recommendation is of great importance to the existing system, which presents challenges in two aspects that need to be explored: (1) Utilizing users' personalized preferences to capture multi-behavioral dependencies; (2) Dealing with the insufficient recommendation caused by sparse supervision signal for target behavior. In this work, we propose a Knowledge Enhancement Multi-Behavior Contrastive Learning Recommendation (KMCLR) framework, including two Contrastive Learning tasks and three functional modules to tackle the above challenges, respectively. In particular, we design the multi-behavior learning module to extract users' personalized behavior information for user-embedding enhancement, and utilize knowledge graph in the knowledge enhancement module to derive more robust knowledge-aware representations for items. In addition, in the optimization stage, we model the coarse-grained commonalities and the fine-grained differences between multi-behavior of users to further improve the recommendation effect. Extensive experiments and ablation tests on the three real-world datasets indicate our KMCLR outperforms various state-of-the-art recommendation methods and verify the effectiveness of our method.  ( 2 min )
    AAAI 2022 Fall Symposium: Lessons Learned for Autonomous Assessment of Machine Abilities (LLAAMA). (arXiv:2301.05384v1 [cs.LG])
    Modern civilian and military systems have created a demand for sophisticated intelligent autonomous machines capable of operating in uncertain dynamic environments. Such systems are realizable thanks in large part to major advances in perception and decision-making techniques, which in turn have been propelled forward by modern machine learning tools. However, these newer forms of intelligent autonomy raise questions about when/how communication of the operational intent and assessments of actual vs. supposed capabilities of autonomous agents impact overall performance. This symposium examines the possibilities for enabling intelligent autonomous systems to self-assess and communicate their ability to effectively execute assigned tasks, as well as reason about the overall limits of their competencies and maintain operability within those limits. The symposium brings together researchers working in this burgeoning area of research to share lessons learned, identify major theoretical and practical challenges encountered so far, and potential avenues for future research and real-world applications.  ( 2 min )
    HTTE: A Hybrid Technique For Travel Time Estimation In Sparse Data Environments. (arXiv:2301.05293v1 [cs.LG])
    Travel time estimation is a critical task, useful to many urban applications at the individual citizen and the stakeholder level. This paper presents a novel hybrid algorithm for travel time estimation that leverages historical and sparse real-time trajectory data. Given a path and a departure time we estimate the travel time taking into account the historical information, the real-time trajectory data and the correlations among different road segments. We detect similar road segments using historical trajectories, and use a latent representation to model the similarities. Our experimental evaluation demonstrates the effectiveness of our approach.  ( 2 min )
    In BLOOM: Creativity and Affinity in Artificial Lyrics and Art. (arXiv:2301.05402v1 [cs.CL])
    We apply a large multilingual language model (BLOOM-176B) in open-ended generation of Chinese song lyrics, and evaluate the resulting lyrics for coherence and creativity using human reviewers. We find that current computational metrics for evaluating large language model outputs (MAUVE) have limitations in evaluation of creative writing. We note that the human concept of creativity requires lyrics to be both comprehensible and distinctive -- and that humans assess certain types of machine-generated lyrics to score more highly than real lyrics by popular artists. Inspired by the inherently multimodal nature of album releases, we leverage a Chinese-language stable diffusion model to produce high-quality lyric-guided album art, demonstrating a creative approach for an artist seeking inspiration for an album or single. Finally, we introduce the MojimLyrics dataset, a Chinese-language dataset of popular song lyrics for future research.  ( 2 min )
    A Constrained-Optimization Approach to the Execution of Prioritized Stacks of Learned Multi-Robot Tasks. (arXiv:2301.05346v1 [cs.RO])
    This paper presents a constrained-optimization formulation for the prioritized execution of learned robot tasks. The framework lends itself to the execution of tasks encoded by value functions, such as tasks learned using the reinforcement learning paradigm. The tasks are encoded as constraints of a convex optimization program by using control Lyapunov functions. Moreover, an additional constraint is enforced in order to specify relative priorities between the tasks. The proposed approach is showcased in simulation using a team of mobile robots executing coordinated multi-robot tasks.  ( 2 min )
    A Scalable Technique for Weak-Supervised Learning with Domain Constraints. (arXiv:2301.05253v1 [cs.LG])
    We propose a novel scalable end-to-end pipeline that uses symbolic domain knowledge as constraints for learning a neural network for classifying unlabeled data in a weak-supervised manner. Our approach is particularly well-suited for settings where the data consists of distinct groups (classes) that lend themselves to clustering-friendly representation learning and where the domain constraints can be reformulated for use with efficient mathematical optimization techniques by considering multiple training examples at once. We evaluate our approach on a variant of the MNIST image classification problem where a training example consists of image sequences and the sum of the numbers represented by the sequences, and show that our approach scales significantly better than previous approaches that rely on computing all constraint-satisfying combinations for each training example.  ( 2 min )
    Equivariant Representations for Non-Free Group Actions. (arXiv:2301.05231v1 [cs.LG])
    We introduce a method for learning representations that are equivariant with respect to general group actions over data. Unlike existing equivariant representation learners, our method is suitable for actions that are not free, i.e., that stabilize data via nontrivial symmetries. Our method is grounded in the orbit-stabilizer theorem from group theory, which guarantees that an ideal learner infers an isomorphic representation. Finally, we provide an empirical investigation on image datasets with rotational symmetries and show that taking stabilizers into account improves the quality of the representations.  ( 2 min )
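The orbit-stabilizer theorem referenced above says that, for a finite group G acting on a point x, |G| = |orbit(x)| * |stabilizer(x)|. A small sketch, assuming the cyclic group of 90-degree rotations acting on tiny binary images (the helper names and example grids are illustrative, not from the paper), shows a non-free action:

```python
from typing import List, Set, Tuple

Grid = Tuple[Tuple[int, ...], ...]


def rotate90(grid: Grid) -> Grid:
    """Rotate a square grid by 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def orbit_and_stabilizer(grid: Grid) -> Tuple[Set[Grid], List[int]]:
    """Orbit of `grid` under the 4 rotations, and the rotation counts k that fix it."""
    orbit: Set[Grid] = set()
    stabilizer: List[int] = []
    current = grid
    for k in range(4):
        orbit.add(current)
        if current == grid:
            stabilizer.append(k)
        current = rotate90(current)
    return orbit, stabilizer


corner = ((1, 0), (0, 0))    # no rotational symmetry: the action is free here
diagonal = ((1, 0), (0, 1))  # fixed by the 180-degree rotation: not free

for name, grid in [("corner", corner), ("diagonal", diagonal)]:
    orbit, stab = orbit_and_stabilizer(grid)
    print(name, len(orbit), stab, len(orbit) * len(stab))
```

For the diagonal image the stabilizer is nontrivial, which is the situation the abstract describes as an action that is not free.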
  • Open

    Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning. (arXiv:2301.05664v1 [cs.LG])
    In safety-critical decision-making scenarios, being able to identify worst-case outcomes, or dead-ends, is crucial in order to develop safe and reliable policies in practice. These situations are typically rife with uncertainty due to unknown or stochastic characteristics of the environment as well as limited offline training data. As a result, the value of a decision at any time point should be based on the distribution of its anticipated effects. We propose a framework to identify worst-case decision points, by explicitly estimating distributions of the expected return of a decision. These estimates enable earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task. We demonstrate the utility of Distributional Dead-end Discovery (DistDeD) in a toy domain as well as when assessing the risk of severely ill patients in the intensive care unit reaching a point where death is unavoidable. We find that DistDeD significantly improves over prior discovery approaches, providing indications of the risk 10 hours earlier on average as well as increasing detection by 20%.  ( 2 min )
    Scalable Estimation for Structured Additive Distributional Regression. (arXiv:2301.05593v1 [stat.CO])
    Recently, fitting probabilistic models has gained importance in many areas, but estimation of such distributional models with very large data sets is a difficult task. In particular, the use of rather complex models can easily lead to memory-related efficiency problems that can make estimation infeasible even on high-performance computers. We therefore propose a novel backfitting algorithm, which is based on the ideas of stochastic gradient descent and can deal with virtually any amount of data on a conventional laptop. The algorithm performs automatic selection of variables and smoothing parameters, and its performance is in most cases superior or at least equivalent to other implementations for structured additive distributional regression, e.g., gradient boosting, while maintaining low computation time. Performance is evaluated using an extensive simulation study and an exceptionally challenging and unique example of lightning count prediction over Austria. A very large dataset with over 9 million observations and 80 covariates is used, so that a prediction model cannot be estimated with standard distributional regression methods but can be with our new approach.  ( 2 min )
    Scalable Batch Acquisition for Deep Bayesian Active Learning. (arXiv:2301.05490v1 [cs.LG])
    In deep active learning, it is important to choose multiple examples to mark up at each step in order to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.  ( 2 min )
    On the infinite-depth limit of finite-width neural networks. (arXiv:2210.00688v3 [stat.ML] UPDATED)
    In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit where the pre-activations converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing change-of-regime phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential limit infinite-depth-then-infinite-width and compare it with the more commonly studied infinite-width-then-infinite-depth limit.  ( 2 min )
    Fully Adaptive Composition in Differential Privacy. (arXiv:2203.05481v2 [cs.LG] UPDATED)
    Composition is a key feature of differential privacy. Well-known advanced composition theorems allow one to query a private database quadratically more times than basic privacy composition would permit. However, these results require that the privacy parameters of all algorithms be fixed before interacting with the data. To address this, Rogers et al. introduced fully adaptive composition, wherein both algorithms and their privacy parameters can be selected adaptively. The authors introduce two probabilistic objects to measure privacy in adaptive composition: privacy filters, which provide differential privacy guarantees for composed interactions, and privacy odometers, time-uniform bounds on privacy loss. There are substantial gaps between advanced composition and existing filters and odometers. First, existing filters place stronger assumptions on the algorithms being composed. Second, these odometers and filters suffer from large constants, making them impractical. We construct filters that match the tightness of advanced composition, including constants, despite allowing for adaptively chosen privacy parameters. En route we also derive a privacy filter for approximate zCDP and approximate RDP. We also construct several general families of odometers. These odometers can match the tightness of advanced composition at an arbitrary, preselected point in time, or at all points in time simultaneously, up to a doubly-logarithmic factor. We obtain our results by leveraging recent advances in time-uniform martingale concentration. In sum, we show that fully adaptive privacy is obtainable at almost no loss, and conjecture that our results are essentially unimprovable (even in constants) in general.  ( 2 min )
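To make the "quadratically more queries" remark concrete, the sketch below compares the textbook basic and advanced composition bounds for k mechanisms that each satisfy pure epsilon-DP with fixed parameters. It does not implement the adaptive filters or odometers from the paper; the function names and the numeric values are illustrative assumptions.

```python
import math


def basic_composition(eps: float, k: int) -> float:
    """Total epsilon under basic composition: privacy losses add up linearly."""
    return k * eps


def advanced_composition(eps: float, k: int, delta_slack: float) -> float:
    """Total epsilon under the standard advanced composition theorem.

    The composed mechanism is (eps_total, delta_slack)-DP when each of the
    k mechanisms is eps-DP, with
    eps_total = sqrt(2 k ln(1/delta_slack)) * eps + k * eps * (e^eps - 1).
    """
    return math.sqrt(2 * k * math.log(1 / delta_slack)) * eps + k * eps * (
        math.exp(eps) - 1
    )


eps, k, delta_slack = 0.01, 10_000, 1e-6  # hypothetical parameters
print("basic:   ", basic_composition(eps, k))
print("advanced:", advanced_composition(eps, k, delta_slack))
```

For small per-query eps the leading term grows like the square root of k rather than k, which is why a fixed total budget supports roughly quadratically more queries, at the cost of a small additional failure probability delta_slack.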
    Stable Probability Weighting: Large-Sample and Finite-Sample Estimation and Inference Methods for Heterogeneous Causal Effects of Multivalued Treatments Under Limited Overlap. (arXiv:2301.05703v1 [econ.EM])
    In this paper, I try to tame "Basu's elephants" (data with extreme selection on observables). I propose new practical large-sample and finite-sample methods for estimating and inferring heterogeneous causal effects (under unconfoundedness) in the empirically relevant context of limited overlap. I develop a general principle called "Stable Probability Weighting" (SPW) that can be used as an alternative to the widely used Inverse Probability Weighting (IPW) technique, which relies on strong overlap. I show that IPW (or its augmented version), when valid, is a special case of the more general SPW (or its doubly robust version), which adjusts for the extremeness of the conditional probabilities of the treatment states. The SPW principle can be implemented using several existing large-sample parametric, semiparametric, and nonparametric procedures for conditional moment models. In addition, I provide new finite-sample results that apply when unconfoundedness is plausible within fine strata. Since IPW estimation relies on the problematic reciprocal of the estimated propensity score, I develop a "Finite-Sample Stable Probability Weighting" (FPW) set-estimator that is unbiased in a sense. I also propose new finite-sample inference methods for testing a general class of weak null hypotheses. The associated computationally convenient methods, which can be used to construct valid confidence sets and to bound the finite-sample confidence distribution, are of independent interest. My large-sample and finite-sample frameworks extend to the setting of multivalued treatments.  ( 2 min )
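The "problematic reciprocal" of the propensity score mentioned above is easiest to see in the standard IPW estimator for a binary treatment. The sketch below shows only that baseline estimator, not the SPW or FPW methods the paper proposes; the data and function names are made up for illustration.

```python
from typing import Sequence


def ipw_ate(
    outcomes: Sequence[float],
    treatments: Sequence[int],
    propensities: Sequence[float],
) -> float:
    """Standard IPW estimate of the average treatment effect (binary treatment)."""
    n = len(outcomes)
    treated = sum(t * y / e for y, t, e in zip(outcomes, treatments, propensities)) / n
    control = sum(
        (1 - t) * y / (1 - e) for y, t, e in zip(outcomes, treatments, propensities)
    ) / n
    return treated - control


def max_weight(treatments: Sequence[int], propensities: Sequence[float]) -> float:
    """Largest inverse-probability weight; it blows up as overlap becomes limited."""
    return max(
        1 / e if t == 1 else 1 / (1 - e) for t, e in zip(treatments, propensities)
    )


y = [1.0, 2.0, 3.0, 4.0]
t = [1, 0, 1, 0]
print(ipw_ate(y, t, [0.5, 0.5, 0.5, 0.5]))  # balanced propensities
print(max_weight(t, [0.5, 0.5, 0.5, 0.5]))  # moderate weights
print(max_weight(t, [0.01, 0.5, 0.5, 0.5])) # one treated unit dominates
```

A treated unit with propensity 0.01 receives a weight of about 100, so a single observation can dominate the estimate, which is the instability that limited overlap creates.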
    Memory Efficient Continual Learning with Transformers. (arXiv:2203.04640v2 [cs.CL] UPDATED)
    In many real-world scenarios, data to train machine learning models becomes available over time. Unfortunately, these models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting and it is difficult to prevent due to practical constraints. For instance, the amount of data that can be stored or the computational resources that can be used might be limited. Moreover, applications increasingly rely on large pre-trained neural networks, such as pre-trained Transformers, since the resources or data might not be available in sufficiently large quantities to practitioners to train the model from scratch. In this paper, we devise a method to incrementally train a model on a sequence of tasks using pre-trained Transformers and extending them with Adapters. Unlike existing approaches, our method is able to scale to a large number of tasks without significant overhead and allows sharing information across tasks. On both image and text classification tasks, we empirically demonstrate that our method maintains a good predictive performance without retraining the model or increasing the number of model parameters over time. The resulting model is also significantly faster at inference time compared to Adapter-based state-of-the-art methods.  ( 2 min )
    A fully Bayesian sparse polynomial chaos expansion approach with joint priors on the coefficients and global selection of terms. (arXiv:2204.06043v2 [stat.CO] UPDATED)
    Polynomial chaos expansion (PCE) is a versatile tool widely used in uncertainty quantification and machine learning, but its successful application depends strongly on the accuracy and reliability of the resulting PCE-based response surface. High accuracy typically requires high polynomial degrees, demanding many training points especially in high-dimensional problems through the curse of dimensionality. So-called sparse PCE concepts work with a much smaller selection of basis polynomials compared to conventional PCE approaches and can overcome the curse of dimensionality very efficiently, but have to pay specific attention to their strategies of choosing training points. Furthermore, the approximation error resembles an uncertainty that most existing PCE-based methods do not estimate. In this study, we develop and evaluate a fully Bayesian approach to establish the PCE representation via joint shrinkage priors and Markov chain Monte Carlo. The suggested Bayesian PCE model directly aims to solve the two challenges named above: achieving a sparse PCE representation and estimating uncertainty of the PCE itself. The embedded Bayesian regularizing via the joint shrinkage prior allows using higher polynomial degrees for given training points due to its ability to handle underdetermined situations, where the number of considered PCE coefficients could be much larger than the number of available training points. We also explore multiple variable selection methods to construct sparse PCE expansions based on the established Bayesian representations, while globally selecting the most meaningful orthonormal polynomials given the available training data. We demonstrate the advantages of our Bayesian PCE and the corresponding sparsity-inducing methods on several benchmarks.  ( 2 min )
    Detection problems in the spiked matrix models. (arXiv:2301.05331v1 [math.ST])
    We study the statistical decision process of detecting the low-rank signal from various signal-plus-noise type data matrices, known as the spiked random matrix models. We first show that the principal component analysis can be improved by entrywise pre-transforming the data matrix if the noise is non-Gaussian, generalizing the known results for the spiked random matrix models with rank-1 signals. As an intermediate step, we find sharp phase transition thresholds for the extreme eigenvalues of spiked random matrices, which generalize the Baik-Ben Arous-P\'{e}ch\'{e} (BBP) transition. We also prove the central limit theorem for the linear spectral statistics for the spiked random matrices and propose a hypothesis test based on it, which does not depend on the distribution of the signal or the noise. When the noise is non-Gaussian, the test can be improved with an entrywise transformation to the data matrix with additive noise. We also introduce an algorithm that estimates the rank of the signal when it is not known a priori.  ( 2 min )
    Efficient and robust transfer learning of optimal individualized treatment regimes with right-censored survival data. (arXiv:2301.05491v1 [stat.ME])
    An individualized treatment regime (ITR) is a decision rule that assigns treatments based on patients' characteristics. The value function of an ITR is the expected outcome in a counterfactual world had this ITR been implemented. Recently, there has been increasing interest in combining heterogeneous data sources, such as leveraging the complementary features of randomized controlled trial (RCT) data and a large observational study (OS). Usually, a covariate shift exists between the source and target population, rendering the source-optimal ITR not necessarily optimal for the target population. We present an efficient and robust transfer learning framework for estimating the optimal ITR with right-censored survival data that generalizes well to the target population. The value function accommodates a broad class of functionals of survival distributions, including survival probabilities and restricted mean survival times (RMSTs). We propose a doubly robust estimator of the value function, and the optimal ITR is learned by maximizing the value function within a pre-specified class of ITRs. We establish the $N^{-1/3}$ rate of convergence for the estimated parameter indexing the optimal ITR, and show that the proposed optimal value estimator is consistent and asymptotically normal even with flexible machine learning methods for nuisance parameter estimation. We evaluate the empirical performance of the proposed method by simulation studies and a real data application of sodium bicarbonate therapy for patients with severe metabolic acidaemia in the intensive care unit (ICU), combining an RCT and an observational study with heterogeneity.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v2 [cs.LG] UPDATED)
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated with the experimental characterization and posterior learning process of this kind of system. Predictions can be made, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.  ( 2 min )
    Global Riemannian Acceleration in Hyperbolic and Spherical Spaces. (arXiv:2012.03618v5 [math.OC] UPDATED)
    We further research on the accelerated optimization phenomenon on Riemannian manifolds by introducing accelerated global first-order methods for the optimization of $L$-smooth and geodesically convex (g-convex) or $\mu$-strongly g-convex functions defined on the hyperbolic space or a subset of the sphere. For a manifold other than the Euclidean space, these are the first methods to \emph{globally} achieve the same rates as accelerated gradient descent in the Euclidean space with respect to $L$ and $\epsilon$ (and $\mu$ if it applies), up to log factors. Due to the geometric deformations, our rates have an extra factor, depending on the initial distance $R$ to a minimizer and the curvature $K$, with respect to Euclidean accelerated algorithms. As a proxy for our solution, we solve a constrained non-convex Euclidean problem, under a condition between convexity and \emph{quasar-convexity}, of independent interest. Additionally, for any Riemannian manifold of bounded sectional curvature, we provide reductions from optimization methods for smooth and g-convex functions to methods for smooth and strongly g-convex functions and vice versa. We also reduce global optimization to optimization over bounded balls where the effect of the curvature is reduced.  ( 2 min )

  • Open

    Looking for a CV/ML freelancer
    We are currently looking for someone to create an app that works for images and video where the user would highlight the outline of the person in the image or video and the app would return the image and video of the person with a transparent background. The user could then go back and keep highlighting to refine the image or video of the person. If the image or video of said person is good they would just save it on the app itself. We would want this app to be made with Swift for iOS and preferably on edge. At the end just send over the project folder. DM if you are interested. submitted by /u/bluebamboo3 [link] [comments]  ( 45 min )
    This AI can clone your voice! VALL-E (explained)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 44 min )
    A Video Script I Made Using ChatGPT & Pictory
    submitted by /u/WolfAmoux [link] [comments]  ( 44 min )
    6. Is it possible for crime to increase due to AI in the workforce?
    submitted by /u/Big-Heron-7955 [link] [comments]  ( 44 min )
    Did this mostly for fun once I found out the CW's shared superhero multiverse was coming to a close. Figured I might as well post it here for anyone who's interested.
    submitted by /u/Eleganos [link] [comments]  ( 44 min )
    Make a Drake freestyle with AI
    Drake - Zesty (New Unreleased Freestyle) ​ Recently just uploaded his new freestyle on youtube. Pretty lit and original. Check it out: https://youtu.be/060cUnsoYo8 submitted by /u/Jay-Query [link] [comments]  ( 44 min )
    The media firms & publisher are beginning their fight back
    submitted by /u/MrEloi [link] [comments]  ( 45 min )
    Artists sue Stability AI, Midjourney and DeviantArt
    submitted by /u/Peaking_AI [link] [comments]  ( 44 min )
    Weekly China AI News: Alibaba Predicts Generative AI as Top Tech Trend for 2023, China's AI Computing Surpasses General Computing, and AI Can Recognize Lip Syncing
    submitted by /u/trcytony [link] [comments]  ( 44 min )
    Artificial intelligence
    submitted by /u/ZahraMuxammed [link] [comments]  ( 43 min )
    5 AI Content Detector Tools You Should Know About! (ChatGPT Not Included)
    submitted by /u/Chisom1998_ [link] [comments]  ( 50 min )
    Interview with the guy who developed wifi human movement tracking
    It's about halfway down this newsletter. How long before this just turns into the Dark Knight surveillance scenario? Apparently you can use WiFi antennas to track people. submitted by /u/jrstelle [link] [comments]  ( 45 min )
    I made my first website with AI. It is Ingredient Genie, and it creates recipes based on your ingredients.
    submitted by /u/MightyMercenary0 [link] [comments]  ( 45 min )
    🚀Muse:Text 2 Image Generation via Masked Generative Transformers
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Create an image generation model that takes concept art and turns it into a character sheet
    Hey all! I have been experimenting with a few AI apps (I was lucky enough to get into the leonardo beta). I am mainly interested in creating character sheets for 3D modelling like this: ​ https://preview.redd.it/7vv9zotsbfca1.jpg?width=564&format=pjpg&auto=webp&s=37df5fc0affb728338a3517d5710453b72b0830b Initially, I thought that training a model with a bunch of character sheets from google images would suffice. I figured if I used an image of a character (think general concept art, without the T-pose and different views) as the "input" image, it would spit out something resembling a character sheet, only with the character I want in it. This didn't work, it ended up re-imagining the character with the art style of the character sheet (hand drawn lines, cartoony): ​ https://preview.redd.it/rp6hh8qgdfca1.png?width=1486&format=png&auto=webp&s=81fb7607cbd58a938f25314c471772aaa5d8916c So I guess my question is: Is there an AI service/app/tool that might accomplish this? If not, what other methods should I look into? submitted by /u/matthew798 [link] [comments]  ( 48 min )
    OpenAI’s ChatGPT: The 10 Worst Things to Expect
    submitted by /u/liquidocelotYT [link] [comments]  ( 43 min )
    Interactive Evolutionary Computation and ChatGPT
    submitted by /u/BenjaminJamesBush [link] [comments]  ( 45 min )
    AI Host that independently runs Live shows (on Fb, Youtube, Twitch)
    Hi there! Making my first contribution to this subreddit and sharing something you probably haven't heard of (I might be wrong though :) So, this is about AI host for Live quizzes. The AI generated avatar can run a live quiz and doesn't need any pre-made scripts or your help. Works only on Facebook Live, Youtube and Twitch. What do you think? ​ https://preview.redd.it/94gb6wmc6fca1.png?width=1080&format=png&auto=webp&s=a14bfd0b1140e887d1b7c615fd0ef00e88373ce6 submitted by /u/AnnetWw [link] [comments]  ( 50 min )
    What is reinforcement learning from human feedback (RLHF)?
    submitted by /u/bendee983 [link] [comments]  ( 44 min )
    AI in Education: The Good, the Bad, and the Downright Confusing
    submitted by /u/pauerrrr [link] [comments]  ( 44 min )
    Android AI Assistant - Use GPT from anywhere!
    submitted by /u/better__ideas [link] [comments]  ( 44 min )
    I got ChatGPT to create a new joke. I would never have thought this possible.
    submitted by /u/Ivorius [link] [comments]  ( 48 min )
    Didn't a man invent ChatGPT?
    submitted by /u/Imagine-your-success [link] [comments]  ( 44 min )
    I made SaaS AI Tools, a collection of 400+ AI tools & daily AI news in one place.
    Hey, Over the past couple months, I've been collecting AI tools & generators and decided to put them into a website. The result is SaaS AI Tools, a growing collection of 400+ generative AI tools to help supercharge your creativity and take your business to the next level. Also, to differentiate a bit - I've added another section that involves a feed of daily AI articles, so you can keep up-to-date on the top AI headlines. This is how I'm personally keeping up with all the AI stuff today. I'll be adding more tools and news sources soon. I've launched the website on Product Hunt and would appreciate any of your support 🙏 submitted by /u/Hairy_Milk8431 [link] [comments]  ( 56 min )
    production still from 1976 of Alejandro Jodorowsky’s Spaceballs
    submitted by /u/dag [link] [comments]  ( 44 min )
  • Open

    [D] GCN datasets
    Hello everyone, I have a question about GCNs and would appreciate any thoughts. Do we typically use only one graph for GCN training/inference? I'm asking this because when I looked at the official DGL website, there was only one example graph after loading a dataset. Based on my experience with DNNs, I expected a batch of examples. However, that was not the case for GCNs. I could find the PPI dataset with multiple graph examples (24), but for other widely used datasets (e.g., Cora, Citeseer, and Pubmed), there was only one. Thank you! submitted by /u/ramya_1995 [link] [comments]  ( 56 min )
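    For context on the single-graph case: in node classification the "examples" are the nodes of one graph, and boolean masks decide which nodes count towards training or testing. The toy below is a dependency-free sketch of that layout; the graph, features, and masks are invented for illustration and this is not DGL code.

```python
# One small undirected path graph with 5 nodes; every node is one example.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
features = [1.0, 2.0, 3.0, 4.0, 5.0]
train_mask = [True, True, False, False, False]
test_mask = [False, False, False, True, True]


def neighbor_mean(features, edges):
    """One message-passing step: average each node's feature with its neighbors'."""
    n = len(features)
    neighbors = {i: [i] for i in range(n)}  # start with a self-loop
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    return [sum(features[j] for j in neighbors[i]) / len(neighbors[i]) for i in range(n)]


# Propagation runs over the whole graph at once...
hidden = neighbor_mean(features, edges)
# ...but only the masked nodes would contribute to the training loss.
train_nodes = [i for i, keep in enumerate(train_mask) if keep]
train_inputs = [hidden[i] for i in train_nodes]
print(hidden, train_inputs)
```

    Datasets made of many separate graphs, like the PPI one mentioned above, are the case where batching over graphs comes back.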
    [P] Looking for a CV/ML freelancer
    We are currently looking for someone to create an app that works for images and video where the user would highlight the outline of the person in the image or video and the app would return the image and video of the person with a transparent background. The user could then go back and keep highlighting to refine the image or video of the person. If the image or video of said person is good they would just save it on the app itself. We would want this app to be made with Swift for iOS and preferably on edge. At the end just send over the project folder. DM if you are interested. submitted by /u/bluebamboo3 [link] [comments]  ( 56 min )
    [D] Model for detecting rectangle corners?
    What model structure would be recommended for detecting the coordinates of all 4 corners of a rectangle (e.g. index cards)? Most object detection models like YOLO produce rectangular bounding boxes; what tweaks can be made to trace the object regardless of orientation? For my specific problem, classical edge/corner detectors aren't a good fit - so I'm falling back on ML. Currently have a dataset of about 1500 domain-specific labeled images; hoping to train a model on TF. Thanks for the suggestions! Edit: here are a few examples from my dataset. The green dots aren't part of the images; they just show how the corners are annotated: https://preview.redd.it/2f8uimhn7hca1.jpg?width=1373&format=pjpg&auto=webp&s=3a3757a6d3ab0f07aa3cde09f1b4acd0573f3d75 https://preview.redd.it/ujb8tmhn7hca1.jpg?width=3024&format=pjpg&auto=webp&s=e1a60b4322e3f20c10f193cb3102658975858c92 https://preview.redd.it/9lzgfmhn7hca1.jpg?width=3024&format=pjpg&auto=webp&s=a0cf4d760b48267d7c273f892284472f296f72be submitted by /u/hundley10 [link] [comments]  ( 57 min )
    [R] The Predictive Forward-Forward Algorithm
    Abstract: In this work, we propose a generalization of the forward-forward (FF) algorithm that we call the predictive forward-forward (PFF) algorithm. Specifically, we design a dynamic, recurrent neural system that learns a directed generative circuit jointly and simultaneously with a representation circuit, combining elements of predictive coding, an emerging and viable neurobiological process theory of cortical function, with the forward-forward adaptation scheme. Furthermore, PFF efficiently learns to propagate learning signals and updates synapses with forward passes only, eliminating some of the key structural and computational constraints imposed by a backprop-based scheme. Besides computational advantages, the PFF process could be further useful for understanding the learning mechanisms behind biological neurons that make use of local (and global) signals despite missing feedback connections [11]. We run several experiments on image data and demonstrate that the PFF procedure works as well as backprop, offering a promising brain-inspired algorithm for classifying, reconstructing, and synthesizing data patterns. As a result, our approach presents further evidence of the promise afforded by backprop-alternative credit assignment algorithms within the context of brain-inspired computing. Paper: https://arxiv.org/pdf/2301.01452.pdf submitted by /u/radi-cho [link] [comments]  ( 57 min )
    [D] On generated content and the future of moderation
    Over the past three years, ML has advanced considerably in audio, visual, and natural language generation. For users like me, GPT-3 was a first look into the types of content that can now be generated with minor effort. While impressive at first, the outputs from the original GPT-3 can quickly be seen to be less than ideal and can often be easily distinguished from original content written by users. Three years later, generation techniques have improved to the point where the task of detecting generated content is far more difficult as the quality of the generated content has risen considerably. Access to such technologies has also spread to the point where states such as New York see it as enough of a threat to ban it from schools. While I think we are still in the calm before the storm with regard to the potential for chaos such models have, I'd like to open the floor for a discussion on the implications of generative models and ways we can address them. Will it even be possible to moderate content in the future when models improve to the point where artifacts from the generation process are no longer present? Sure, we can have models that detect NSFW content, but what about content that contains information that is false and harmful? Perhaps a resurgence in symbolic AI and rule-based reasoning is needed? Or perhaps a renewed interest in the field of argument mining? submitted by /u/sparkinflint [link] [comments]  ( 57 min )
    [D] Recommendation for best toolkit to manually annotate tunnels in 3D Volume?
    I have a 3D tiff stack which contains some holes inside the volume that branch out in various directions. I wanted to annotate the holes inside the volume and I came across several tools that I can use to manually annotate them like 3D slicer, ITK-SNAP, and ImageJ. But I am unfamiliar with all of these tools and I was wondering which one would be most helpful for me? My ultimate goal is to apply volume registration using the annotated holes as keypoints to fuse volumes together. submitted by /u/waterstrider123 [link] [comments]  ( 56 min )
    [D] Visualizations for NSFW models
    Hi all, I am looking for someone to help me with my research project. I want to use Grad-CAM (or any other tool) to visualize state-of-the-art CNN predictions like those of Clarifai. submitted by /u/jeditwisted [link] [comments]  ( 54 min )
    [P] A small tool that shuts down your machine when GPU utilization drops too low.
    Hey /r/machinelearning, Long time reader, first time posting non-anonymously. I've been training models using various cloud services, but as an individual user it's stressful for me to worry about shutting down the instances if training fails or stops. Crashes, bad code, etc can cause GPU utilization to drop without the program successfully "finishing", and this idle time can cost a lot of money if you don't catch it quickly. Thus, I built this tiny lil tool to help. It watches the GPU utilization of your instance, and performs an action if it drops too low for too long. For example, shutdown the instance if GPU usage drops under 30% for 5 minutes. It's easy to use and install, just pip install gpu_sentinel If this is useful please leave comments here or on the Github page: https://github.com/moonshinelabs-ai/gpu_sentinel I'm hoping it helps save some other folks money! submitted by /u/nateharada [link] [comments]  ( 62 min )
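    The rule described in the post ("below 30% for 5 minutes") boils down to a small state machine over utilization samples. The sketch below shows only that decision logic, on made-up samples; it is not the gpu_sentinel implementation, the function name is hypothetical, and reading the GPU or shutting the machine down is deliberately left out.

```python
def first_trigger_time(samples, threshold=30.0, patience=300.0):
    """Return the timestamp at which utilization has stayed below `threshold`
    for at least `patience` seconds, or None if that never happens.

    samples: iterable of (timestamp_seconds, gpu_utilization_percent) in time order.
    """
    low_since = None  # start of the current low-utilization streak
    for timestamp, utilization in samples:
        if utilization < threshold:
            if low_since is None:
                low_since = timestamp
            if timestamp - low_since >= patience:
                return timestamp
        else:
            low_since = None  # any busy sample resets the streak
    return None


# Hypothetical one-minute samples: busy for two minutes, then idle.
busy_then_idle = [(0, 95.0), (60, 90.0)] + [(120 + 60 * i, 5.0) for i in range(7)]
brief_dips = [(0, 95.0), (60, 10.0), (120, 95.0), (180, 10.0)]
print(first_trigger_time(busy_then_idle), first_trigger_time(brief_dips))
```

    Resetting the streak on any busy sample is what keeps short dips, such as a checkpoint save or a data-loading stall, from triggering the action.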
    [D] I’m a Machine Learning Engineer for FAANG companies. What are some places I can get started doing freelance work for ML?
    I have around 6 YoE doing MLE full time work for various companies. Starting to get tired of working for these big companies and would prefer trying some freelance work. Where are some websites or places I can get started? I’ve seen UpWork, but this seemed more suited for quick one off, software work and less for complex ML tasks last time I was on there (tried that several years ago in 2019). submitted by /u/doctorjuice [link] [comments]  ( 60 min )
    [D] Fine-tuning open source models on specific tasks to compete with ChatGPT?
    As the title says, I'm curious about using open source models like GPT-J, GPT-NeoX, Bloom, or OPT to compete with ChatGPT for *specific use-cases* such as explaining what a bit of code does. ChatGPT does this task quite well, but its closed-source nature prevents it from being useful in documenting or commenting proprietary code. There are also limitations such as the amount of text ChatGPT will read or respond with. Getting beyond these limitations is something I'm interested in pursuing, perhaps with the help of someone in this subreddit. Some assumptions you can safely make: (1) We can get (lots of) funding for the training, hardware, etc. (2) The end product should be on-premises. (3) The inference does not actually need to run very quickly. If it costs millions to buy enough GPUs just due to VRAM limitations, we could simply run on CPUs and utilize RAM, as long as inference could be done a few times per day. So I guess my questions are: Where would we start? What model is best to fine-tune? How would you specifically fine-tune to improve specific use cases? submitted by /u/jaqws [link] [comments]  ( 56 min )
    Looking for papers to warm start a BERT large model from BERT base. Are there papers around it?
    Warm starting the model training of BERT large using BERT base. One idea is to concatenate a bunch of parameters and start training. Is there a research paper that tries out the best methods? submitted by /u/Plane-Interaction-68 [link] [comments]  ( 52 min )
    [D] Tim Dettmers' GPU advice blog updated for 4000 series
    The legendary Tim Dettmers has updated his blog on which GPU to purchase for Deep learning to include advice for the latest GPU series: https://timdettmers.com/2023/01/16/which-gpu-for-deep-learning/ submitted by /u/init__27 [link] [comments]  ( 60 min )
    [D] The Illustrated Stable Diffusion (Video)
    I'll be honest with you, it took me months to wrap my head around diffusion models. A couple of iterations of a blog post later and this is my best shot at a gentle intro to Stable Diffusion and how it works. https://youtu.be/MXmacOUJUaw The part that took the most reworking is forward diffusion and how to best describe it. Thanks to the many people acknowledged in the blog post who have helped me both understand it and explain it better. Hope you find it helpful. Let me know if you have any questions or feedback. submitted by /u/jayalammar [link] [comments]  ( 57 min )
    [D] Can ChatGPT flag its own writings?
    My question is whether it is possible to feed a direct quote into ChatGPT and ask it if ChatGPT is the author of said quote. If not, is it reasonable to insist that it can do so in the future? submitted by /u/MrSpotgold [link] [comments]  ( 69 min )
    [R] [2301.00250] DensePose From WiFi
    submitted by /u/GreatCosmicMoustache [link] [comments]  ( 50 min )
    [D] Grid searching data pre processing permutations when training models on structured data.
    Hello, I am currently working on a structured data classification problem for work. I was applying multiple different data preprocessing steps including imputing null values (mean, KNN, random forest), adding synthetic data (SMOTE, ADASYN or None), normalization (l1, l2, max or none), multiple datasets (including different sets of features), as well as different models (XGBoost, Random Forest, Logistic Regression, KNN, MLP). What I built was a tool that trains all the different permutations of data processing, datasets and models to find the best one, and applies K-Fold cross validation. The tool stores all the data and metrics using MLFlow. This is similar to a grid search across hyperparameters, but instead of tuning the hyperparameters, I am tuning the data processing steps. I like this method because I gain a level of confidence knowing that I have exhausted all the possible models, data, and preprocessing permutations when selecting the best performing model. I was wondering if other people apply a similar technique for structured data problems? Besides the compute, is there anything to be cautious of when applying this method? submitted by /u/spiritualquestions [link] [comments]  ( 57 min )
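    The search described in the post is a Cartesian product over pipeline choices. A minimal sketch of just the enumeration step, using the option names listed above as plain strings, might look like this; the dictionary keys and the `iter_configs` helper are hypothetical, the dataset axis is omitted, and no training or cross validation is performed here.

```python
from itertools import product

# Option names taken from the post; the keys are illustrative labels.
search_space = {
    "imputer": ["mean", "knn", "random_forest"],
    "oversampler": ["smote", "adasyn", None],
    "normalizer": ["l1", "l2", "max", None],
    "model": ["xgboost", "random_forest", "logistic_regression", "knn", "mlp"],
}


def iter_configs(space):
    """Yield one dict per combination of preprocessing and model choices."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 3 * 3 * 4 * 5 combinations
```

    Each yielded dict would then be turned into a pipeline and scored with K-fold cross validation. Since the best score is picked from many candidates, it tends to be optimistic, so a separate held-out test set is a common safeguard.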
    [D] SOTA on multiple image generation from text
    Wondering what the state of the art is for multiple image generation for an input text, or a series of input texts. To clarify, are there any models or architectures that explore consistency between generated images (e.g. stylistically, same people in the images, same settings, etc.)? I imagine there would be some pre-existing architectures that could take an image embedding along with a text embedding. submitted by /u/weelamb [link] [comments]  ( 56 min )
    [P] Nano GPT
    submitted by /u/trekhleb [link] [comments]  ( 56 min )
  • Open

    Pretraining quadrupeds: a case study in RL as an engineering tool
    submitted by /u/robotphilanthropist [link] [comments]  ( 54 min )
    Is there a publicly available state space model for the Lunar Lander environment?
    The Lunar Lander environment uses the box2d engine to simulate physics. I was wondering if there is code somewhere which explicitly models the environment as a state-space model? LunarLander code: https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py submitted by /u/HazrMard [link] [comments]  ( 57 min )
    Poker (NLH) model?
    Is there any open source model for online poker yet? Of course Pluribus was a big deal a few years ago but it’s closed source (and much has changed since), but with the recent OS Rocket League AI stomping pros I have to wonder why nothing has come to the surface with poker yet. Even a 5% improvement on human play would be a big deal in the long run. Is poker that hard? Or is there some model I’m unaware of? Thanks submitted by /u/enterguild [link] [comments]  ( 54 min )
    Need help creating an action space compatible with stable baselines
    I'm trying to train a bot to play a game and am having trouble creating an action space to handle the inputs, which are the wasd keys, space bar for jumping, left click for shooting, and also two continuous values to indicate the coordinates the mouse should move to. At first, I tried to use spaces.Tuple to combine a MultiDiscrete space for the key presses, and a Box space for the mouse movement. However, I quickly found that none of the stable baseline models support tuples. So I looked online and found an idea to change all of my discrete values to continuous values and round to the nearest integer. This sounded promising, so I created an action space like so: # Game window bounds to provide range mouse can move windowX = self._game_window_bounds[0] windowY = self._game_window_bounds[…  ( 63 min )
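    The "make everything continuous and round" idea from the post can be sketched as a decoder that sits between the policy output and the game. The version below assumes a flat action vector of 8 values in [0, 1] (six buttons, then mouse x and y), which could be declared as a single Box space; the key names, the 0.5 threshold, and the window size are assumptions for illustration, not the poster's code.

```python
# Assumed layout of the flat action vector: six button values, then mouse x, mouse y.
KEYS = ("w", "a", "s", "d", "space", "left_click")


def decode_action(action, window_width, window_height):
    """Split a flat action vector with values in [0, 1] into key presses and a mouse target."""
    if len(action) != len(KEYS) + 2:
        raise ValueError(f"expected {len(KEYS) + 2} values, got {len(action)}")
    clipped = [min(1.0, max(0.0, float(value))) for value in action]
    # Threshold the button values instead of rounding them.
    pressed = {key: clipped[i] >= 0.5 for i, key in enumerate(KEYS)}
    # Scale the last two values to pixel coordinates inside the window.
    mouse_x = int(round(clipped[-2] * (window_width - 1)))
    mouse_y = int(round(clipped[-1] * (window_height - 1)))
    return pressed, (mouse_x, mouse_y)


pressed, mouse = decode_action([0.9, 0.1, 0.2, 0.7, 0.0, 1.0, 0.5, 0.25], 801, 601)
print(pressed, mouse)
```

    Normalizing the mouse coordinates to [0, 1] and scaling them in the decoder keeps all action dimensions on the same range, which is usually easier for continuous-control algorithms than mixing 0/1 buttons with pixel values.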
    SKRL (reinforcement learning library) version 0.9.0 is now available!
    skrl-v0.9.0 is now available! skrl is an open-source modular library for Reinforcement Learning written in Python (using PyTorch) and designed with a focus on readability, simplicity, and transparency of algorithm implementation. In addition to supporting the OpenAI Gym / Farama Gymnasium, DeepMind, and other environment interfaces, it allows loading and configuring NVIDIA Isaac Gym and NVIDIA Omniverse Isaac Gym environments, enabling agents’ simultaneous training by scopes (subsets of environments among all available environments), which may or may not share resources, in the same run. Visit https://skrl.readthedocs.io to get started!! The major changes in this release are: Added: Support for Farama Gymnasium interface; Wrapper for robosuite environments; Weights & Biases integration; Set the running mode (training or evaluation) of the agents; Allow clipping of the gradient norm for DDPG, TD3, and SAC agents; Initialize model biases; Add RNN (RNN, LSTM, GRU, and any other variant) support for A2C, DDPG, PPO, SAC, TD3, and TRPO agents; Allow disabling training/evaluation progressbar; Farama Shimmy and robosuite examples; KUKA LBR iiwa real-world example; More benchmarking results. Changed: Forward model inputs as a Python dictionary [breaking change]; Returns a Python dictionary with extra output values in model calls [breaking change]; Adopt the implementation of terminated and truncated over done for all environments. Fixed: Omniverse Isaac Gym simulation speed for the Franka Emika real-world example; Call agents' method record_transition instead of the parent method to allow storing samples in memories during the evaluation; Move TRPO policy optimization out of the value optimization loop; Access to the categorical model distribution; Call reset only once for Gym/Gymnasium vectorized environments. Removed: Deprecated method start in trainers. submitted by /u/Toni-SM [link] [comments]  ( 60 min )
    Hyperparameters for pick&place with Franka Emika manipulator
    I'm trying to solve pick&place (and possibly also the other tasks in this repository) with the Franka Emika Panda manipulator simulated in Mujoco. I've tried for a long time with stable_baselines3 but without any results. Someone told me to try RLLib because it has a better implementation (?), but I still can't find any solution... submitted by /u/riccardogauss [link] [comments]  ( 51 min )
    Best Books to Learn Reinforcement Learning
    submitted by /u/Lakshmireddys [link] [comments]  ( 53 min )
    I'm understanding theory; hard time figuring out how to implement it
    Currently, I'm following David Silver's course along with Sutton and Barto's Introduction to Reinforcement Learning. While these are both fantastic I'm having a hard time thinking of how I can actually implement them in code; mainly getting the environment and agent to be connected. Any help would be appreciated. ​ EDIT: In general I'm also interested in how exactly models stay trained in an environment as I would imagine the program would have to run continuously or else it would have to relearn the task every time. submitted by /u/CaptiDoor [link] [comments]  ( 56 min )
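    One way to see how the agent and environment connect is a complete toy example. The sketch below is tabular Q-learning on an invented five-state chain (the `ChainEnv` class and all hyperparameters are made up for illustration, loosely following the reset/step convention of Gym-style environments). It also touches on the EDIT: what "stays trained" is the learned table, or the network weights in the deep case, which can be saved and loaded instead of relearned.

```python
import json
import random


class ChainEnv:
    """Toy environment: states 0..4, start at 0, reward 1 for reaching the last state."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.state = min(self.n_states - 1, max(0, self.state + move))
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done


def train(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Agent side: pick an action (explore, or break ties, at random).
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Environment side: apply the action, observe the outcome.
            next_state, reward, done = env.step(action)
            # Learning step: move Q(s, a) towards the bootstrapped target.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train(ChainEnv())
saved = json.dumps(q_table)    # in practice this string would be written to a file
restored = json.loads(saved)   # a later run can load it instead of retraining
greedy = [0 if left > right else 1 for left, right in restored[:-1]]
print(greedy)
```

    The same loop structure carries over to larger problems: the environment class is swapped for a simulator, and the table for a function approximator whose parameters are checkpointed the same way.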
    Question about designing the reward function
    Hi all, I am struggling to design a reward function for the following system: It has two joints, q1 and q2, that cannot be actuated at the same time. Once q1 is actuated, the system has to wait for 5 seconds to activate q2. The task is to reach a goal position x and y with the system by alternately using q1 and q2. So far the reward function looks like this: reward = 1/(1+pos_error) And the observation vector like this: obs = (dof_pos, goal_pos, pos_error) To make the robot alternate between q1 and q2, I use two masks: q1_mask = (1, 0) and q2_mask = (0, 1), which are applied alternately so that only one joint is actuated at a time. But I am not sure how to implement the second condition that the system needs 5 seconds to activate q2 after q1. So far I am just storing the time that q1 has been activated and replace the actions by 0: self.actions = torch.where( (self.q2_activation > 0) & (self.q2_activation_time_diff > 5) , self.actions * q2_mask, self.actions ) I think the agent gets confused because nothing changes as a result of its actions. How would you approach this problem? submitted by /u/Fun-Moose-3841 [link] [comments]  ( 59 min )
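    One possible way to frame the cooldown, sketched here in plain Python rather than torch: apply the masks in a single function, and add the remaining wait time to the observation so the agent can tell when its q2 commands will have no effect. Giving q1 priority when both joints are commanded, the function names, and the normalized "remaining" feature are all assumptions for illustration, not a tested solution to the poster's setup.

```python
def reward_fn(pos_error):
    """Reward from the post: 1 / (1 + pos_error)."""
    return 1.0 / (1.0 + pos_error)


def apply_cooldown(action, time_since_q1, cooldown=5.0):
    """Mask a (q1_cmd, q2_cmd) action so that only one joint moves at a time
    and q2 stays blocked until `cooldown` seconds after q1 was last actuated."""
    q1_cmd, q2_cmd = action
    if q1_cmd != 0.0:
        return (q1_cmd, 0.0)  # assumption: q1 wins if both are commanded (q1_mask)
    if time_since_q1 < cooldown:
        return (0.0, 0.0)     # still waiting: q2 is not allowed yet
    return (0.0, q2_cmd)      # cooldown over: q2 may move (q2_mask)


def build_observation(dof_pos, goal_pos, pos_error, time_since_q1, cooldown=5.0):
    """Observation from the post plus the normalized remaining cooldown in [0, 1]."""
    remaining = max(0.0, cooldown - time_since_q1) / cooldown
    return (*dof_pos, *goal_pos, pos_error, remaining)


print(apply_cooldown((0.0, 0.8), time_since_q1=2.0))
print(build_observation((0.1, 0.2), (1.0, 2.0), 0.5, time_since_q1=2.5))
```

    Without the extra feature, identical observations can lead to different outcomes depending on a timer the agent cannot see, which is one plausible reason for the behaviour described in the post.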
  • Open

    It’s No Big Deal, but ChatGPT Changes Everything – Part I
    Time to catch the ChatGPT craze!  Yes, everyone is flocking to the ChatGPT AI-driven chatbot and asking all sorts of life-altering questions such as instructions for removing a peanut butter sandwich from a VCR in Biblical verse in Figure 1 (my answer…you’ve still got a VCR?). Figure 1: ChatGPT in Action! Heck, Ryan Reynolds… The post It’s No Big Deal, but ChatGPT Changes Everything – Part I appeared first on Data Science Central.  ( 23 min )
  • Open

    Hello! Has anyone tried to create a neural network to find all PI numbers?
    submitted by /u/ScIentIaEstP0tentIa [link] [comments]  ( 58 min )
    Reverse Engineering a Neural Network's Clever Solution to Binary Addition
    submitted by /u/nickb [link] [comments]  ( 55 min )
    🚀Muse:Text 2 Image Generation via Masked Generative Transformers
    submitted by /u/oridnary_artist [link] [comments]  ( 55 min )
    Does the "optimal structure" depend on the size of the sample or just the complexity of the problem?
    Hi everyone. First post on Reddit in general. Let me start by saying that I am more or less new to neural networks and self-taught in the subject. I am learning neural networks as part of my PhD in energy engineering. Basically, I deal with optimisation of energy systems (more precisely, Concentrated Solar Power plants) and I want to use neural networks to create a surrogate model of certain parts of my detailed and time-consuming models to later apply optimisation. So, if I am not wrong, I am dealing with a very classic "function approximation" problem. I want to train a neural network for a specific application. To do so, I gathered a large data set from my detailed model and then trained multiple networks with different numbers of neurons (considering only one hidden layer right now). As a result, I obtained an optimum number of neurons, which is the smallest that achieves a certain error (measured through the RMSE). Now my question: imagine you gather a new data set from the same model, but smaller. Could you assume that the optimum structure (number of neurons) is the same? I acknowledge that if the new data set is too small there could be problems of overfitting but, if you can assume that the new data set, although smaller, is still statistically representative of the problem, wouldn't the optimum structure be the same, since the optimum structure is related only to the complexity of the problem? Hope I was clear enough, probably my question is very simple. Thanks! submitted by /u/paworod [link] [comments]  ( 55 min )
  • Open

    Zeta sum vs zeta product
    The Riemann zeta function ζ(s) is given by an infinite sum and an infinite product for complex numbers s with real part greater than 1 [*]. The infinite sum is equal to the infinite product, but which would give you more accuracy: N terms of the sum or N terms of the product? We’ll take […] Zeta sum vs zeta product first appeared on John D. Cook.  ( 5 min )
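    The comparison posed in the excerpt can be tried numerically. The sketch below is not taken from the post: the function names `zeta_sum`, `zeta_product`, and `first_primes` are hypothetical, and "N terms of the product" is assumed to mean the Euler product over the first N primes.

```python
import math


def first_primes(n):
    """Return the first n primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def zeta_sum(s, n):
    """Partial sum: 1/1^s + 1/2^s + ... + 1/n^s."""
    return sum(k ** -s for k in range(1, n + 1))


def zeta_product(s, n):
    """Partial Euler product over the first n primes: prod 1 / (1 - p^-s)."""
    result = 1.0
    for p in first_primes(n):
        result /= 1.0 - p ** -s
    return result


if __name__ == "__main__":
    s, n = 2, 10
    exact = math.pi ** 2 / 6  # known closed form for zeta(2)
    print("sum error:    ", exact - zeta_sum(s, n))
    print("product error:", exact - zeta_product(s, n))
```

    Running it for a few values of s and N shows how the two truncation errors compare; s = 2 is convenient because the exact value is known in closed form.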
    Approximating pi with Bernoulli numbers
    In a paper on arXiv Simon Plouffe gives the formula which he derives from an equation in Abramowitz and Stegun (A&S). It took a little while for me to understand what Plouffe intended. I don’t mean my remarks here to be criticism of the author but rather helpful hints for anyone else who might have […] Approximating pi with Bernoulli numbers first appeared on John D. Cook.  ( 5 min )

  • Open

    Emma Myers as a Marvel What If character
    submitted by /u/oridnary_artist [link] [comments]  ( 43 min )
    I find the timing of the ChatGPT release curious.
    Prior to the ChatGPT release, there were several powerful & similar AI systems already in existence .. but little publicised. I wonder if the ChatGPT release was a commercial ploy to beat the (possibly more capable) Google AI systems such as PALM to the market? Or maybe the release was intended to break the wall of secrecy hiding AI systems from public view? Whatever was behind the release of ChatGPT, at least it may force Google to finally give the public access to their systems too. submitted by /u/MrEloi [link] [comments]  ( 49 min )
    Google may use Deepmind's Sparrow as ChatGPT competitor
    submitted by /u/henlo_there_fren [link] [comments]  ( 46 min )
    how would i effectively train an ai on minecraft building?
    I want to make an AI that you can put a prompt into and it makes a Minecraft schematic. This sounds easy to me until you really get into the specifics of it. I could train it on a bunch of schematics with their respective names, but there wouldn't really be any sense to it. For example, if you train it on a bunch of dragons and a bunch of sunglasses, and then tell it to build a dragon with sunglasses, it wouldn't know "where" to put the glasses relative to the dragon. What's the best way to go about this? submitted by /u/cbreauxgaming [link] [comments]  ( 49 min )
    So far I have found this tool (chatGPT) amazing in helping me write code, however...
    submitted by /u/oh_you_so_bad_6-6-6 [link] [comments]  ( 45 min )
    Pruning and Quantizing YOLO V7 With Modoptima
    In this blog, I have explained how you can use my library modoptima to optimize the YOLO v7 model improving the inference speed by 3-10 times on good processors. https://medium.com/@vikasojha894/pruning-and-quantizing-yolo-v7-with-modoptima-19c61aff7301 submitted by /u/VikasOjha666 [link] [comments]  ( 48 min )
    YouTube channel on AI architecture concepts?
    I've got a Master's degree in AI and I've really been enjoying a channel called Two Minute Papers, but it's a bit too focused on results and doesn't go into the key insights of each paper at all. I'm not looking for tutorials or specific implementation details - I'm interested in the architectures and types of neural networks which were used to obtain these amazing results. Does anybody know a YouTube channel (or something similar) which goes briefly into the core technology of each paper? submitted by /u/BagelOrb [link] [comments]  ( 44 min )
    Could Deepmind’s Sparrow be Google’s answer to ChatGPT?
    submitted by /u/liquidocelotYT [link] [comments]  ( 46 min )
    Artificial Intelligence Best Paper Awards Reviewed by Computer Vision News (and much more)
    Dear all, Here is Computer Vision News of January 2023. It includes reviews of 2 Best Paper Award winning research papers. Read 44 pages about AI, Deep Learning, Computer Vision and more - with code! Read online version for free (recommended) PDF version Free subscription on page 44. Enjoy! https://preview.redd.it/1wgoydszp7ca1.jpg?width=400&format=pjpg&auto=webp&s=1303f337dd627a6f9252d3a341c005e3cb06f433 submitted by /u/Gletta [link] [comments]  ( 44 min )
    Open art space
    Hey Reddit, I am new here. I was wondering if anyone knows of an art space in London where we can paint or draw for free? submitted by /u/lucasagazzani [link] [comments]  ( 44 min )
    Should I tell my company about Chat GPT to implement it into our workflow or keep silent about it?
    I work for a public institution in a small country in Europe. I figure it’ll take some time until people hear about it or even try to implement it. I think it’s inevitable, though, that some day it’ll rule our job market. I’m currently using AI for my own soon-to-be start-up and also for mundane tasks. My question is: Should I tell the board of my company about it? Since I’m in a strategic position of sorts, I think it would also be great for my reputation to be the first to kick-start it. However, I’m afraid that by opening this Pandora’s box I’ll create more competition by showing other people how to use it. I don’t know, what would you do? submitted by /u/Darklan [link] [comments]  ( 56 min )
    What will BMW M and Mercedes AMG Cars Look Like in the Future?
    submitted by /u/BallbustCuck [link] [comments]  ( 44 min )
    Bavaria-based mobility company German Bionic has developed an AI-powered exoskeleton that's designed to help workers carry out physically demanding jobs
    submitted by /u/Rollyman1 [link] [comments]  ( 44 min )
    Excuse me? How is airsoft not following their policy?
    submitted by /u/vajenny_zlacyniec [link] [comments]  ( 44 min )
    CHATGPT D&D Graphic Novel with Dalle-e and Azure Voice-Over
    submitted by /u/erikmalkavian [link] [comments]  ( 53 min )
    Inpainting with the Visuali editor (beta)
    submitted by /u/aigeneration [link] [comments]  ( 46 min )
    AI for your own files
    Hi there, I was wondering if there is some sort of tool out there that gives you a kind of localised AI, as in one that searches all my files, but also within them. Say there's a PowerPoint file and one of the slides has some relevant content: it will show that slide and not just the file itself. I run a consultancy and have 1000s of documents from past campaigns and clients. It would be great to be able to teach some form of AI about my content so that it finds things or creates things when I need it.... thanks! submitted by /u/Vincenth2008 [link] [comments]  ( 47 min )
    How ChatGPT would have changed my life
    submitted by /u/DeeMore [link] [comments]  ( 61 min )
    Box2Mask: A Unique Method for Single-Shot Instance Segmentation that Combines Deep Learning with the Level-Set Evolution Model to Provide Accurate Mask Predictions with only Bounding Box Supervision
    submitted by /u/ai-lover [link] [comments]  ( 48 min )
    AI-Developed, Synthetic DNA is About to Revolutionize Drug Production and Gene Therapy
    submitted by /u/digitalgoldnow [link] [comments]  ( 43 min )
    Build a simple GPT-3 chatbot in Python in 20 lines of code in 5 minutes
    submitted by /u/techie_ray [link] [comments]  ( 43 min )
    AI text to art generation explained simply with pen and paper
    submitted by /u/techie_ray [link] [comments]  ( 46 min )
    Unpopular opinion: AI will make jobs more boring
    One claim I hear a lot in AI circles is that in the future AI will replace a lot of bullshit jobs, which will free up time for humans to focus on what matters. No more transcribing emails for a living. New, meaningful jobs will be created instead. In other words, the hope is that in 10 years more people will be psychologically happy with their jobs than today. I'm growing wary of the opposite risk. If we consider "bullshit" the part of a job that can be automated away, then sometimes the bullshit part of a job is what makes it pleasurable and fulfilling (provided it is not the only thing). As an artist you may take pleasure in coloring after a pencil sketch. As an engineer, you may like a day without planning or meetings where you can focus on programming a small piece of code. As a blog post writer you may enjoy, well, writing. AI is quickly becoming able to take over all these tasks, and to stay competitive people will increasingly be required to offload parts of their jobs to it. I can see a future where jobs in any field start looking the same: the person is in charge of having higher-level knowledge about the problem, planning tasks for the AI, evaluating AI output, and assembling the final product. This is certainly not a bullshit job, but I also think that for many people this is not going to be a fulfilling job. submitted by /u/R_y_n_o [link] [comments]  ( 46 min )
    Our horror game story is crafted with ChatGPT and this scene is part of it. What do you think about it?
    submitted by /u/Leaderide [link] [comments]  ( 45 min )
    About art AIs, how noise works?
    I have heard that art AIs convert images into "noise" to create their models. Question: Can you revert the "noise" into the image it was based on? E.g., I take the Mona Lisa and convert it into "noise" so the AI can understand it. Can I then ask the AI to use the Mona Lisa "noise" to generate the Mona Lisa again? Or is there no way to turn it back into the image after it is converted into noise (not the exact image, but its equivalent based on the noise stored)? submitted by /u/___Marshmallow___ [link] [comments]  ( 45 min )
  • Open

    Emma Myers as a Marvel What If character
    submitted by /u/oridnary_artist [link] [comments]  ( 53 min )
    Are there any DNNs that perform gradient descent in runtime (mesa alignment)?
    I was watching the Robert Miles series on AI alignment, specifically the one about mesa-optimizers, and when he started talking about the mesa-objective I noticed that the way he defined the base objective (gradient descent + loss function + training data) is not analogous to the way the mesa-objective would work (the latter cannot change any parameters of the network, since no gradient descent step or anything analogous is used). This got me wondering whether there are any papers that implement runtime parameter changes, through gradient descent or otherwise. submitted by /u/not-alredy-taken [link] [comments]  ( 55 min )
    Electronic circuits analogies with Deep Neural Networks
    I was wondering these days about the training parameter in TensorFlow layers, and stumbled upon the idea of designing the activation function itself so that its backward pass yields no gradients, independent (or almost) of the loss in use. Do you know of any? Then it occurred to me that this resembles a buffer op-amp, and I was wondering whether there are any papers that explore circuit analogies with the training process, treating the inputs and gradients as “currents”. Seems like an interesting concept! submitted by /u/not-alredy-taken [link] [comments]  ( 58 min )
  • Open

    [D] What kinds of interesting models can I train with just an RTX 4080?
    I'm aware transformers are pretty vram hungry and a 4080 only has 16 GB. So I am guessing a lot of transformer based models will be out of the question. At least anything that is interesting. Not sure about other models though. Is there anything I can do with a 4080 that's beyond just some toy experiment? submitted by /u/faker10101891 [link] [comments]  ( 55 min )
    [D] What is standard practice in RL when reporting average returns across multiple seeds in a table or a plot?
    Hey everyone, This may be a silly question, but I'm confused as to what standard practice is when reporting average returns across multiple seeds in a table or a plot. It's usually not even mentioned, but I sometimes see authors mention they are using one of: Average ± Standard Deviation; Average ± Standard Error; Average ± 1.96 * Standard Error; Bootstrapped CIs. For example, this paper (https://www.jmlr.org/papers/volume23/21-1342/21-1342.pdf) by the authors of CleanRL doesn't specify anything other than that the "reported numbers are the final average episodic returns of at least 3 random seeds". What would you consider best practice in RL? submitted by /u/thekingpenguin3 [link] [comments]  ( 57 min )
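    The four conventions listed in the post differ only in what follows the ± sign. The sketch below computes all of them for one set of per-seed returns, using only the standard library. The function name `summarize` and the example returns are made up for illustration, and the percentile bootstrap shown is just one of several bootstrap variants.

```python
import random
import statistics


def summarize(returns, n_boot=2000, seed=0):
    """Report the mean with four common uncertainty measures across seeds."""
    n = len(returns)
    mean = statistics.fmean(returns)
    std = statistics.stdev(returns)      # sample standard deviation (n - 1)
    sem = std / n ** 0.5                 # standard error of the mean
    half_ci = 1.96 * sem                 # normal-approximation 95% half-width

    # Percentile bootstrap: resample seeds with replacement, recompute the mean.
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(returns, k=n)) for _ in range(n_boot)
    )
    boot_ci = (boot_means[int(0.025 * n_boot)],
               boot_means[int(0.975 * n_boot) - 1])

    return {"mean": mean, "std": std, "sem": sem,
            "ci95_half_width": half_ci, "bootstrap_ci95": boot_ci}


if __name__ == "__main__":
    per_seed_returns = [100.0, 120.0, 90.0, 110.0, 105.0]  # made-up values
    for name, value in summarize(per_seed_returns).items():
        print(name, value)
```

    Whichever measure is chosen, stating it explicitly next to the table or plot, together with the number of seeds, removes the ambiguity the post describes.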
    [P] Parameter optimization on a guitar amp emulation
    So I'm working on a project whose goal is to digitally emulate guitar tube amplifiers using a Wiener-Hammerstein model. For those of you unfamiliar with this type of model, its key block is a nonlinear block that is characterized by a set of 8 parameters. Basically, there's a raw input guitar signal and there's an output signal that should be as close as possible to the actual output of the modeled amplifier. I have a database with a series of magnitude-variable chirp signals serving as inputs and the respective output measurements. So my question is: what is the best way for me to automate the optimization of this set of parameters? I thought of using a genetic algorithm, but I wondered if that's the most accurate and efficient way of doing it. Also, this is to be implemented on a microcontroller, so I have more limited resources than a computer. However, it would be really cool to be able to customize these parameters in real time on my Teensy 4.0, so ideally the algorithm would meet this condition, although it's not completely necessary. submitted by /u/syko101 [link] [comments]  ( 57 min )
    [P] Summaries of Ten Interesting & Influential Papers I read in 2022
    submitted by /u/seraschka [link] [comments]  ( 91 min )
    [Discussion] can't lives down on won't street
    submitted by /u/Psqwanio [link] [comments]  ( 53 min )
    [D] AI Security - Gumroad AI help bot refuses to answer certain questions... unless you PRIME it with a question that it WILL answer first
    submitted by /u/LatentWeb [link] [comments]  ( 53 min )
    [P] Modified kmeans algorithm returns the wrong answer
    I am trying to create a k-means algorithm that is based on the Earth Mover's Distance instead of the Euclidean distance. However, when I run it, it just returns the same value for all data points.
    The input is a d×n matrix containing all of my n probability distributions.
    Here is an example of running the algorithm. The clusters should be much more distributed.
        distribution = {}
        num_bins = 5
        for i in data:
            distribution[i] = np.histogram(data[i], bins = num_bins)[0] / len(data[i])

        Z = np.zeros((len(data), num_bins))
        for i in range(len(Z)):
            Z[i] = distribution[list(distribution)[i]]
        Z = Z.T
        ans = k_means_algorithm(Z, 8, proportionally_random_k)

        res = points_to_clusters(Z, ans)
        print(res)

        [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. …  ( 58 min )
    [R] hlb-CIFAR10 0.2.0: New world record for single-GPU CIFAR10, ~<12.38s with one A100 (SXM4, Colab)
    submitted by /u/tysam_and_co [link] [comments]  ( 68 min )
    Best Predictive Model to Predict Total Monthly Stock Returns on Panel Data [P]
    Hello Reddit community, I am faced with some Finance coursework and I would appreciate any help or guidance from experienced practitioners in the ML/Finance industry. I have a panel dataset (Data File) - that includes information such as date, stock name, market-cap, sector, and a column for the target variable which is the monthly absolute return of the stock in percentage terms (i.e one month from the date column). I am tasked with building a predictive model to forecast the target variable. I would like to know information on which ML model you would recommend, and why? Thank you in advance for any help or guidance provided. submitted by /u/RhiteousRhino [link] [comments]  ( 56 min )
    [P] Free PyTorch Deep Learning class, from Perceptrons to multi-GPU training and cloud deployment
    submitted by /u/seraschka [link] [comments]  ( 55 min )
    [D] Problem with predict/evaluate and batch_size keras
    So, I trained my U-Net using Keras. The best Dice score of the saved model is supposed to be 0.817. Then I ran a hand-made score prediction:
        batch_size = 1
        n_val_img = len(os.listdir(os.path.join(fp2,"sujetos")))
        vspe = n_val_img//batch_size
        dice = 0
        for _ in range(0,vspe):
            test_image_batch, test_mask_batch = val_gen_ds.__next__()
            for i in range(test_image_batch.shape[0]):
                a = my_unet(np.expand_dims(test_image_batch[i], 0)).numpy()
                predicted_img_th = (a[0,:,:,0]>0.5)*1
                dice += Dice(test_mask_batch[i],predicted_img_th).numpy()
        print(dice/n_val_img)
    This returns around 0.836, so... already different. Then I tried to replicate my evaluate score with different batch sizes:
        my_unet.evaluate(val_gen_ds,batch_size=batch_size,steps=vspe)
    https://preview.redd.it/pmt5b358e8ca1.png?width=567&format=png&auto=webp&s=bbd24505bdb55b66a3cab12007cc8d628e533907 Clearly, it doesn't make sense to me... What's wrong? What am I missing? submitted by /u/SerDetestable [link] [comments]  ( 58 min )
    [R] HYPERREAL — high fidelity 6dof video with ray-conditioned sampling
    submitted by /u/SpatialComputing [link] [comments]  ( 55 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 52 min )
    [N] Will we see lower costs and faster training? Will Floating Point 8 Solve AI/ML Overhead?
    submitted by /u/RuairiSpain [link] [comments]  ( 53 min )
    [D] Time Embedding in Diffusion Model
    I was looking at how time is embedded in diffusion models, and I found these two implementations, [1] and [2]. The first one is a simplified version of the second one, but the idea behind the time embedding is similar. What I've understood is that t is a number; it goes into a SinusoidalPositionEmbeddings with a given time_dim, then Linear + ReLU where the same dimensions are kept. Then, for each down-step of the UNet, an additional Linear + ReLU is performed to match the channels of the image embedding, and this latter embedding is added to the output of the CNN. Here we have the time embedding with a shape of (b, c, 1, 1) and the image embedding with a shape of (b, c, h, w). When we perform the addition, the time embedding is broadcast to match the image embedding. As far as I understand, here the latent space of the image gets shifted channel-wise, but the same values are added at every position. Why did they make this choice? This is quite different from the standard positional encoding used e.g. in self-attention, where the positional embedding gives a different weight to each spatial position. I never found this detail explained in any diffusion model paper or tutorial. Also, if we look at [2], the same idea is made more complex, with more Linear projections and different activation functions (GeLU and SiLU). Moreover, I'm not sure about the difference between applying a time embedding followed directly by a conv2d layer and applying time embedding + attention + conv2d. Aren't these types of embedding suited to attention layers? How does a Conv2D layer, which is position-invariant by construction, benefit from this type of operation? [1] https://colab.research.google.com/drive/1sjy9odlSSy0RBVgMTgP7s99NXsqglsUL?usp=sharing#scrollTo=KOYPSxPf_LL7 [2] https://github.com/lucidrains/denoising-diffusion-pytorch/blob/main/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py submitted by /u/Lumett [link] [comments]  ( 60 min )
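    The broadcasting step described in the post can be reproduced in a few lines of NumPy. This is a sketch of the mechanism only, not the code from either linked implementation: the function names are hypothetical, a random matrix `weight` stands in for the learned Linear layer, and the frequency schedule follows the usual sinusoidal form. It shows that the timestep contributes one value per sample and channel, identical at every spatial position. That is consistent with reading t as a property of the whole noisy image, unlike a token position, although the post's question about why it was designed this way is not settled by the code.

```python
import numpy as np


def sinusoidal_embedding(t, dim):
    """Map timesteps of shape (b,) to embeddings of shape (b, dim); dim must be >= 4 and even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / (half - 1))
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1)


def add_time_embedding(features, t, weight):
    """Add a per-sample, per-channel time bias to features of shape (b, c, h, w).

    `weight` has shape (time_dim, c) and stands in for a learned Linear layer.
    """
    emb = sinusoidal_embedding(t, weight.shape[0])   # (b, time_dim)
    per_channel = np.maximum(emb @ weight, 0.0)      # Linear + ReLU -> (b, c)
    # Reshape to (b, c, 1, 1); broadcasting repeats it over h and w.
    return features + per_channel[:, :, None, None], per_channel


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(2, 4, 3, 3))
    weight = rng.normal(size=(8, 4))
    out, bias = add_time_embedding(features, np.array([0.0, 10.0]), weight)
    print(out.shape, bias.shape)
```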
    [P] I built an app that allows you to build Image Classifiers completely on your phone. Collect data, Train models, and Preview the predictions in realtime. You can also export the model/dataset to be used anywhere else. Would love some feedback.
    submitted by /u/Playgroundai [link] [comments]  ( 63 min )
    [P] I built arxiv-summary.com, a list of GPT-3 generated paper summaries
    Hi there, I wanted to share my new project with you, it is called arxiv-summary.com. Right now, I find it really difficult to keep up with all the important new publications in our field. Especially, it is sometimes difficult to get an overview of a paper to decide if it's worth reading. I really like arxiv-sanity by Andrej Karpathy, but even with that, it can still take some time to understand the main ideas and contributions from the abstract. With arxiv-summary, my goal is to make ML research papers more "human-parsable". The website works by fetching new papers daily from arxiv.org, using PapersWithCode to filter out the most relevant ones. Then, I parse the papers' pdf and LaTeX source code to extract relevant sections and subsections. GPT-3 then summarizes each section and subsection as bullet points, which are finally compiled into a blog post and uploaded to the site. You can check out the site at arxiv-summary.com and see for yourself. There's also a search page and an archive page where you can get a chronological overview. If you have any feedback or questions, I'd be happy to hear them. Also, if you work at OpenAI and could gift me some more tokens, that would be much appreciated :D Thanks and happy reading! submitted by /u/niclas_wue [link] [comments]  ( 65 min )
    [Project] Introducing Visionner (Your image dataset toolkit)
    submitted by /u/charles_data_dev [link] [comments]  ( 58 min )
    [P] C++ wrapper around libsvm and liblinear using Eigen
    I needed a C++ wrapper library around libsvm and liblinear using Eigen so I made one. Maybe it's useful for you as well: https://github.com/bloomen/svmegn submitted by /u/cblume [link] [comments]  ( 56 min )
    [D] Packing multiple shorter training examples in to single sequence in LM pretraining
    I've come across several papers where authors mention they, for reasons of computational efficiency, will pack multiple shorter training examples together in a single sequence. Example from "Scaling Instruction-Finetuned Language Models" (Chung et al, 2022): We use packing (Raffel et al., 2020) to combine multiple training examples into a single sequence, separating inputs from targets using an end-of-sequence token. Masking is applied to prevent the tokens from attending to others across the packed example boundary. I'm curious to understand how this is actually done in practice. Seeing as multiple separate masks are involved, I'd think one would need to loop over them all and repeat (?) the matrix multiplication several times? Is there some built in functionality in Pytorch and other frameworks to deal with a situation like this with multiple masks? Thankful if someone could share and explain, or link to an implementation of input packing. submitted by /u/mLalush [link] [comments]  ( 59 min )
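    One common way to handle the masking without looping is to give every token a segment id and derive a single block-diagonal mask for the whole packed sequence, so attention is still one masked matrix multiplication. The sketch below uses plain Python lists to show the mask itself. The helper names `pack` and `packed_attention_mask` are hypothetical, and this is not claimed to be the implementation used in the cited papers.

```python
def pack(examples, max_len, pad_id=0):
    """Concatenate token lists into one sequence and record segment ids.

    Segment ids start at 1; padding positions get segment id 0.
    """
    tokens, segment_ids = [], []
    for seg, example in enumerate(examples, start=1):
        tokens.extend(example)
        segment_ids.extend([seg] * len(example))
    if len(tokens) > max_len:
        raise ValueError("examples do not fit into max_len")
    pad = max_len - len(tokens)
    return tokens + [pad_id] * pad, segment_ids + [0] * pad


def packed_attention_mask(segment_ids, causal=True):
    """mask[i][j] is True when position i may attend to position j.

    Attention is allowed only inside the same segment (and, if causal,
    only to earlier or equal positions). Padding rows attend to nothing
    here; in practice their outputs are simply excluded from the loss.
    """
    n = len(segment_ids)
    return [
        [
            segment_ids[i] != 0
            and segment_ids[i] == segment_ids[j]
            and (not causal or j <= i)
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    tokens, segs = pack([[5, 6, 7], [8, 9]], max_len=6)
    for row in packed_attention_mask(segs):
        print("".join("x" if allowed else "." for allowed in row))
```

    In a tensor framework the same mask is typically built by comparing the segment-id vector with its own transpose and is then passed to the attention call as an ordinary boolean or additive mask.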
  • Open

    Online Reinforcement Learning Courses
    Can anyone recommend any online Reinforcement Learning courses, preferably those that have assignments or exercises that are graded so there is some way to check your answers? submitted by /u/Smart-Ground-3587 [link] [comments]  ( 55 min )
    CS234 Stanford
    Can you submit the programming assignments of CS234 without using gradescope? How do you check your work if you are taking the course without being enrolled in it officially? submitted by /u/Smart-Ground-3587 [link] [comments]  ( 55 min )
    Boardgame environment and data sources [help]
    Hi everybody! I'm trying to come up with a fun learning project to execute. I'd like to try to create an agent that plays a boardgame. My plan is to apply reinforcement learning and behavioral cloning. I don't know what boardgame to work with yet, I just want it to be simple enough to work with and fulfill two requirements: The environment must be implemented and accessible. I must have access to a dataset containing historical plays. Would anybody have suggestions on where I could find these? Any responses are appreciated. Thank you! submitted by /u/valahart [link] [comments]  ( 55 min )
    In Asynchronous n-step DQN, is there a global shared gradient vector or gradient vector for each thread?
    In this paper: *Asynchronous Methods for Deep Reinforcement Learning (arxiv.org) This is the pseudocode for n-step DQN: https://preview.redd.it/rjt84n6zs9ca1.png?width=1176&format=png&auto=webp&s=eda075c5741bae96f954432073b0d6617937941a In the pseudocode above it mentions: "Initialize network gradients dtheta <-- 0." Is this referring to a global shared gradient vector or a gradient vector for each thread? I noticed that they use theta instead of theta' making me think it is a global shared gradient vector. But if this is the case, couldn't a thread clear the gradient vector while another thread is accumulating gradients? Also, in section 7 of the paper, where they talk about implementing SGD with Momentum in an Asynchronous setting it seems to imply that it is a global shared gradient vector normally. https://preview.redd.it/souf3gxkt9ca1.png?width=1152&format=png&auto=webp&s=4a0cff8ba6100c651d0ed8b7d34089ec2e3aecd5 submitted by /u/ImNotKevPlayz [link] [comments]  ( 56 min )
    What is data governance and its importance to a company?
    submitted by /u/ranjeettechnincal [link] [comments]  ( 54 min )
    How to interpret negative entropy_loss that keeps decreasing in PPO?
    Why is it negative? Why does it keep decreasing? Will it stop at any point? Is this expected behavior? If not, what should I adjust? I am using Stable Baselines3. https://preview.redd.it/bypjlwy5j9ca1.png?width=372&format=png&auto=webp&s=94fb5fe55b66c90d705c6776ffa5dd258e90d2ba https://preview.redd.it/f0ooky84j9ca1.png?width=1413&format=png&auto=webp&s=0b62f9d8df555835ce72cfe41cfd840c295b70ca submitted by /u/andrew7777777 [link] [comments]  ( 54 min )
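    A small calculation may help with the sign. The sketch assumes that the logged `entropy_loss` is the negated mean policy entropy, which is the usual convention for an entropy bonus in PPO. The functions below are hypothetical helpers written for illustration, not library calls. Under that assumption, a negative value just means the entropy is positive, and a value that keeps decreasing means the policy is becoming more random rather than less.

```python
import math


def categorical_entropy(probs):
    """Entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def gaussian_entropy(log_std, dims=1):
    """Entropy of a diagonal Gaussian policy with the same log_std in every dimension."""
    return dims * (0.5 * math.log(2 * math.pi * math.e) + log_std)


def entropy_loss(entropies):
    """Assumed convention: the loss term is minus the mean entropy."""
    return -sum(entropies) / len(entropies)


if __name__ == "__main__":
    print(entropy_loss([categorical_entropy([0.5, 0.5])]))  # negative: entropy > 0
    for log_std in (-2.0, 0.0, 0.5):
        # A larger standard deviation gives a more negative entropy loss.
        print(log_std, entropy_loss([gaussian_entropy(log_std)]))
```

    For a continuous action space nothing bounds the Gaussian entropy from above, so the curve does not have to level off by itself. Whether that is a problem depends on whether the reward is still improving.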
    Help with transitioning an existing DQN into a DRQN
    Hi RL reddit, To preface this post, please let me know if I need to clarify any details to receive help and/or guidance. I am new to posting on this subreddit and still consider myself a novice in the deep RL domain. What I need help with is transitioning an existing DQN into a DRQN. The DQN architecture and the environment that it learns come directly from this paper https://arxiv.org/pdf/1810.04244.pdf To briefly summarize the paper, the author proposes a DQN as a controller to guide fixed-wing aircraft to follow the evolution of a spreading wildfire (grid environment). The same DQN can be used for both aircraft. The inputs are as follows: A 5 dimensional vector bank angle of ownship distance to other aircraft bearing angle to other aircraft relative to current he…  ( 66 min )
    Best practices for Self-Play RL
    Hi! I know there is a lot of work on self-play (training RL agents in environments where they play against themselves), and I've found several tricks to stabilize the training process. I was wondering if someone who has experience in this field could provide a compilation of such tricks and best practices, for example: Fictitious self-play: Keeping N previous checkpoints of the agent and sampling from a pool of these to select an opponent every T environment steps. (I have also seen people sample a new opponent after every single env.reset() call, what do you think is best?). What is a reasonable value for T? and for N? KL distillation loss: Adding a KL loss penalty between the current agent being trained and the last checkpoint stored so that the policy doesn't change abruptly. How is this usually implemented? What's a reasonable value for a coefficient for that penalty? Imagine a DQN agent playing against itself, is it reasonable to set epsilon=1 and start annealing it every time a new enemy is sampled? (in case we play against the same enemy for a long time). ​ There might be many more tricks so if we can list them all here that'd be great! Thank you all! submitted by /u/xWh0am1 [link] [comments]  ( 61 min )
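    The checkpoint-pool idea from the first trick can be sketched as follows. Everything here is an illustrative assumption: the class `OpponentPool`, the function `self_play`, and the parameter names `resample_every` (the post's T), `pool_size` (the post's N), and `snapshot_every` are hypothetical, a dictionary with a version counter stands in for network weights, and uniform sampling is only one possible choice.

```python
import copy
import random
from collections import deque


class OpponentPool:
    """Keeps the N most recent policy snapshots and samples opponents from them."""

    def __init__(self, max_size, seed=0):
        self.snapshots = deque(maxlen=max_size)  # the oldest snapshot drops out first
        self.rng = random.Random(seed)

    def add(self, policy_params):
        # Deep copy so later training does not change the stored snapshot.
        self.snapshots.append(copy.deepcopy(policy_params))

    def sample(self):
        return self.rng.choice(list(self.snapshots))


def self_play(total_steps, resample_every, snapshot_every, pool_size):
    """Skeleton of a self-play loop; the learning update is a placeholder."""
    policy = {"version": 0}            # stand-in for network weights
    pool = OpponentPool(pool_size)
    pool.add(policy)
    opponent = pool.sample()
    opponents_faced = []
    for step in range(1, total_steps + 1):
        opponents_faced.append(opponent["version"])
        policy["version"] += 1         # stand-in for one learning update
        if step % snapshot_every == 0:
            pool.add(policy)           # freeze the current policy into the pool
        if step % resample_every == 0:
            opponent = pool.sample()   # switch opponent every T steps
    return pool, opponents_faced
```

    Sampling a new opponent at every environment reset corresponds to replacing the step-based check with an episode-based one; the pool itself is unchanged.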
  • Open

    Reverse engineering options
    This weekend I saw a sign in the window of a Burger King™ that made me think of an interesting problem. If you know the number of possibilities like this, how would you reverse engineer the options that created the possibilities? In the example above, there are 221,184 = 2^13 × 3^3 possible answers, and so […] Reverse engineering options first appeared on John D. Cook.  ( 6 min )
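    A natural first step for this kind of reverse engineering is to factor the count, since independent choices multiply. The sketch below is an illustration rather than the post's method: `prime_factorization` and `count_combinations` are hypothetical helper names, and the example menu (thirteen yes/no choices and three three-way choices) is just one arrangement whose product is 2^13 × 3^3 = 221,184, not a claim about the actual sign.

```python
import math


def prime_factorization(n):
    """Return {prime: exponent} for n >= 2, using trial division."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def count_combinations(option_counts):
    """Number of possibilities when each choice is made independently."""
    return math.prod(option_counts)


if __name__ == "__main__":
    total = 2 ** 13 * 3 ** 3
    print(total, prime_factorization(total))
    # One hypothetical menu with that many combinations:
    print(count_combinations([2] * 13 + [3] * 3))
```

    The factorization does not pin the options down uniquely. For instance, two binary choices are indistinguishable from one four-way choice by count alone.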

  • Open

    Interesting website for people struggling with the concept of Roko's basilisk
    submitted by /u/Fusionism [link] [comments]  ( 45 min )
    Best Artificial Intelligence books for beginners to Experts to read
    submitted by /u/Lakshmireddys [link] [comments]  ( 44 min )
    Is there a tool (or research paper) for training a model on images for inpainting?
    I want to train a ML model on an object and have it be able to inpaint the specific object into other photos. Does a service like this exist? submitted by /u/Zestybeef10 [link] [comments]  ( 45 min )
    AI Sales Chat Bot and Telephone Agent
    Hi there, I'm looking for AI software that can do the following tasks: A fully automated chat bot that can be used to close deals. A telephone bot that can do outreach and also sense if a lead is interested or not. Basically I want to be able to automate the whole outreach + close process by AI so that I can process much more contact data as human sales reps could ever do. If a lead is interested, they should receive a link with an offer where they can directly purchase the digital product that I'm planning to sell. submitted by /u/cokedinosaur [link] [comments]  ( 44 min )
    AI Etsy shop
    Hey guys, I got into the AI space about 3 weeks ago and am starting my Etsy shop journey with mainly AI-created photos of supercars. I would just love some feedback as to what I can add and how to make it as good as possible. I am planning on bringing in large metal canvases soon. Just really wanting feedback on the artwork. Thank you!! https://picturetron.etsy.com submitted by /u/BetterPresentation35 [link] [comments]  ( 45 min )
    What is the future of NLP for the coming 24 months? Dall-E clones Mid-Journey and SD took 6-8 months to appear, so is that how long it will take for clones of ChatGPT? Perhaps less time given the higher investment and market potential?
    submitted by /u/MegavirusOfDoom [link] [comments]  ( 50 min )
    Spray-on smart skin uses AI to rapidly understand hand tasks
    submitted by /u/qptbook [link] [comments]  ( 44 min )
    Which A.I. Labs are most likely to have offshoots that scale well?
    With ChatGPT and GPT-4 hype on full throttle I anticipate a lot of new A.I. labs and off-shoots of OpenAI and Google Brain and DeepMind will keep forming. Speaking of which, Niki Parmar and Ashish Vaswani, two prominent artificial intelligence researchers who left Google in 2021 to launch Adept, have now departed to make yet another A.I. lab in stealth mode. By 2024, there will be around a dozen good A.I. labs not named OpenAI. Google itself has PaLM with RLHF, Chinchilla, Google Duplex, LaMDA and Sparrow (DeepMind). More foundational models will arrive as A.I. labs make their models public. But who and which one is likely to be good? Anthropic, Adept, Inflection A.I., AI21 Labs, Cohere, there are a lot of potential candidates. This is not counting the ones likely forming in China, a…  ( 49 min )
    Could AI be the answer to our content needs by 2025?
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 44 min )
    I Created A Reddit Chatbot..(Potentially offensive content)
    submitted by /u/TheRPGGamerMan [link] [comments]  ( 48 min )
    What practical applications have you already found for ChatGPT?
    submitted by /u/DrMelbourne [link] [comments]  ( 46 min )
    ChatGPT will undoubtedly change the world. The question is HOW? What are your thoughts?
    submitted by /u/DrMelbourne [link] [comments]  ( 46 min )
    AI that you give an image to and it says colors that can complement it?
    Title says it all: is there an AI that you can give an image to and that suggests some colors that could work with it? submitted by /u/NewShibeAccount [link] [comments]  ( 46 min )
    Do you want to understand how an end-2-end paraphrase app can be created?
    Check out this Medium article about how to create such an app: https://medium.com/towards-artificial-intelligence/how-to-create-an-end-2-end-text-paraphrase-app-db83a4e05918 Or check this repository for Quotera, a paraphrase app in Python that can be deployed via Streamlit or FastAPI and Docker: https://github.com/stavrostheocharis/quotera submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 45 min )
    THIS BOOK WAS WRITTEN ENTIRELY BY CHATGPT.
    submitted by /u/__sandeepan__ [link] [comments]  ( 43 min )
    Bagging vs Boosting Explained
    Hi guys, I have made a video on YouTube here where I cover the Bagging and Boosting ensemble learning algorithms. I present how both work, and discuss their similarities and differences. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 47 min )
    Top A.I. Powered Tools Not Named ChatGPT
    submitted by /u/BackgroundResult [link] [comments]  ( 46 min )
    The misuse of AI is the familiar promise one thing and deliver something else
    submitted by /u/shanoshamanizum [link] [comments]  ( 45 min )
    Will there be a universal AI tool?
    When thinking about the possibilities of artificial intelligence, I couldn't stop thinking about some sort of interactive Wikipedia, imagine all the objective knowledge of the universe compiled into a firmware pre-installed app just like the 'Calculator'. Another type of device that could become an incredible tool for solving problems, might even be useful for business or professionals. We could name it the 'Communicator', just like a calculator you enter endless math problems, with the communicator you could enter endless word problems. I think we could grab much more productivity out of this technology than just asking philosophical questions to Chat GPT. submitted by /u/jvazorka03 [link] [comments]  ( 45 min )
    I wrote a blog post about use of AI in Music Industry. Take a Look!
    https://link.medium.com/HnMPSMA4zwb submitted by /u/Yigit_im [link] [comments]  ( 44 min )
    An interactive AI training simulation using Genetic Algorithm
    submitted by /u/SparshG [link] [comments]  ( 44 min )
    Bright Eye: free mobile AI app that generates art, code, poems and essays, and more.
    Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. We’ve released a new feature called AAIA (Ask AI Anything), which is capable of answering all types of questions, even requests to generate literature (fantasy, folklore, drama, fiction, fable, etc.). It’s sort of like ChatGPT. We’d appreciate it if you could try it out and let us know your thoughts: https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/True-Marketing-5079 [link] [comments]  ( 46 min )
    I'm Compiling a List of Helpful AI Tools, Feel free to add any you've created/discovered
    submitted by /u/secret-millionaire [link] [comments]  ( 45 min )
    We created a list of AI projects and applications on GitHub
    There are incredible applications built using AI. It is definitely a trend that the world should not ignore. We started to maintain a collection of cool AI projects on GitHub: https://github.com/ai-collection/ai-collection Our mission is to increase reach and visibility for these awesome projects! It is updated daily and we hope that with the help of the community, it will be a great source for discovering AI applications. submitted by /u/beth0io [link] [comments]  ( 46 min )
  • Open

    Best Neural Networks Courses on Udemy to Consider
    submitted by /u/Lakshmireddys [link] [comments]  ( 52 min )
    New Abilities Emerge If Language Models Are Scaled Past Critical Point ⭕
    Last year, large language models (LLMs) broke record after record. ChatGPT got to 1 million users faster than Facebook, Spotify, and Instagram did. They helped create billion-dollar companies, and most notably they helped us recognize the divine nature of ducks. 2023 has started and ML progress is likely to continue at a break-neck speed. This is a great time to take a look at one of the most interesting papers from last year. Emergent Abilities in LLMs: In a recent paper from Google Brain, Jason Wei and his colleagues allowed us a peek into the future. This beautiful research showed how scaling LLMs might allow them, among other things, to: become better at math; understand even more subtleties of human language; reduce hallucination and answer truthfully; ... (See the plot o…  ( 77 min )
    Bagging vs Boosting Explained
    Hi guys, I have made a video on YouTube here where I cover the Bagging and Boosting ensemble learning algorithms. I present how both work, and discuss their similarities and differences. I hope it may be of use to some of you out there. As always, feedback is more than welcomed! :) submitted by /u/Personal-Trainer-541 [link] [comments]  ( 56 min )
    Help visualizing Spiral dataset
    Hello, I want to visualize a neural network learning to fit the spiral dataset. I am able to plot the given data. What I want to see is how my neural net adjusts to fit the data points, forming the spiral shape. The problem is I don't know what to plot or how to do it. My softmax function outputs a probability distribution and my categorical loss function outputs the loss, so I'm lost. submitted by /u/Purple_Gen3 [link] [comments]  ( 55 min )
  • Open

    [D]: Are there models like CODEX but work in a reversed way?
    Many models these days focus on code generation. But I was wondering if there's anything for understanding an existing codebase? I know that Codex or ChatGPT can understand what a function does, but what about a complex codebase with imports and nested calls? Are these models capable of understanding the relationships between functions? I'm trying to build a side project where you give it a production-level codebase, it does some magic, and then I can ask the AI anything about this codebase and get highly accurate answers. submitted by /u/GoodluckH [link] [comments]  ( 58 min )
    [D] Leveraging multiple photos for super resolution / restoration
    Is there any work that does this? Let's say you have 20 good, related photos, and one bad one you want to restore / upscale / denoise / sharpen / inpaint etc. Those 20 good images of the same person / object / building should give the model a good sense of how to fill in missing details in the bad photo. I imagine there's work somewhere in this direction but can't find anything. submitted by /u/anonDogeLover [link] [comments]  ( 55 min )
    [P] handlingclassifier.ml - predicts size category from provided product name - works with IKEA-like range of products
    submitted by /u/curryprogrammer [link] [comments]  ( 55 min )
    [P] Question regarding ID3 and cross validation
    I created an ID3 algorithm using Scilab for a project at university. The project is more a proof of concept than something with an actual use case. It's written in Scilab without using any toolboxes and classifies whether you have won in tic-tac-toe. My code basically uses a dataset that has every possible endgame board configuration of tic-tac-toe and builds a decision tree. I can then input a specific endgame board configuration and it tells me whether I won or not. It works fine so far and, since my dataset has all possible configurations, predicts the correct label 100% of the time. I have now added a ten times k-fold cross-validation algorithm. However, the validation only gives me an accuracy of about 80%. Am I missing something here? Does cross-validation even make sense if my training set contains all possible data points? Hope someone can give some answers. submitted by /u/i-dunnodude [link] [comments]  ( 57 min )
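    One way to see where the gap between 100% and roughly 80% can come from: in k-fold cross-validation every fold is held out in turn, so the model is fitted on the remaining folds only and scored on boards it never saw. The sketch below makes this visible with a pure lookup-table "model" standing in for ID3 (an assumption for illustration; the function names are hypothetical). A real decision tree generalizes better than a lookup table, but it still has to predict boards that were absent from its training folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle range(n_samples) and split it into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate_memorizer(samples, labels, k=10, default=None):
    """Mean held-out accuracy of a model that only memorizes its training folds."""
    folds = k_fold_indices(len(samples), k)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        # The model sees only the k-1 training folds, never the held-out one.
        table = {samples[i]: labels[i]
                 for i in range(len(samples)) if i not in held_out_set}
        correct = sum(table.get(samples[i], default) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k
```

    With unique samples, the memorizer is perfect on data it has seen and scores zero on every held-out fold, which is the extreme version of the effect described in the post.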
    [R] Photorealistic human image editing using attention with GANs
    submitted by /u/psarpei [link] [comments]  ( 58 min )
    [D] Speaker diarization: reusing fitted speaker embedding clusters?
    I am trying to create speaker-aware transcripts from (multiple) audio files of a podcast. Right now I'm using OpenAI Whisper for the transcripts and pyannote.audio for speaker diarization (speaker segmentation + centroid clustering) In order to speed up the process (diarization time doesn't seem to scale linearly), I'd like to fit the centroids with the first audio file, and use those to predict the speakers (clusters of the speaker embeddings) of the other audio files, as the speakers don't change across episodes. However, the default pyannote.audio diarization pipeline refits the clusters for each audio file. Do you know of any other Python framework that allows reusing the fitted clusters, or any way pyannote.audio allows this? Is this even possible? Any other way to achieve the desired results? submitted by /u/2blazen [link] [comments]  ( 57 min )
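    The reuse step the post asks about can be sketched independently of any library: once centroids have been fitted on the first episode, embeddings from later episodes are assigned to the most similar centroid. This is a sketch under assumptions, not pyannote.audio's API: the function names are hypothetical, embeddings are plain lists of floats, all vectors are assumed to be non-zero, and cosine similarity is an assumed choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings, centroids):
    """Label each embedding with the speaker whose centroid is most similar.

    centroids maps a speaker label to a centroid vector fitted beforehand.
    """
    return [
        max(centroids, key=lambda spk: cosine_similarity(emb, centroids[spk]))
        for emb in embeddings
    ]


# Centroids fitted on the first episode (toy values for illustration).
centroids = {"host": [1.0, 0.0], "guest": [0.0, 1.0]}
print(assign_speakers([[0.9, 0.1], [0.2, 0.8]], centroids))  # ['host', 'guest']
```

    A nearest-centroid rule like this cannot detect a new speaker; a similarity threshold below which a segment is left unassigned would be one way to handle that case.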
    [Project] Stable Diffusion Pokémon cards
    submitted by /u/thundergolfer [link] [comments]  ( 62 min )
    [R] Differentiable Point-Based Radiance Fields for Efficient View Synthesis
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 55 min )
    [R] from a human motion sequence, SUMMON synthesizes physically plausible and semantically reasonable objects
    submitted by /u/t0ns0fph0t0ns [link] [comments]  ( 62 min )
    [R] Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement - TeachMe - Bhavana Dalvi Mishra et al Allen Institute for AI
    Paper: https://arxiv.org/abs/2204.13074 Blog: https://blog.allenai.org/towards-teachable-reasoning-systems-dd16659fd9f8 Youtube: https://www.youtube.com/watch?v=c5j_tWsENFg Abstract: Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of user feedback, containing user-supplied corrections to erroneous model beliefs that users identify during interaction. Retrievals from memory are used as additional context for QA, to help avoid previous mistakes in similar new situations - a novel application of memory-based continuous learning. With simulated feedback, we find that our system (called TeachMe) continually improves with time, and without model retraining, requiring feedback on only 25% of training examples to reach within 1% of the upper-bound (feedback on all examples). Similarly, in experiments with real users, we observe a similar trend, with performance improving by over 15% on a hidden test set after teaching. This suggests new opportunities for using frozen language models in an interactive setting where users can inspect, debug, and correct the model's beliefs, leading to improved system's performance over time. https://preview.redd.it/umosmtgzj0ca1.jpg?width=507&format=pjpg&auto=webp&s=65f66ed1230ae6bdce73a86386fbfcc860cd4c59 https://preview.redd.it/5lbhvwgzj0ca1.jpg?width=680&format=pjpg&auto=webp&s=99b402f6db395a62756113b2f8cb879667d444ef https://preview.redd.it/jd7oaygzj0ca1.jpg?width=1308&format=pjpg&auto=webp&s=4fac23ba2ecd68eb5b744f6c10a9c09ba604376c https://preview.redd.it/q137kkhzj0ca1.jpg?width=839&format=pjpg&auto=webp&s=24052a161029b23b32944954f88f432348016ea0 submitted by /u/Singularian2501 [link] [comments]  ( 61 min )
    [N] Class-action lawsuit filed against Stability AI, DeviantArt, and Midjourney for using the text-to-image AI Stable Diffusion
    submitted by /u/Wiskkey [link] [comments]  ( 76 min )
    [D] What's hot for Machine Learning Research in 2023?
    Which of the sub-fields/approaches, application areas are expected to gain much attention (pun unintended) this year in the academia? PS - Inspired from a similar question last year (https://www.reddit.com/r/MachineLearning/comments/t04ekm/d_whats_hot_for_machine_learning_research_in_2022/) submitted by /u/Aromatic_Eye_6268 [link] [comments]  ( 57 min )
    [D] Is MusicGPT a viable possibility?
    As in, "Pink Floyd, Another Brick in the Wall, ska, heavy trumpet, female vocalist" It seems that if copyright issues are a controversial element of AI art, then copyrighted music will run into the same issue. Or is this not true? submitted by /u/markhachman [link] [comments]  ( 61 min )
  • Open

    Is there a better server alternative than AWS/Azure/Nvidia...for students?
    I'm a student and I've gotten to the part of my machine learning project where I need to optimize a lot. When I say a lot, I mean a lot: I have complex models. Most people these days pay for the services of "big tech companies" like Amazon, Microsoft, etc. to get their models trained. But I think in my case it would cost a lot of money, although I am aware that some have student discounts. Are there any alternatives, like universities that allow students to do this, or something else entirely? If not, which of these companies would you recommend in terms of computing/price? Thanks for all the replies submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 54 min )
    _Rocket League_ RL agent 'Nexto' now in top 0.5% of players
    submitted by /u/gwern [link] [comments]  ( 52 min )
  • Open

    Foreshadowing Page Rank
    Douglas Hofstadter, best known as the author of Godel, Escher, Bach, wrote the foreword to Clark Kimberling’s book Triangle Centers and Central Triangles. Hofstadter begins by saying that in his study of math he “sadly managed to sidestep virtually all of geometry” and developed an interest in geometry, specifically triangle centers, much later. The ancient […] Foreshadowing Page Rank first appeared on John D. Cook.  ( 6 min )
  • Open

    Novelty Socks by AI
    I like a fun sock. The more random the design, the better. What kinds of novelty sock ideas would we get if we used AI as a creativity aid? It turns out they aren't too novel unless the AI is glitchy. I collected 14 examples of socks I  ( 6 min )
    Bonus: More novelty socks
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v4 [cs.LG] UPDATED)
    We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.  ( 2 min )
    Bayesian inference via sparse Hamiltonian flows. (arXiv:2203.05723v2 [stat.ML] UPDATED)
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference, with the goal of reducing computational cost. Although past work has shown empirically that there often exists a coreset with low inferential error, efficiently constructing such a coreset remains a challenge. Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence. In this work, we introduce a new method -- sparse Hamiltonian flows -- that addresses all three of these challenges. The method involves first subsampling the data uniformly, and then optimizing a Hamiltonian flow parametrized by coreset weights and including periodic momentum quasi-refreshment steps. Theoretical results show that the method enables an exponential compression of the dataset in a representative model, and that the quasi-refreshment steps reduce the KL divergence to the target. Real and synthetic experiments demonstrate that sparse Hamiltonian flows provide accurate posterior approximations with significantly reduced runtime compared with competing dynamical-system-based inference methods.  ( 2 min )
    Manifold Fitting under Unbounded Noise. (arXiv:1909.10228v2 [stat.ML] UPDATED)
    There has been an emerging trend in non-Euclidean statistical analysis of aiming to recover a low dimensional structure, namely a manifold, underlying the high dimensional data. Recovering the manifold requires the noise to be of certain concentration. Existing methods address this problem by constructing an approximated manifold based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, either the samples are noiseless or the noise is bounded. However, if the noise is unbounded, which is a common scenario, the tangent space estimation at the noisy samples will be blurred. Fitting a manifold from the blurred tangent space might increase the inaccuracy. In this paper, we introduce a new manifold-fitting method, by which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold, rather than at the sample points, to decrease the error caused by the noise. Assuming the noise is unbounded, our new method provides theoretical convergence in high probability, in terms of the upper bound of the distance between the estimated and underlying manifold. The smoothness of the estimated manifold is also evaluated by bounding the supremum of twice difference above. Numerical simulations are provided to validate our theoretical findings and demonstrate the advantages of our method over other relevant manifold fitting methods. Finally, our method is applied to real data examples.  ( 2 min )
    Efficient Ridge Solution for the Incremental Broad Learning System on Added Nodes by Inverse Cholesky Factorization of a Partitioned Matrix. (arXiv:1911.04872v4 [cs.LG] UPDATED)
    To accelerate the existing Broad Learning System (BLS) for newly added nodes in [7], we extend the inverse Cholesky factorization in [10] to deduce an efficient inverse Cholesky factorization for a Hermitian matrix partitioned into 2 × 2 blocks, which is utilized to develop the proposed BLS algorithm 1. The proposed BLS algorithm 1 computes the ridge solution (i.e., the output weights) from the inverse Cholesky factor of the Hermitian matrix in the ridge inverse, and updates the inverse Cholesky factor efficiently. From the proposed BLS algorithm 1, we deduce the proposed ridge inverse, which can be obtained from the generalized inverse in [7] by just changing one matrix in the equation to compute the newly added sub-matrix. We also modify the proposed algorithm 1 into the proposed algorithm 2, which is equivalent to the existing BLS algorithm [7] in terms of numerical computations. The proposed algorithms 1 and 2 can reduce the computational complexity, since usually the Hermitian matrix in the ridge inverse is smaller than the ridge inverse. With respect to the existing BLS algorithm, the proposed algorithms 1 and 2 usually require about 1/3 and 2/3 of the complexity, respectively, while in numerical experiments they achieve speedups (in each additional training time) of 2.40 - 2.91 and 1.36 - 1.60, respectively. Numerical experiments also show that the proposed algorithm 1 and the standard ridge solution always bear the same testing accuracy, and usually so do the proposed algorithm 2 and the existing BLS algorithm. The existing BLS assumes the ridge parameter $\lambda \to 0$, since it is based on the generalized inverse with the ridge regression approximation. When the assumption of $\lambda \to 0$ is not satisfied, the standard ridge solution obviously achieves a better testing accuracy than the existing BLS algorithm in numerical experiments.  ( 3 min )
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v1 [cs.LG])
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.  ( 2 min )
    Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data. (arXiv:2211.13116v2 [cs.LG] UPDATED)
    Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL. Existing data augmentation methods based on federated generative models or raw data sharing strategies for solving the non-IID problem still suffer from low performance, privacy protection concerns, and high communication overhead in decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method, named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using some simple statistics (e.g., distributions of each column and global covariance). Specifically, we propose the multimodal distribution transformation and inverse cumulative distribution mapping, which respectively synthesize continuous and discrete columns in tabular data from noise according to the pre-learned statistics. Furthermore, we theoretically analyze that our Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state-of-the-art in test performance and communication efficiency.  ( 2 min )
    Study of software developers' experience using the Github Copilot Tool in the software development process. (arXiv:2301.04991v1 [cs.SE])
    In software development there is constant pressure to produce code faster and faster without compromising on quality. New tools supporting developers are created in response to this demand. Currently a new generation of such solutions is about to be launched: Artificial Intelligence driven tools. On 29 June 2021 GitHub Copilot was announced. It uses a trained model to generate code based on human-understandable language. The focus of this research was to investigate software developers' approach to this tool. For this purpose a survey containing 18 questions was prepared and shared with programmers. A total of 42 answers were gathered. The results of the research indicate that developers' opinions are divided. Most of them had encountered GitHub Copilot before taking the survey. The attitude to the tool was mostly positive but not many participants were willing to use it. Concerns are caused by security issues associated with the use of GitHub Copilot.
    A Stochastic Proximal Polyak Step Size. (arXiv:2301.04935v1 [math.OC])
    Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that includes the non-smooth, smooth, weakly convex and strongly convex setting.
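    For orientation, the classical Polyak-type rule that SPS builds on sets the step to the sampled loss gap divided by the squared gradient norm. The sketch below shows only that basic rule with an optional cap; it is not the ProxSPS update from the paper, and the function name, the `max_step` cap, and the default lower bound of zero are assumptions made for illustration.

```python
def polyak_sgd_step(x, grad, loss_value, loss_lower_bound=0.0,
                    max_step=float("inf")):
    """One SGD update with a Polyak-type step size on a sampled loss.

    step = (f_i(x) - lower bound of f_i) / ||grad f_i(x)||^2, capped at max_step.
    """
    grad_sq_norm = sum(g * g for g in grad)
    if grad_sq_norm == 0.0:
        # Zero gradient: the sampled loss is stationary here, so do not move.
        return list(x)
    step = min((loss_value - loss_lower_bound) / grad_sq_norm, max_step)
    return [xi - step * gi for xi, gi in zip(x, grad)]


# Toy sampled loss f_i(x) = 0.5 * (a*x - b)^2 with a = 2, b = 2 (minimizer x = 1).
a, b, x = 2.0, 2.0, [4.0]
residual = a * x[0] - b
print(polyak_sgd_step(x, [a * residual], 0.5 * residual ** 2))  # [2.5]
```

    On this toy loss the update halves the distance to the minimizer, from 3 to 1.5. The lower bound enters the numerator directly, which is why a loose bound on a regularized objective can distort the step, the issue the abstract raises.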
    Statistical Learning with Sublinear Regret of Propagator Models. (arXiv:2301.05157v1 [q-fin.TR])
    We consider a class of learning problems in which an agent liquidates a risky asset while creating both transient price impact driven by an unknown convolution propagator and linear temporary price impact with an unknown parameter. We characterize the trader's performance as maximization of a revenue-risk functional, where the trader also exploits available information on a price predicting signal. We present a trading algorithm that alternates between exploration and exploitation phases and achieves sublinear regrets with high probability. For the exploration phase we propose a novel approach for non-parametric estimation of the price impact kernel by observing only the visible price process and derive sharp bounds on the convergence rate, which are characterised by the singularity of the propagator. These kernel estimation methods extend existing methods from the area of Tikhonov regularisation for inverse problems and are of independent interest. The bound on the regret in the exploitation phase is obtained by deriving stability results for the optimizer and value function of the associated class of infinite-dimensional stochastic control problems. As a complementary result we propose a regression-based algorithm to estimate the conditional expectation of non-Markovian signals and derive its convergence rate.
    Thompson Sampling with Diffusion Generative Prior. (arXiv:2301.05182v1 [cs.LG])
    In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision making problems. Our special focus is on the meta-learning for bandits framework, with the goal of learning a strategy that performs well across bandit tasks of the same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with the learned prior to deal with new tasks at test time. Our posterior sampling algorithm is designed to carefully balance between the learned prior and the noisy observations that come from the learner's interaction with the environment. To capture realistic bandit scenarios, we also propose a novel diffusion model training procedure that trains even from incomplete and/or noisy data, which could be of independent interest. Finally, our extensive experimental evaluations clearly demonstrate the potential of the proposed approach.
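    As background for the abstract, plain Thompson sampling on a Bernoulli bandit can be sketched in a few lines. The sketch assumes a fixed Beta(1, 1) prior per arm in place of the learned diffusion prior, so it illustrates only the "sample from the posterior, act greedily on the sample, update" loop and not the paper's method; the function name and arguments are hypothetical.

```python
import random


def thompson_sampling(true_means, n_rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return the pull count per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [1] * n_arms  # Beta(1, 1) prior: alpha parameter per arm
    failures = [1] * n_arms   # Beta(1, 1) prior: beta parameter per arm
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw one plausible mean per arm from its current posterior.
        samples = [rng.betavariate(successes[i], failures[i])
                   for i in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        # Noisy observation from the environment.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

    For example, `thompson_sampling([0.1, 0.9], 2000)` returns how often each of the two arms was pulled over 2000 rounds. A more informative prior would change only the initial `successes` and `failures` values here; replacing them with a learned generative prior is the part the paper addresses.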
    Variational Inference: Posterior Threshold Improves Network Clustering Accuracy in Sparse Regimes. (arXiv:2301.04771v1 [stat.ML])
    Variational inference has been widely used in machine learning literature to fit various Bayesian models. In network analysis, this method has been successfully applied to solve the community detection problems. Although these results are promising, their theoretical support is only for relatively dense networks, an assumption that may not hold for real networks. In addition, it has been shown recently that the variational loss surface has many saddle points, which may severely affect its performance, especially when applied to sparse networks. This paper proposes a simple way to improve the variational inference method by hard thresholding the posterior of the community assignment after each iteration. Using a random initialization that correlates with the true community assignment, we show that the proposed method converges and can accurately recover the true community labels, even when the average node degree of the network is bounded. Extensive numerical study further confirms the advantage of the proposed method over the classical variational inference and another state-of-the-art algorithm.  ( 2 min )
    The Berkelmans-Pries Feature Importance Method: A Generic Measure of Informativeness of Features. (arXiv:2301.04740v1 [cs.LG])
    Over the past few years, the use of machine learning models has emerged as a generic and powerful means for prediction purposes. At the same time, there is a growing demand for interpretability of prediction models. To determine which features of a dataset are important to predict a target variable $Y$, a Feature Importance (FI) method can be used. By quantifying how important each feature is for predicting $Y$, irrelevant features can be identified and removed, which could increase the speed and accuracy of a model, and moreover, important features can be discovered, which could lead to valuable insights. A major problem with evaluating FI methods, is that the ground truth FI is often unknown. As a consequence, existing FI methods do not give the exact correct FI values. This is one of the many reasons why it can be hard to properly interpret the results of an FI method. Motivated by this, we introduce a new global approach named the Berkelmans-Pries FI method, which is based on a combination of Shapley values and the Berkelmans-Pries dependency function. We prove that our method has many useful properties, and accurately predicts the correct FI values for several cases where the ground truth FI can be derived in an exact manner. We experimentally show for a large collection of FI methods (468) that existing methods do not have the same useful properties. This shows that the Berkelmans-Pries FI method is a highly valuable tool for analyzing datasets with complex interdependencies.  ( 2 min )
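    The Shapley-value ingredient mentioned in the abstract can be illustrated with an exact computation over all orderings. The value function below is a toy stand-in, not the Berkelmans-Pries dependency function, and the names `shapley_values` and `toy_value` are hypothetical; the approach enumerates every permutation, so it is only practical for a handful of features.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    value maps a frozenset of players to a number, with value(frozenset()) = 0.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_value(coalition):
    """Toy game: the target is predictable only when x1 and x2 are both present."""
    return 1.0 if {"x1", "x2"} <= coalition else 0.0


print(shapley_values(["x1", "x2", "x3"], toy_value))
# {'x1': 0.5, 'x2': 0.5, 'x3': 0.0}
```

    In this toy game the two jointly informative features share the credit equally and the irrelevant feature gets zero, which is the kind of case where the correct importance values can be derived exactly.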
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v1 [cs.DS])
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.  ( 2 min )
    Universality of neural dynamics on complex networks. (arXiv:2301.04900v1 [cond-mat.stat-mech])
    This paper discusses the capacity of graph neural networks to learn the functional form of ordinary differential equations that govern dynamics on complex networks. We propose necessary elements for such a problem, namely, inductive biases, a neural network architecture and a learning task. Statistical learning theory suggests that the generalisation power of neural networks relies on independence and identical distribution (i.i.d.) of training and testing data. Although this assumption together with an appropriate neural architecture and a learning mechanism is sufficient for accurate out-of-sample predictions of dynamics such as mass-action kinetics, by studying the out-of-distribution generalisation in the case of diffusion dynamics, we find that the neural network model: (i) has a generalisation capacity that depends on the first moment of the initial value data distribution; (ii) learns the non-dissipative nature of dynamics implicitly; and (iii) has an accuracy resolution limit of order $\mathcal{O}(1/\sqrt{n})$ for a system of size $n$.  ( 2 min )
    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction. (arXiv:2301.04791v1 [stat.ML])
    Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for redundant projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the "max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Although this framework is efficient, its first issue is that it violates the permutation invariance and symmetry properties. To address the issue, we propose to design amortized models based on self-attention architecture. Moreover, we adopt efficient self-attention architectures to make the computation linear in the number of supports. Secondly, Max-SW and its amortized version cannot guarantee the metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when predicting the best discriminative projecting distribution. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.  ( 2 min )
    Multimodal Deep Learning. (arXiv:2301.04856v1 [cs.CL])
    This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet.  ( 2 min )
  • Open

    Robust Phi-Divergence MDPs. (arXiv:2205.14202v2 [math.OC] UPDATED)
    In recent years, robust Markov decision processes (MDPs) have emerged as a prominent modeling framework for dynamic decision problems affected by uncertainty. In contrast to classical MDPs, which only account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, robust MDPs additionally account for ambiguity by optimizing in view of the most adverse transition kernel from a prescribed ambiguity set. In this paper, we develop a novel solution framework for robust MDPs with s-rectangular ambiguity sets that decomposes the problem into a sequence of robust Bellman updates and simplex projections. Exploiting the rich structure present in the simplex projections corresponding to phi-divergence ambiguity sets, we show that the associated s-rectangular robust MDPs can be solved substantially faster than with state-of-the-art commercial solvers as well as a recent first-order solution scheme, thus rendering them attractive alternatives to classical MDPs in practical applications.  ( 2 min )
    See, Think, Confirm: Interactive Prompting Between Vision and Language Models for Knowledge-based Visual Reasoning. (arXiv:2301.05226v1 [cs.CV])
    Large pre-trained vision and language models have demonstrated remarkable capacities for various tasks. However, solving knowledge-based visual reasoning tasks remains challenging, as it requires a model to comprehensively understand image content, connect external world knowledge, and perform step-by-step reasoning to answer the questions correctly. To this end, we propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning. IPVR contains three stages: see, think and confirm. The see stage scans the image and grounds the visual concept candidates with a visual perception model. The think stage adopts a pre-trained large language model (LLM) to attend to the key concepts from candidates adaptively. It then transforms them into text context for prompting with a visual captioning model and adopts the LLM to generate the answer. The confirm stage further uses the LLM to generate the supporting rationale for the answer, verifies the generated rationale with a cross-modality classifier and ensures that the rationale can infer the predicted output consistently. We conduct experiments on a range of knowledge-based visual reasoning datasets. We found that IPVR enjoys several benefits: (1) it achieves better performance than the previous few-shot learning baselines; (2) it enjoys the total transparency and trustworthiness of the whole reasoning process by providing rationales for each reasoning step; (3) it is computation-efficient compared with other fine-tuning baselines.  ( 2 min )
    Masked Autoencoders that Listen. (arXiv:2207.06405v3 [cs.SD] UPDATED)
    This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.  ( 2 min )
    Model reduction for the material point method via an implicit neural representation of the deformation map. (arXiv:2109.12390v3 [cs.LG] UPDATED)
    This work proposes a model-reduction approach for the material point method on nonlinear manifolds. Our technique approximates the kinematics by approximating the deformation map using an implicit neural representation that restricts deformation trajectories to reside on a low-dimensional manifold. By explicitly approximating the deformation map, its spatiotemporal gradients -- in particular the deformation gradient and the velocity -- can be computed via analytical differentiation. In contrast to typical model-reduction techniques that construct a linear or nonlinear manifold to approximate the (finite number of) degrees of freedom characterizing a given spatial discretization, the use of an implicit neural representation enables the proposed method to approximate the continuous deformation map. This allows the kinematic approximation to remain agnostic to the discretization. Consequently, the technique supports dynamic discretizations -- including resolution changes -- during the course of the online reduced-order-model simulation. To generate dynamics for the generalized coordinates, we propose a family of projection techniques. At each time step, these techniques: (1) Calculate full-space kinematics at quadrature points, (2) Calculate the full-space dynamics for a subset of 'sample' material points, and (3) Calculate the reduced-space dynamics by projecting the updated full-space position and velocity onto the low-dimensional manifold and tangent space, respectively. We achieve significant computational speedup via hyper-reduction that ensures all three steps execute on only a small subset of the problem's spatial domain. Large-scale numerical examples with millions of material points illustrate the method's ability to gain an order of magnitude computational-cost saving -- indeed real-time simulations -- with negligible errors.  ( 3 min )
    Adversarial Adaptation for French Named Entity Recognition. (arXiv:2301.05220v1 [cs.CL])
    Named Entity Recognition (NER) is the task of identifying and classifying named entities in large-scale texts into predefined classes. NER in French and other relatively limited-resource languages cannot always benefit from approaches proposed for languages like English due to a dearth of large, robust datasets. In this paper, we present our work that aims to mitigate the effects of this dearth of large, labeled datasets. We propose a Transformer-based NER approach for French, using adversarial adaptation to similar domain or general corpora to improve feature extraction and enable better generalization. Our approach allows learning better features using large-scale unlabeled corpora from the same domain or mixed domains to introduce more variations during training and reduce overfitting. Experimental results on three labeled datasets show that our adaptation framework outperforms the corresponding non-adaptive models for various combinations of Transformer models, source datasets, and target corpora. We also show that adversarial adaptation to large-scale unlabeled corpora can help mitigate the performance dip incurred on using Transformer models pre-trained on smaller corpora.  ( 2 min )
    MANAS: Multi-Agent Neural Architecture Search. (arXiv:1909.01051v4 [cs.CV] UPDATED)
    The Neural Architecture Search (NAS) problem is typically formulated as a graph search problem where the goal is to learn the optimal operations over edges in order to maximise a graph-level global objective. Due to the large architecture parameter space, efficiency is a key bottleneck preventing NAS from its practical use. In this paper, we address the issue by framing NAS as a multi-agent problem where agents control a subset of the network and coordinate to reach optimal architectures. We provide two distinct lightweight implementations, with reduced memory requirements (1/8th of state-of-the-art), and performances above those of much more computationally expensive methods. Theoretically, we demonstrate vanishing regrets of the form $O(\sqrt{T})$, with $T$ being the total number of rounds. Finally, aware that random search is an effective but often ignored baseline, we perform additional experiments on 3 alternative datasets and 2 network configurations, and achieve favourable results in comparison.  ( 2 min )
    Anomalies, Representations, and Self-Supervision. (arXiv:2301.04660v1 [hep-ph])
    We develop a self-supervised method for density-based anomaly detection using contrastive learning, and test it using event-level anomaly data from CMS ADC2021. The AnomalyCLR technique is data-driven and uses augmentations of the background data to mimic non-Standard-Model events in a model-agnostic way. It uses a permutation-invariant Transformer Encoder architecture to map the objects measured in a collider event to the representation space, where the data augmentations define a representation space which is sensitive to potential anomalous features. An AutoEncoder trained on background representations then computes anomaly scores for a variety of signals in the representation space. With AnomalyCLR we find significant improvements on performance metrics for all signals when compared to the raw data baseline.  ( 2 min )
    Time Series Clustering with an EM algorithm for Mixtures of Linear Gaussian State Space Models. (arXiv:2208.11907v2 [cs.LG] UPDATED)
    In this paper, we consider the task of clustering a set of individual time series while modeling each cluster, that is, model-based time series clustering. The task requires a parametric model with sufficient flexibility to describe the dynamics in various time series. To address this problem, we propose a novel model-based time series clustering method with mixtures of linear Gaussian state space models, which have high flexibility. The proposed method uses a new expectation-maximization algorithm for the mixture model to estimate the model parameters, and determines the number of clusters using the Bayesian information criterion. Experiments on a simulated dataset demonstrate the effectiveness of the method in clustering, parameter estimation, and model selection. The method is applied to a real dataset for which previously proposed time series clustering methods exhibited low accuracy. Results showed that our method produces more accurate clustering results than those obtained using the previous methods.  ( 2 min )
    RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?. (arXiv:2108.04384v3 [cs.CV] UPDATED)
    For the past ten years, CNN has reigned supreme in the world of computer vision, but recently, Transformer has been on the rise. However, the quadratic computational cost of self-attention has become a serious problem in practical applications. There has been much research on architectures without CNN and self-attention in this context. In particular, MLP-Mixer is a simple architecture designed using MLPs that achieves accuracy comparable to the Vision Transformer. However, the only inductive bias in this architecture is the embedding of tokens. This leaves open the possibility of incorporating a non-convolutional (or non-local) inductive bias into the architecture, so we used two simple ideas to incorporate inductive bias into the MLP-Mixer while taking advantage of its ability to capture global correlations. One is to divide the token-mixing block vertically and horizontally. The other is to make spatial correlations denser among some channels of token-mixing. With this approach, we were able to improve the accuracy of the MLP-Mixer while reducing its parameters and computational complexity. The small model, RaftMLP-S, is comparable to the state-of-the-art global MLP-based model in terms of parameters and efficiency per calculation. In addition, we tackled the problem of fixed input image resolution for global MLP-based models by utilizing bicubic interpolation. We demonstrated that these models could be applied as the backbone of architectures for downstream tasks such as object detection. However, they did not achieve significant performance there, and we note the need for MLP-specific architectures for downstream tasks for global MLP-based models. The source code in PyTorch version is available at https://github.com/okojoalg/raft-mlp.  ( 3 min )
    Modeling the evolution of temporal knowledge graphs with uncertainty. (arXiv:2301.04977v1 [cs.LG])
    Forecasting future events is a fundamental challenge for temporal knowledge graphs (tKG). In real-life settings, predicting a mean function is usually not sufficient; the question remains how confident we can be about a prediction. Thus, in this work, we introduce a novel graph neural network architecture (WGP-NN) employing (weighted) Gaussian processes (GP) to jointly model the temporal evolution of the occurrence probability of events and their time-dependent uncertainty. In particular, we employ Gaussian processes to model the uncertainty of future links through their ability to produce a predictive variance. This is in contrast to existing works, which are only able to express uncertainties in the learned entity representations. Moreover, WGP-NN can model parameter-free complex temporal and structural dynamics of tKGs in continuous time. We further demonstrate the model's state-of-the-art performance on two real-world benchmark datasets.  ( 2 min )
    Neural Systematic Binder. (arXiv:2211.01177v2 [cs.CV] UPDATED)
    The key to high-level cognition is believed to be the ability to systematically manipulate and compose knowledge pieces. While token-like structured knowledge representations are naturally provided in text, it is elusive how to obtain them for unstructured modalities such as scene images. In this paper, we propose a neural mechanism called Neural Systematic Binder or SysBinder for constructing a novel structured representation called Block-Slot Representation. In Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, to facilitate systematic generalization. SysBinder obtains this structure in an unsupervised way by alternatingly applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any arbitrary neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than the conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations.  ( 2 min )
    Meta-DMoE: Adapting to Domain Shift by Meta-Distillation from Mixture-of-Experts. (arXiv:2210.03885v2 [cs.LG] UPDATED)
    In this paper, we tackle the problem of domain shift. Most existing methods perform training on multiple source domains using a single model, and the same trained model is used on all unseen target domains. Such solutions are sub-optimal as each target domain exhibits its own specialty, which is not adapted. Furthermore, expecting single-model training to learn extensive knowledge from multiple source domains is counterintuitive. The model is more biased toward learning only domain-invariant features and may result in negative knowledge transfer. In this work, we propose a novel framework for unsupervised test-time adaptation, which is formulated as a knowledge distillation process to address domain shift. Specifically, we incorporate Mixture-of-Experts (MoE) as teachers, where each expert is separately trained on different source domains to maximize their specialty. Given a test-time target domain, a small set of unlabeled data is sampled to query the knowledge from MoE. As the source domains are correlated to the target domains, a transformer-based aggregator then combines the domain knowledge by examining the interconnection among them. The output is treated as a supervision signal to adapt a student prediction network toward the target domain. We further employ meta-learning to enforce the aggregator to distill positive knowledge and the student network to achieve fast adaptation. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art and validates the effectiveness of each proposed component. Our code is available at https://github.com/n3il666/Meta-DMoE.  ( 2 min )
    Sequencer: Deep LSTM for Image Classification. (arXiv:2205.01972v4 [cs.CV] UPDATED)
    In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly revolutionized various architectural design efforts: ViT achieved state-of-the-art image classification performance using self-attention found in natural language processing, and MLP-Mixer achieved competitive performance using simple multi-layer perceptrons. In contrast, several studies have also suggested that carefully redesigned convolutional neural networks (CNNs) can achieve advanced performance comparable to ViT without resorting to these new ideas. Against this background, there is growing interest in what inductive bias is suitable for computer vision. Here we propose Sequencer, a novel and competitive architecture alternative to ViT that provides a new perspective on these issues. Unlike ViTs, Sequencer models long-range dependencies using LSTMs rather than self-attention layers. We also propose a two-dimensional version of Sequencer module, where an LSTM is decomposed into vertical and horizontal LSTMs to enhance performance. Despite its simplicity, several experiments demonstrate that Sequencer performs impressively well: Sequencer2D-L, with 54M parameters, realizes 84.6% top-1 accuracy on only ImageNet-1K. Not only that, we show that it has good transferability and the robust resolution adaptability on double resolution-band.  ( 2 min )
    A Network Science perspective of Graph Convolutional Networks: A survey. (arXiv:2301.04824v1 [cs.SI])
    The mining and exploitation of graph structural information have been the focal points in the study of complex networks. Traditional structural measures in Network Science focus on the analysis and modelling of complex networks from the perspective of network structure, such as the centrality measures, the clustering coefficient, and motifs and graphlets, and they have become basic tools for studying and understanding graphs. In comparison, graph neural networks, especially graph convolutional networks (GCNs), are particularly effective at integrating node features into graph structures via neighbourhood aggregation and message passing, and have been shown to significantly improve the performances in a variety of learning tasks. These two classes of methods are, however, typically treated separately with limited references to each other. In this work, aiming to establish relationships between them, we provide a network science perspective of GCNs. Our novel taxonomy classifies GCNs from three structural information angles, i.e., the layer-wise message aggregation scope, the message content, and the overall learning scope. Moreover, as a prerequisite for reviewing GCNs via a network science perspective, we also summarise traditional structural measures and propose a new taxonomy for them. Finally and most importantly, we draw connections between traditional structural approaches and graph convolutional networks, and discuss potential directions for future research.  ( 2 min )
    Fed-TDA: Federated Tabular Data Augmentation on Non-IID Data. (arXiv:2211.13116v2 [cs.LG] UPDATED)
    Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), which usually hampers the optimization convergence and the performance of FL. Existing data augmentation methods based on federated generative models or raw data sharing strategies for solving the non-IID problem still suffer from low performance, privacy protection concerns, and high communication overhead in decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method, named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for data augmentation using some simple statistics (e.g., distributions of each column and global covariance). Specifically, we propose the multimodal distribution transformation and inverse cumulative distribution mapping to synthesize, respectively, continuous and discrete columns in tabular data from noise according to the pre-learned statistics. Furthermore, we show theoretically that Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlation between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state-of-the-art in test performance and communication efficiency.  ( 2 min )
    Data-centric AI: Perspectives and Challenges. (arXiv:2301.04819v1 [cs.AI])
    The role of data in building AI systems has recently been significantly magnified by the emerging concept of data-centric AI (DCAI), which advocates a fundamental shift from model advancements to ensuring data quality and reliability. Although our community has continuously invested efforts into enhancing data in different aspects, they are often isolated initiatives on specific tasks. To facilitate the collective initiative in our community and push forward DCAI, we draw a big picture and bring together three general missions: training data development, evaluation data development, and data maintenance. We provide a top-level discussion on representative DCAI tasks and share perspectives. Finally, we list open challenges to motivate future exploration.  ( 2 min )
    Optirank: classification for RNA-Seq data with optimal ranking reference genes. (arXiv:2301.04653v1 [q-bio.GN])
    Classification algorithms using RNA-Sequencing (RNA-Seq) data as input are used in a variety of biological applications. By nature, RNA-Seq data is subject to uncontrolled fluctuations both within and especially across datasets, which presents a major difficulty for a trained classifier to generalize to an external dataset. Replacing raw gene counts with the rank of gene counts inside an observation has proven effective to mitigate this problem. However, the rank of a feature is by definition relative to all other features, including highly variable features that introduce noise in the ranking. To address this problem and obtain more robust ranks, we propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking. We show the effectiveness of this method on simulated data. We also consider real classification tasks, which present different kinds of distribution shifts between train and test data. Those tasks concern a variety of applications, such as cancer of unknown primary classification, identification of specific gene signatures, and determination of cell type in single-cell RNA-Seq datasets. On those real tasks, optirank performs at least as well as the vanilla logistic regression on classical ranks, while producing sparser solutions. In addition, to increase the robustness against dataset shifts, we propose a multi-source learning scheme and demonstrate its effectiveness when used in combination with rank-based classifiers.  ( 2 min )
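    The classical within-observation rank transform that the abstract above builds on can be sketched as follows. This is a minimal illustration only: it ranks each gene against all other genes in the same sample and does not implement optirank's learned reference set. The function name `rank_transform` and the toy counts are hypothetical.

```python
def rank_transform(counts):
    """Replace each raw count by its rank within one observation.

    Ranks start at 0 for the smallest count. Ties are broken by
    position, because Python's sort is stable.
    """
    order = sorted(range(len(counts)), key=lambda i: counts[i])
    ranks = [0] * len(counts)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


# Two toy samples (made-up numbers): the second is the first scaled by 10,
# mimicking a dataset-level fluctuation in sequencing depth.
sample_a = [10, 200, 35, 0]
sample_b = [100, 2000, 350, 0]

ranks_a = rank_transform(sample_a)
ranks_b = rank_transform(sample_b)
```

    Both toy samples map to the same rank vector, which is why rank features are insensitive to this kind of monotone rescaling.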
    Effective Decision Boundary Learning for Class Incremental Learning. (arXiv:2301.05180v1 [cs.LG])
    Rehearsal approaches in class incremental learning (CIL) suffer from decision boundary overfitting to new classes, which is mainly caused by two factors: insufficiency of old classes data for knowledge distillation and imbalanced data learning between the learned and new classes because of the limited storage memory. In this work, we present a simple but effective approach to tackle these two factors. First, we employ a re-sampling strategy and Mixup Knowledge Distillation (Re-MKD) to improve the performances of KD, which would greatly alleviate the overfitting problem. Specifically, we combine mixup and re-sampling strategies to synthesize adequate data used in KD training that are more consistent with the latent distribution between the learned and new classes. Second, we propose a novel incremental influence balance (IIB) method for CIL to tackle the classification of imbalanced data by extending the influence balance method into the CIL setting, which re-weights samples by their influences to create a proper decision boundary. With these two improvements, we present the effective decision boundary learning algorithm (EDBL) which improves the performance of KD and deals with the imbalanced data learning simultaneously. Experiments show that the proposed EDBL achieves state-of-the-art performances on several CIL benchmarks.
    Second-Order Mirror Descent: Convergence in Games Beyond Averaging and Discounting. (arXiv:2111.09982v3 [math.OC] UPDATED)
    In this paper, we propose a second-order extension of the continuous-time game-theoretic mirror descent (MD) dynamics, referred to as MD2, which provably converges to mere (but not necessarily strict) variationally stable states (VSS) without using common auxiliary techniques such as time-averaging or discounting. We show that MD2 enjoys no-regret as well as an exponential rate of convergence towards strong VSS upon a slight modification. MD2 can also be used to derive many novel continuous-time primal-space dynamics. We then use stochastic approximation techniques to provide a convergence guarantee of discrete-time MD2 with noisy observations towards interior mere VSS. Selected simulations are provided to illustrate our results.
    Improving Axial-Attention Network Classification via Cross-Channel Weight Sharing. (arXiv:2110.01185v2 [cs.CV] UPDATED)
    In recent years, hypercomplex-inspired neural networks (HCNNs) have been used to improve deep learning architectures due to their ability to enable channel-based weight sharing, treat colors as a single entity, and improve representational coherence within the layers. The work described herein studies the effect of replacing existing layers in an Axial Attention network with their representationally coherent variants to assess the effect on image classification. We experiment with the stem of the network, the bottleneck layers, and the fully connected backend, by replacing them with representationally coherent variants. These various modifications lead to novel architectures which all yield improved accuracy performance on the ImageNet300k classification dataset. Our baseline networks for comparison were the original real-valued ResNet, the original quaternion-valued ResNet, and the Axial Attention ResNet. Since improvement was observed regardless of which part of the network was modified, there is a promise that this technique may be generally useful in improving classification accuracy for a large class of networks.
    Growing Cosine Unit: A Novel Oscillatory Activation Function That Can Speedup Training and Reduce Parameters in Convolutional Neural Networks. (arXiv:2108.12943v3 [cs.LG] CROSS LISTED)
    Convolutional neural networks have been successful in solving many socially important and economically significant problems. This ability to learn complex high-dimensional functions hierarchically can be attributed to the use of nonlinear activation functions. A key discovery that made training deep networks feasible was the adoption of the Rectified Linear Unit (ReLU) activation function to alleviate the vanishing gradient problem caused by using saturating activation functions. Since then, many improved variants of the ReLU activation have been proposed. However, a majority of activation functions used today are non-oscillatory and monotonically increasing due to their biological plausibility. This paper demonstrates that oscillatory activation functions can improve gradient flow and reduce network size. Two theorems on limits of non-oscillatory activation functions are presented. A new oscillatory activation function called the Growing Cosine Unit (GCU), defined as $C(z) = z\cos z$, is presented; it outperforms Sigmoid, Swish, Mish and ReLU on a variety of architectures and benchmarks. The GCU activation has multiple zeros, enabling single GCU neurons to have multiple hyperplanes in the decision boundary. This allows single GCU neurons to learn the XOR function without feature engineering. Experimental results indicate that replacing the activation function in the convolution layers with the GCU activation function significantly improves performance on CIFAR-10, CIFAR-100 and Imagenette.
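    As an illustration of the XOR claim, the sketch below evaluates $C(z) = z\cos z$ for a single neuron. The weights and bias (`W1`, `W2`, `BIAS`) are hand-picked for this example rather than learned, and they are not taken from the paper; they only show that one GCU neuron can separate the XOR classes by the sign of its output.

```python
import math


def gcu(z: float) -> float:
    """Growing Cosine Unit activation: C(z) = z * cos(z)."""
    return z * math.cos(z)


# Hand-picked (not learned) parameters; an assumption made for illustration.
W1, W2, BIAS = 2.0, 2.0, -1.0


def xor_neuron(x1: int, x2: int) -> int:
    """Single GCU neuron; the class is the sign of the activation."""
    z = W1 * x1 + W2 * x2 + BIAS
    return 1 if gcu(z) > 0 else 0


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xor_neuron(x1, x2))
```

    The pre-activation takes the values -1, 1, 1 and 3 on the four inputs, and $z\cos z$ is negative at -1 and 3 but positive at 1, which a monotonic activation could not achieve with one neuron.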
    Bayesian inference via sparse Hamiltonian flows. (arXiv:2203.05723v2 [stat.ML] UPDATED)
    A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference, with the goal of reducing computational cost. Although past work has shown empirically that there often exists a coreset with low inferential error, efficiently constructing such a coreset remains a challenge. Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence. In this work, we introduce a new method -- sparse Hamiltonian flows -- that addresses all three of these challenges. The method involves first subsampling the data uniformly, and then optimizing a Hamiltonian flow parametrized by coreset weights and including periodic momentum quasi-refreshment steps. Theoretical results show that the method enables an exponential compression of the dataset in a representative model, and that the quasi-refreshment steps reduce the KL divergence to the target. Real and synthetic experiments demonstrate that sparse Hamiltonian flows provide accurate posterior approximations with significantly reduced runtime compared with competing dynamical-system-based inference methods.
    LB-SimTSC: An Efficient Similarity-Aware Graph Neural Network for Semi-Supervised Time Series Classification. (arXiv:2301.04838v1 [cs.LG])
    Time series classification is an important data mining task that has received a lot of interest in the past two decades. Because labels are scarce in practice, semi-supervised time series classification with only a few labeled samples has become popular. Recently, Similarity-aware Time Series Classification (SimTSC) was proposed to address this problem by using a graph neural network classification model on the graph generated from pairwise Dynamic Time Warping (DTW) distances of batch data. It shows excellent accuracy and outperforms state-of-the-art deep learning models in several few-label settings. However, since SimTSC relies on pairwise DTW distances, the quadratic complexity of DTW limits its usability to only reasonably sized datasets. To address this challenge, we propose a new efficient semi-supervised time series classification technique, LB-SimTSC, with a new graph construction module. Instead of using DTW, we propose to utilize a lower bound of DTW, LB_Keogh, to approximate the dissimilarity between instances in linear time, while retaining the relative proximity relationships one would have obtained via computing DTW. We construct the pairwise distance matrix using LB_Keogh and build a graph for the graph neural network. We apply this approach to the ten largest datasets from the well-known UCR time series classification archive. The results demonstrate that this approach can be up to 104x faster than SimTSC when constructing the graph on large datasets without significantly decreasing classification accuracy.
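    The LB_Keogh bound mentioned above can be sketched as follows. This is a minimal version of the standard definition (an envelope around one series, then the distance of the other series to that envelope), not the authors' implementation; the function names and the `radius` warping-window parameter are choices made for this example, and the naive envelope computation here costs O(n * radius) rather than strictly linear time.

```python
import math
from typing import List, Sequence, Tuple


def envelope(series: Sequence[float], radius: int) -> Tuple[List[float], List[float]]:
    """Upper and lower envelope of `series` within a warping window of `radius`."""
    n = len(series)
    upper = [max(series[max(0, i - radius): i + radius + 1]) for i in range(n)]
    lower = [min(series[max(0, i - radius): i + radius + 1]) for i in range(n)]
    return upper, lower


def lb_keogh(query: Sequence[float], candidate: Sequence[float], radius: int) -> float:
    """LB_Keogh lower bound on the DTW distance between equal-length series."""
    upper, lower = envelope(query, radius)
    total = 0.0
    for c, u, l in zip(candidate, upper, lower):
        if c > u:          # candidate point above the envelope
            total += (c - u) ** 2
        elif c < l:        # candidate point below the envelope
            total += (c - l) ** 2
    return math.sqrt(total)
```

    Points of the candidate that fall inside the envelope contribute nothing, which is why the value never exceeds the window-constrained DTW distance and can be used as a cheap proxy when building the pairwise distance matrix.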
    Learning to compile smartly for program size reduction. (arXiv:2301.05104v1 [cs.PL])
    Compiler optimization passes are an important tool for improving program efficiency and reducing program size, but manually selecting optimization passes can be time-consuming and error-prone. While human experts have identified a few fixed sequences of optimization passes (e.g., the Clang -Oz passes) that perform well for a wide variety of programs, these sequences are not conditioned on specific programs. In this paper, we propose a novel approach that learns a policy to select passes for program size reduction, allowing for customization and adaptation to specific programs. Our approach uses a search mechanism that helps identify useful pass sequences and a GNN with customized attention that selects the optimal sequence to use. Crucially it is able to generalize to new, unseen programs, making it more flexible and general than previous approaches. We evaluate our approach on a range of programs and show that it leads to size reduction compared to traditional optimization techniques. Our results demonstrate the potential of a single policy that is able to optimize many programs.
    Understanding Difficulty-based Sample Weighting with a Universal Difficulty Measure. (arXiv:2301.04850v1 [cs.LG])
    Sample weighting is widely used in deep learning. A large number of weighting methods essentially utilize the learning difficulty of training samples to calculate their weights. In this study, this scheme is called difficulty-based weighting. Two important issues arise when explaining this scheme. First, a unified difficulty measure that can be theoretically guaranteed for training samples does not exist. The learning difficulties of the samples are determined by multiple factors including noise level, imbalance degree, margin, and uncertainty. Nevertheless, existing measures consider only a single factor or a subset of these factors, not all of them together. Second, a comprehensive theoretical explanation is lacking with respect to demonstrating why difficulty-based weighting schemes are effective in deep learning. In this study, we theoretically prove that the generalization error of a sample can be used as a universal difficulty measure. Furthermore, we provide formal theoretical justifications on the role of difficulty-based weighting for deep learning, consequently revealing its positive influences on both the optimization dynamics and generalization performance of deep models, which is instructive to existing weighting schemes.
    NOPA: Neurally-guided Online Probabilistic Assistance for Building Socially Intelligent Home Assistants. (arXiv:2301.05223v1 [cs.RO])
    In this work, we study how to build socially intelligent robots to assist people in their homes. In particular, we focus on assistance with online goal inference, where robots must simultaneously infer humans' goals and how to help them achieve those goals. Prior assistance methods either lack the adaptivity to adjust helping strategies (i.e., when and how to help) in response to uncertainty about goals or the scalability to conduct fast inference in a large goal space. Our NOPA (Neurally-guided Online Probabilistic Assistance) method addresses both of these challenges. NOPA consists of (1) an online goal inference module combining neural goal proposals with inverse planning and particle filtering for robust inference under uncertainty, and (2) a helping planner that discovers valuable subgoals to help with and is aware of the uncertainty in goal inference. We compare NOPA against multiple baselines in a new embodied AI assistance challenge: Online Watch-And-Help, in which a helper agent needs to simultaneously watch a main agent's action, infer its goal, and help perform a common household task faster in realistic virtual home environments. Experiments show that our helper agent robustly updates its goal inference and adapts its helping plans to the changing level of uncertainty.
    Exploration in Deep Reinforcement Learning: A Comprehensive Survey. (arXiv:2109.06668v5 [cs.AI] UPDATED)
    Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant successes across a wide range of domains, including game AI, autonomous vehicles, robotics, and so on. However, DRL and deep MARL agents are widely known to be so sample-inefficient that millions of interactions are usually needed even for relatively simple problem settings, which prevents wide application and deployment in real industry scenarios. One key bottleneck is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that could benefit policy learning towards the optimal ones. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey on existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. According to our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
    Signed Directed Graph Contrastive Learning with Laplacian Augmentation. (arXiv:2301.05163v1 [cs.LG])
    Graph contrastive learning has become a powerful technique for several graph mining tasks. It learns discriminative representations from different perspectives of augmented graphs. Ubiquitous in our daily life, signed-directed graphs are the most complex and tricky to analyze among various graph types. That is why signed-directed graph contrastive learning has not been studied much yet, while there are many contrastive studies for unsigned and undirected graphs. Thus, this paper proposes a novel signed-directed graph contrastive learning method, SDGCL. It makes two different structurally perturbed graph views and gets node representations via magnetic Laplacian perturbation. We use a node-level contrastive loss to maximize the mutual information between the two graph views. The model is jointly learned with contrastive and supervised objectives. The graph encoder of SDGCL does not depend on social theories or predefined assumptions. Therefore it does not require finding triads or selecting neighbors to aggregate. It leverages only the edge signs and directions via the magnetic Laplacian. To the best of our knowledge, it is the first to introduce magnetic Laplacian perturbation and signed spectral graph contrastive learning. The superiority of the proposed model is demonstrated through exhaustive experiments on four real-world datasets. SDGCL shows better performance than other state-of-the-art methods on four evaluation metrics.
    PiFold: Toward effective and efficient protein inverse folding. (arXiv:2209.12643v3 [cs.AI] UPDATED)
    How can we design protein sequences folding into the desired structures effectively and efficiently? Structure-based protein design has attracted increasing attention in recent years; however, few methods can simultaneously improve accuracy and efficiency, because they lack expressive features and rely on autoregressive sequence decoders. To address these issues, we propose PiFold, which contains a novel residue featurizer and PiGNN layers to generate protein sequences in a one-shot way with improved recovery. Experiments show that PiFold could achieve 51.66% recovery on CATH 4.2, while the inference speed is 70 times faster than the autoregressive competitors. In addition, PiFold achieves 58.72% and 60.42% recovery scores on TS50 and TS500, respectively. We conduct comprehensive ablation studies to reveal the role of different types of protein features and model designs, inspiring further simplification and improvement.
    Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study. (arXiv:2301.05174v1 [cs.IR])
    Most approaches to cross-modal retrieval (CMR) focus either on object-centric datasets, meaning that each document depicts or describes a single object, or on scene-centric datasets, meaning that each image depicts or describes a complex scene that involves multiple objects and relations between them. We posit that a robust CMR model should generalize well across both dataset types. Despite recent advances in CMR, the reproducibility of the results and their generalizability across different dataset types has not been studied before. We address this gap and focus on the reproducibility of the state-of-the-art CMR results when evaluated on object-centric and scene-centric datasets. We select two state-of-the-art CMR models with different architectures: (i) CLIP; and (ii) X-VLM. Additionally, we select two scene-centric datasets, and three object-centric datasets, and determine the relative performance of the selected models on these datasets. We focus on reproducibility, replicability, and generalizability of the outcomes of previously published CMR experiments. We discover that the experiments are not fully reproducible and replicable. Besides, the relative performance results partially generalize across object-centric and scene-centric datasets. On top of that, the scores obtained on object-centric datasets are much lower than the scores obtained on scene-centric datasets. For reproducibility and transparency we make our source code and the trained models publicly available.
    Statistical Learning with Sublinear Regret of Propagator Models. (arXiv:2301.05157v1 [q-fin.TR])
    We consider a class of learning problems in which an agent liquidates a risky asset while creating both transient price impact driven by an unknown convolution propagator and linear temporary price impact with an unknown parameter. We characterize the trader's performance as maximization of a revenue-risk functional, where the trader also exploits available information on a price predicting signal. We present a trading algorithm that alternates between exploration and exploitation phases and achieves sublinear regrets with high probability. For the exploration phase we propose a novel approach for non-parametric estimation of the price impact kernel by observing only the visible price process and derive sharp bounds on the convergence rate, which are characterised by the singularity of the propagator. These kernel estimation methods extend existing methods from the area of Tikhonov regularisation for inverse problems and are of independent interest. The bound on the regret in the exploitation phase is obtained by deriving stability results for the optimizer and value function of the associated class of infinite-dimensional stochastic control problems. As a complementary result we propose a regression-based algorithm to estimate the conditional expectation of non-Markovian signals and derive its convergence rate.
    Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Pruning. (arXiv:2301.05219v1 [cs.CV])
    The state of neural network pruning has been noticed to be unclear and even confusing for a while, largely due to "a lack of standardized benchmarks and metrics" [3]. To standardize benchmarks, first, we need to answer: what kind of comparison setup is considered fair? This basic yet crucial question has barely been clarified in the community, unfortunately. Meanwhile, we observe several papers have used (severely) sub-optimal hyper-parameters in pruning experiments, while the reason behind them is also elusive. These sub-optimal hyper-parameters further exacerbate the distorted benchmarks, rendering the state of neural network pruning even more obscure. Two mysteries in pruning represent such a confusing status: the performance-boosting effect of a larger finetuning learning rate, and the no-value argument of inheriting pretrained weights in filter pruning. In this work, we attempt to explain the confusing state of network pruning by demystifying the two mysteries. Specifically, (1) we first clarify the fairness principle in pruning experiments and summarize the widely-used comparison setups; (2) then we unveil the two pruning mysteries and point out the central role of network trainability, which has not been well recognized so far; (3) finally, we conclude the paper and give some concrete suggestions regarding how to calibrate the pruning benchmarks in the future. Code: https://github.com/mingsun-tse/why-the-state-of-pruning-so-confusing.
    A Cognitive Evaluation of Instruction Generation Agents tl;dr They Need Better Theory-of-Mind Capabilities. (arXiv:2301.05149v1 [cs.CL])
    We mathematically characterize the cognitive capabilities that enable humans to effectively guide others through natural language. We show that neural-network-based instruction generation agents possess similar cognitive capabilities, and design an evaluation scheme for probing those capabilities. Our results indicate that these agents, while capable of effectively narrowing the search space, poorly predict the listener's interpretations of their instructions and thus often fail to select the best instructions even from a small candidate set. We augment the agents with better theory-of-mind models of the listener and obtain a significant performance boost in guiding real humans. Yet, there remains a considerable gap between our best agent and human guides. We discuss the challenges in closing this gap, emphasizing the need to construct better models of human behavior when interacting with AI-based agents.
    Fairly Private: Investigating The Fairness of Visual Privacy Preservation Algorithms. (arXiv:2301.05012v1 [cs.CV])
    As the privacy risks posed by camera surveillance and facial recognition have grown, so has the research into privacy preservation algorithms. Among these, visual privacy preservation algorithms attempt to impart bodily privacy to subjects in visuals by obfuscating privacy-sensitive areas. While disparate performances of facial recognition systems across phenotypes are the subject of much study, its counterpart, privacy preservation, is not commonly analysed from a fairness perspective. In this paper, the fairness of commonly used visual privacy preservation algorithms is investigated through the performances of facial recognition models on obfuscated images. Experiments on the PubFig dataset clearly show that the privacy protection provided is unequal across groups.
    Tracr: Compiled Transformers as a Laboratory for Interpretability. (arXiv:2301.05062v1 [cs.LG])
    Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
    Choose, not Hoard: Information-to-Model Matching for Artificial Intelligence in O-RAN. (arXiv:2208.04229v2 [cs.NI] UPDATED)
    Open Radio Access Network (O-RAN) is an emerging paradigm, whereby virtualized network infrastructure elements from different vendors communicate via open, standardized interfaces. A key element therein is the RAN Intelligent Controller (RIC), an Artificial Intelligence (AI)-based controller. Traditionally, all data available in the network has been used to train a single AI model to be used at the RIC. This paper introduces, discusses, and evaluates the creation of multiple AI model instances at different RICs, leveraging information from some (or all) locations for their training. This brings about a flexible relationship between gNBs, the AI models used to control them, and the data such models are trained with. Experiments with real-world traces show how using multiple AI model instances that choose training data from specific locations improves on the performance of traditional approaches that follow the hoarding strategy.
    Battery Degradation Long-term Forecast Using Gaussian Process Dynamical Models and Knowledge Transfer. (arXiv:2212.01609v2 [cs.LG] UPDATED)
    Batteries play an essential role in the modern energy ecosystem and are widely used in daily applications such as cell phones and electric vehicles. For many applications, the health status of batteries plays a critical role in system performance by informing efficient maintenance and on-time replacement. Directly modeling an individual battery with a computational model based on physical rules can be inefficient, given the difficulty of building such a model and the computational effort of tuning and running it, especially on the edge. With the rapid development of sensor technology (to provide more insight into the system) and machine learning (to build capable yet fast models), it is now possible to build a data-driven model of battery health status directly from historical battery data (possibly both local and remote) and to accurately predict future local battery health status. Nevertheless, most data-driven methods are trained on local battery data and lack the ability to extract common properties, such as regeneration and degradation, from the life spans of other remote batteries. In this paper, we utilize a Gaussian process dynamical model (GPDM) to build a data-driven model of battery health status and propose a knowledge transfer method to extract common properties in the life span of all batteries to accurately predict the battery health status with and without features extracted from the local battery. For modern benchmark problems, the proposed method outperforms state-of-the-art methods by significant margins in accuracy and is able to accurately predict the regeneration process.
    Interaction models for remaining useful life estimation. (arXiv:2301.05029v1 [cs.LG])
    The paper deals with the problem of controlling the state of industrial devices according to the readings of their sensors. Current methods rely on a single approach to feature extraction, on which the prediction is based. We propose a technique for building a scalable model that combines multiple different feature extractor blocks. A new model based on sequential sensor space analysis achieves state-of-the-art results on the C-MAPSS benchmark for equipment remaining useful life estimation. The resulting model's performance was validated, including how the predictions change with scaling.
    Linking Neural Collapse and L2 Normalization with Improved Out-of-Distribution Detection in Deep Neural Networks. (arXiv:2209.08378v3 [cs.LG] UPDATED)
    We propose a simple modification to standard ResNet architectures--L2 normalization over feature space--that substantially improves out-of-distribution (OoD) performance on the previously proposed Deep Deterministic Uncertainty (DDU) benchmark. We show that this change also induces early Neural Collapse (NC), an effect linked to better OoD performance. Our method achieves comparable or superior OoD detection scores and classification accuracy in a small fraction of the training time of the benchmark. Additionally, it substantially improves worst case OoD performance over multiple, randomly initialized models. Though we do not suggest that NC is the sole mechanism or a comprehensive explanation for OoD behaviour in deep neural networks (DNN), we believe NC's simple mathematical and geometric structure can provide a framework for analysis of this complex phenomenon in future work.
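    A minimal sketch of L2 normalization of a feature vector is given below. The abstract does not say exactly where in the ResNet the normalization is applied, so treating it as a rescaling of a single feature vector before the classifier is an assumption, as are the function name and the `eps` guard against division by zero.

```python
import math
from typing import List, Sequence


def l2_normalize(features: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Rescale a feature vector to unit Euclidean length.

    `eps` (an assumption for this sketch) avoids dividing by zero
    when the input is the all-zeros vector.
    """
    norm = math.sqrt(sum(v * v for v in features))
    return [v / max(norm, eps) for v in features]


if __name__ == "__main__":
    print(l2_normalize([3.0, 4.0]))
```

    After this step every feature vector lies on the unit hypersphere, so only its direction, not its magnitude, can influence the downstream classifier.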
    Improvement of Computational Performance of Evolutionary AutoML in a Heterogeneous Environment. (arXiv:2301.05102v1 [cs.LG])
    Resource-intensive computations are a major factor that limits the effectiveness of automated machine learning solutions. In the paper, we propose a modular approach that can be used to increase the quality of evolutionary optimization for modelling pipelines with a graph-based structure. It consists of several stages - parallelization, caching and evaluation. Heterogeneous and remote resources can be involved in the evaluation stage. The conducted experiments confirm the correctness and effectiveness of the proposed approach. The implemented algorithms are available as a part of the open-source framework FEDOT.
    Progress measures for grokking via mechanistic interpretability. (arXiv:2301.05217v1 [cs.LG])
    Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
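    The idea of converting modular addition into rotation about a circle can be sketched directly. The code below is an illustration of the trigonometric identities involved, not the network's actual weights: the modulus `p` and the set of frequencies `freqs` are hypothetical choices, and the function name is invented for this example.

```python
import math
from typing import Sequence


def modular_add_via_rotation(a: int, b: int, p: int, freqs: Sequence[int]) -> int:
    """Compute (a + b) mod p by composing rotations instead of adding integers."""
    scores = []
    for c in range(p):
        total = 0.0
        for k in freqs:
            w = 2.0 * math.pi * k / p
            # Angle-addition identities give cos(w(a+b)) and sin(w(a+b))
            # from the separate embeddings of a and b.
            cos_ab = math.cos(w * a) * math.cos(w * b) - math.sin(w * a) * math.sin(w * b)
            sin_ab = math.sin(w * a) * math.cos(w * b) + math.cos(w * a) * math.sin(w * b)
            # cos(w(a+b-c)) equals 1 exactly when c = (a + b) mod p.
            total += cos_ab * math.cos(w * c) + sin_ab * math.sin(w * c)
        scores.append(total)
    # The candidate c with the largest score is the predicted sum.
    return max(range(p), key=scores.__getitem__)


if __name__ == "__main__":
    print(modular_add_via_rotation(100, 50, 113, freqs=(1, 5, 17)))
```

    Each frequency contributes a term that peaks only at the correct residue, so summing a handful of frequencies is enough to make the correct answer the unique maximum.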
    Recognition Models to Learn Dynamics from Partial Observations with Neural ODEs. (arXiv:2205.12550v3 [eess.SY] UPDATED)
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. Neural ordinary differential equations can be written as a flexible framework for system identification and can incorporate a broad spectrum of physical insight, giving physical interpretability to the resulting latent space. In the case of partial observations, however, the data points cannot directly be mapped to the latent state of the ODE. Hence, we propose to design recognition models, in particular inspired by nonlinear observer theory, to link the partial observations to the latent state. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.
    DROPO: Sim-to-Real Transfer with Offline Domain Randomization. (arXiv:2201.08434v2 [cs.RO] UPDATED)
    In recent years, domain randomization over dynamics parameters has gained a lot of traction as a method for sim-to-real transfer of reinforcement learning policies in robotic manipulation; however, finding optimal randomization distributions can be difficult. In this paper, we introduce DROPO, a novel method for estimating domain randomization distributions for safe sim-to-real transfer. Unlike prior work, DROPO only requires a limited, precollected offline dataset of trajectories, and explicitly models parameter uncertainty to match real data using a likelihood-based approach. We demonstrate that DROPO is capable of recovering dynamic parameter distributions in simulation and finding a distribution capable of compensating for an unmodeled phenomenon. We also evaluate the method in two zero-shot sim-to-real transfer scenarios, showing successful domain transfer and improved performance over prior methods.
    Estimate Deformation Capacity of Non-Ductile RC Shear Walls using Explainable Boosting Machine. (arXiv:2301.04652v1 [cs.LG])
    Machine learning is becoming increasingly prevalent for tackling challenges in earthquake engineering and providing fairly reliable and accurate predictions. However, it is mostly unclear how decisions are made because machine learning models are generally highly sophisticated, resulting in opaque black-box models. Machine learning models that are naturally interpretable and provide their own decision explanation, rather than relying on a separate explanatory method, reflect more accurately what the model actually computes. With this motivation, this study aims to develop a fully explainable machine learning model to predict the deformation capacity of non-ductile reinforced concrete shear walls based on experimental data collected worldwide. The proposed Explainable Boosting Machines (EBM)-based model is an interpretable, robust, naturally explainable glass-box model, yet provides high accuracy comparable to its black-box counterparts. The model enables the user to observe the relationship between the wall properties and the deformation capacity by quantifying the individual contribution of each wall property as well as the correlations among them. The mean coefficient of determination R2 and the mean ratio of predicted to actual value based on the test dataset are 0.92 and 1.05, respectively. The proposed predictive model stands out with its overall consistency with scientific knowledge, practicality, and interpretability without sacrificing high accuracy.
    Forgetful Active Learning with Switch Events: Efficient Sampling for Out-of-Distribution Data. (arXiv:2301.05106v1 [cs.LG])
    This paper considers deep out-of-distribution active learning. In practice, fully trained neural networks interact randomly with out-of-distribution (OOD) inputs and map aberrant samples randomly within the model representation space. Since data representations are direct manifestations of the training distribution, the data selection process plays a crucial role in outlier robustness. For paradigms such as active learning, this is especially challenging since protocols must not only improve performance on the training distribution most effectively but also render a robust representation space. However, existing strategies directly base the data selection on the data representation of the unlabeled data, which is random for OOD samples by definition. To address this, we introduce forgetful active learning with switch events (FALSE), a novel active learning protocol for out-of-distribution active learning. Instead of defining sample importance on the data representation directly, we formulate "informativeness" with learning difficulty during training. Specifically, we approximate how often the network "forgets" unlabeled samples and query the most "forgotten" samples for annotation. We report up to 4.5% accuracy improvements in over 270 experiments, including four commonly used protocols, two OOD benchmarks, one in-distribution benchmark, and three different architectures.
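    One way to read the selection rule above is sketched below. The abstract does not define a switch event precisely, so this example assumes it means a change in a sample's predicted label between consecutive training epochs; the function names, the `budget` parameter, and the tie-breaking by original index are all choices made for this illustration.

```python
from typing import List, Sequence


def count_switch_events(predictions_per_epoch: Sequence[Sequence[int]]) -> List[int]:
    """Count, per unlabeled sample, how often its predicted label changes
    between consecutive epochs (the assumed meaning of a switch event)."""
    n_samples = len(predictions_per_epoch[0])
    switches = [0] * n_samples
    for prev, curr in zip(predictions_per_epoch, predictions_per_epoch[1:]):
        for i in range(n_samples):
            if prev[i] != curr[i]:
                switches[i] += 1
    return switches


def select_queries(predictions_per_epoch: Sequence[Sequence[int]], budget: int) -> List[int]:
    """Return indices of the `budget` samples with the most switch events."""
    switches = count_switch_events(predictions_per_epoch)
    ranked = sorted(range(len(switches)), key=lambda i: switches[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Rows are epochs, columns are unlabeled samples (toy predictions).
    history = [[0, 1, 2], [0, 2, 2], [0, 1, 2], [1, 1, 2]]
    print(count_switch_events(history), select_queries(history, budget=2))
```

    Because the score depends only on how predictions evolve during training, it does not rely on the representation of the unlabeled data itself, which is the property the abstract emphasizes for OOD samples.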
    Masked Feature Prediction for Self-Supervised Visual Pre-Training. (arXiv:2112.09133v2 [cs.CV] UPDATED)
    We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame and obtains competitive results on ImageNet.
    Fast spline detection in high density microscopy data. (arXiv:2301.04460v1 [cs.CV] CROSS LISTED)
    Computer-aided analysis of biological microscopy data has seen a massive improvement with the utilization of general-purpose deep learning techniques. Yet, in microscopy studies of multi-organism systems, the problem of collision and overlap remains challenging. This is particularly true for systems composed of slender bodies such as crawling nematodes, swimming spermatozoa, or the beating of eukaryotic or prokaryotic flagella. Here, we develop a novel end-to-end deep learning approach to extract precise shape trajectories of generally motile and overlapping splines. Our method works in low resolution settings where feature keypoints are hard to define and detect. Detection is fast and we demonstrate the ability to track thousands of overlapping organisms simultaneously. While our approach is agnostic to area of application, we present it in the setting of and exemplify its usability on dense experiments of crawling Caenorhabditis elegans. The model training is achieved purely on synthetic data, utilizing a physics-based model for nematode motility, and we demonstrate the model's ability to generalize from simulations to experimental videos.
    Multi-Power Level $Q$-Learning Algorithm for Random Access in NOMA mMTC Systems. (arXiv:2301.05196v1 [cs.NI])
    The massive machine-type communications (mMTC) service will be part of new services planned to integrate the fifth generation of wireless communication (B5G). In mMTC, thousands of devices sporadically access available resource blocks on the network. In this scenario, the massive random access (RA) problem arises when two or more devices collide when selecting the same resource block. There are several techniques to deal with this problem. One of them deploys $Q$-learning (QL), in which devices store in their $Q$-table the rewards sent by the central node that indicate the quality of the transmission performed. The device learns the best resource blocks to select and transmit to avoid collisions. We propose a multi-power level QL (MPL-QL) algorithm that uses a non-orthogonal multiple access (NOMA) transmit scheme to generate transmission power diversity and accommodate more than one device in the same time-slot as long as the signal-to-interference-plus-noise ratio (SINR) exceeds a threshold value. The numerical results reveal that the best performance-complexity trade-off is obtained by using a higher number of power levels, typically eight levels. The proposed MPL-QL can deliver better throughput and lower latency compared to other recent QL-based algorithms found in the literature.
    ECSAS: Exploring Critical Scenarios from Action Sequence in Autonomous Driving. (arXiv:2209.10078v2 [cs.AI] UPDATED)
    Critical scenario generation requires the ability of sampling critical combinations from the infinite parameter space in the logic scenario. Existing solutions aim to explore the correlation of action parameters in the initial scenario rather than action sequences. How to model action sequences so that one can further consider the effects of different action parameters in the scenario is the bottleneck of the problem. In this paper, we attack the problem by proposing the ECSAS framework. Specifically, we first propose a description language, BTScenario, allowing us to model action sequences of the scenarios. We then use reinforcement learning to search for combinations of critical action parameters. To increase efficiency, we further propose several optimizations, including action masking and replay buffer. We have implemented ECSAS, and experimental results show that it is more efficient than native approaches such as random and combination testing in various nontrivial scenarios.
    Smart-Badge: A wearable badge with multi-modal sensors for kitchen activity recognition. (arXiv:2210.00888v2 [cs.LG] UPDATED)
    Human health is closely associated with their daily behavior and environment. However, keeping a healthy lifestyle is still challenging for most people as it is difficult to recognize their living behaviors and identify their surrounding situations to take appropriate action. Human activity recognition is a promising approach to building a behavior model of users, by which users can get feedback about their habits and be encouraged to develop a healthier lifestyle. In this paper, we present a smart light wearable badge with six kinds of sensors, including an infrared array sensor MLX90640 offering privacy-preserving, low-cost, and non-invasive features, to recognize daily activities in a realistic unmodified kitchen environment. A multi-channel convolutional neural network (MC-CNN) based on data and feature fusion methods is applied to classify 14 human activities associated with potentially unhealthy habits. Meanwhile, we evaluate the impact of the infrared array sensor on the recognition accuracy of these activities. We demonstrate the performance of the proposed work to detect the 14 activities performed by ten volunteers with an average accuracy of 92.44 % and an F1 score of 88.27 %.
    RAP: Risk-Aware Prediction for Robust Planning. (arXiv:2210.01368v2 [cs.LG] UPDATED)
    Robust planning in interactive scenarios requires predicting the uncertain future to make risk-aware decisions. Unfortunately, due to long-tail safety-critical events, the risk is often under-estimated by finite-sampling approximations of probabilistic motion forecasts. This can lead to overconfident and unsafe robot behavior, even with robust planners. Instead of assuming full prediction coverage that robust planners require, we propose to make prediction itself risk-aware. We introduce a new prediction objective to learn a risk-biased distribution over trajectories, so that risk evaluation simplifies to an expected cost estimation under this biased distribution. This reduces the sample complexity of the risk estimation during online planning, which is needed for safe real-time performance. Evaluation results in a didactic simulation environment and on a real-world dataset demonstrate the effectiveness of our approach. The code and a demo are available.
    Deep learning enhanced noise spectroscopy of a spin qubit environment. (arXiv:2301.05079v1 [quant-ph])
    The undesired interaction of a quantum system with its environment generally leads to a coherence decay of superposition states in time. A precise knowledge of the spectral content of the noise induced by the environment is crucial to protect qubit coherence and optimize its employment in quantum device applications. We experimentally show that the use of neural networks can highly increase the accuracy of noise spectroscopy, by reconstructing the power spectral density that characterizes an ensemble of carbon impurities around a nitrogen-vacancy (NV) center in diamond. Neural networks are trained over spin coherence functions of the NV center subjected to different Carr-Purcell sequences, typically used for dynamical decoupling (DD). As a result, we determine that deep learning models can be more accurate than standard DD noise-spectroscopy techniques, by requiring at the same time a much smaller number of DD sequences.
    Domain Expansion of Image Generators. (arXiv:2301.05225v1 [cs.CV])
    Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task - domain expansion - to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" directions, which do not affect the output. This provides an opportunity: By "repurposing" these directions, we can represent new domains without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several - even hundreds - of new domains! Using our expansion method, one "expanded" model can supersede numerous domain-specific models, without expanding the model size. Additionally, a single expanded generator natively supports smooth transitions between domains, as well as composition of domains. Code and project page available at https://yotamnitzan.github.io/domain-expansion/.
    Benign Underfitting of Stochastic Gradient Descent. (arXiv:2202.13361v4 [cs.LG] UPDATED)
    We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex optimization framework, where (one pass, without-replacement) SGD is classically known to minimize the population risk at rate $O(1/\sqrt n)$, and prove that, surprisingly, there exist problem instances where the SGD solution exhibits both empirical risk and generalization gap of $\Omega(1)$. Consequently, it turns out that SGD is not algorithmically stable in any sense, and its generalization ability cannot be explained by uniform convergence or any other currently known generalization bound technique for that matter (other than that of its classical analysis). We then continue to analyze the closely related with-replacement SGD, for which we show that an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate. Finally, we interpret our main results in the context of without-replacement SGD for finite-sum convex optimization problems, and derive upper and lower bounds for the multi-epoch regime that significantly improve upon previously known results.
    Kinematic Evidence of an Embedded Protoplanet in HD 142666 Identified by Machine Learning. (arXiv:2301.05075v1 [astro-ph.EP])
    Observations of protoplanetary discs have shown that forming exoplanets leave characteristic imprints on the gas and dust of the disc. In the gas, these forming exoplanets cause deviations from Keplerian motion, which can be detected through molecular line observations. Our previous work has shown that machine learning can correctly determine if a planet is present in these discs. Using our machine learning models, we identify strong, localized non-Keplerian motion within the disc HD 142666. Subsequent hydrodynamics simulations of a system with a 5 Jupiter-mass planet at 75 au recreate the kinematic structure. By currently established standards in the field, we conclude that HD 142666 hosts a planet. This work represents a first step towards using machine learning to identify previously overlooked non-Keplerian features in protoplanetary discs.
    Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning. (arXiv:2301.05169v1 [cs.LG])
    Recent years have seen a surge of interest in learning high-level causal representations from low-level image pairs under interventions. Yet, existing efforts are largely limited to simple synthetic settings that are far away from real-world problems. In this paper, we present Causal Triplet, a causal representation learning benchmark featuring not only visually more complex scenes, but also two crucial desiderata commonly overlooked in previous works: (i) an actionable counterfactual setting, where only certain object-level variables allow for counterfactual observations whereas others do not; (ii) an interventional downstream task with an emphasis on out-of-distribution robustness from the independent causal mechanisms principle. Through extensive experiments, we find that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts. However, recent causal representation learning methods still struggle to identify such latent structures, indicating substantial challenges and opportunities for future work. Our code and datasets will be available at https://sites.google.com/view/causaltriplet.
    Meta-Query-Net: Resolving Purity-Informativeness Dilemma in Open-set Active Learning. (arXiv:2210.07805v3 [cs.LG] UPDATED)
    Unlabeled data examples awaiting annotations inevitably contain open-set noise. A few active learning studies have attempted to deal with this open-set noise for sample selection by filtering out the noisy examples. However, because focusing on the purity of examples in a query set leads to overlooking the informativeness of the examples, the best balancing of purity and informativeness remains an important question. In this paper, to solve this purity-informativeness dilemma in open-set active learning, we propose a novel Meta-Query-Net (MQ-Net) that adaptively finds the best balancing between the two factors. Specifically, by leveraging the multi-round property of active learning, we train MQ-Net using a query set without an additional validation set. Furthermore, a clear dominance relationship between unlabeled examples is effectively captured by MQ-Net through a novel skyline regularization. Extensive experiments on multiple open-set active learning scenarios demonstrate that the proposed MQ-Net achieves 20.14% improvement in terms of accuracy, compared with the state-of-the-art methods.
    SemPPL: Predicting pseudo-labels for better contrastive representations. (arXiv:2301.05158v1 [cs.CV])
    Learning from large amounts of unsupervised data and a small amount of supervision is an important open problem in computer vision. We propose a new semi-supervised learning method, Semantic Positives via Pseudo-Labels (SemPPL), that combines labelled and unlabelled data to learn informative representations. Our method extends self-supervised contrastive learning -- where representations are shaped by distinguishing whether two samples represent the same underlying datum (positives) or not (negatives) -- with a novel approach to selecting positives. To enrich the set of positives, we leverage the few existing ground-truth labels to predict the missing ones through a $k$-nearest neighbours classifier by using the learned embeddings of the labelled data. We thus extend the set of positives with datapoints having the same pseudo-label and call these semantic positives. We jointly learn the representation and predict bootstrapped pseudo-labels. This creates a reinforcing cycle. Strong initial representations enable better pseudo-label predictions which then improve the selection of semantic positives and lead to even better representations. SemPPL outperforms competing semi-supervised methods setting new state-of-the-art performance of $68.5\%$ and $76\%$ top-$1$ accuracy when using a ResNet-$50$ and training on $1\%$ and $10\%$ of labels on ImageNet, respectively. Furthermore, when using selective kernels, SemPPL significantly outperforms previous state-of-the-art achieving $72.3\%$ and $78.3\%$ top-$1$ accuracy on ImageNet with $1\%$ and $10\%$ labels, respectively, which improves absolute $+7.8\%$ and $+6.2\%$ over previous work. SemPPL also exhibits state-of-the-art performance over larger ResNet models as well as strong robustness, out-of-distribution and transfer performance.
    Explicit Context Integrated Recurrent Neural Network for Sensor Data Applications. (arXiv:2301.05031v1 [cs.LG])
    The development and progress in sensor, communication and computing technologies have led to data rich environments. In such environments, data can easily be acquired not only from the monitored entities but also from the surroundings where the entity is operating. The additional data that are available from the problem domain, which cannot be used independently for learning models, constitute context. Such context, if taken into account while learning, can potentially improve the performance of predictive models. Typically, the data from various sensors are present in the form of time series. Recurrent Neural Networks (RNNs) are preferred for such data as they can inherently handle temporal context. However, the conventional RNN models such as Elman RNN, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) in their present form do not provide any mechanism to integrate explicit contexts. In this paper, we propose a Context Integrated RNN (CiRNN) that enables integrating explicit contexts represented in the form of contextual features. In CiRNN, the network weights are influenced by contextual features in such a way that the primary input features which are more relevant to a given context are given more importance. To show the efficacy of CiRNN, we selected an application domain, engine health prognostics, which captures data from various sensors and where contextual information is available. We used the NASA Turbofan Engine Degradation Simulation dataset for estimating Remaining Useful Life (RUL) as it provides contextual information. We compared CiRNN with baseline models as well as the state-of-the-art methods. The experimental results show an improvement of 39% and 87% respectively, over state-of-the-art models, when performance is measured with RMSE and score from an asymmetric scoring function. The latter measure is specific to the task of RUL estimation.
    Automated Sleep Staging via Parallel Frequency-Cut Attention. (arXiv:2204.03173v3 [cs.LG] UPDATED)
    This paper proposes a novel framework for automatically capturing the time-frequency nature of electroencephalogram (EEG) signals of human sleep based on the authoritative sleep medicine guidance. The framework consists of two parts: the first part extracts informative features by partitioning the input EEG spectrograms into a sequence of time-frequency patches. The second part is constituted by an attention-based architecture to efficiently search for the correlation between partitioned time-frequency patches and defining factors of sleep stages in parallel. The proposed pipeline is validated on the Sleep Heart Health Study dataset with new state-of-the-art results for the stages wake, N2, and N3, obtaining respective F1 scores of 0.93, 0.88, and 0.87, with only EEG signals used. The proposed method also has a high inter-rater reliability of 0.80 kappa. We also visualize the correspondence between sleep staging decisions and features extracted by the proposed method, providing strong interpretability for our model.
    Asynchronous training of quantum reinforcement learning. (arXiv:2301.05096v1 [quant-ph])
    The development of quantum machine learning (QML) has received a lot of interest recently thanks to developments in both quantum computing (QC) and machine learning (ML). One of the ML paradigms that can be utilized to address challenging sequential decision-making issues is reinforcement learning (RL). It has been demonstrated that classical RL can successfully complete many difficult tasks. A leading method of building quantum RL agents relies on variational quantum circuits (VQCs). However, training QRL algorithms with VQCs requires a significant amount of computational resources. This issue hinders the exploration of various QRL applications. In this paper, we approach this challenge through asynchronous training of QRL agents. Specifically, we choose the asynchronous training of advantage actor-critic variational quantum policies. We demonstrate via numerical simulations that, within the tasks considered, the asynchronous training of QRL agents can reach performance comparable or superior to classical agents with similar model sizes and architectures.
    Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential. (arXiv:2207.11592v5 [physics.chem-ph] UPDATED)
    Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
    An overview of open source Deep Learning-based libraries for Neuroscience. (arXiv:2301.05057v1 [q-bio.QM])
    In recent years, deep learning revolutionized machine learning and its applications, producing results comparable to human experts in several domains, including neuroscience. Each year, hundreds of scientific publications present applications of deep neural networks for biomedical data analysis. Due to the fast growth of the domain, it could be a complicated and extremely time-consuming task for worldwide researchers to have a clear perspective of the most recent and advanced software libraries. This work helps clarify the current situation in the domain, outlining the most useful libraries that implement and facilitate deep learning application to neuroscience, allowing scientists to identify the most suitable options for their research or clinical projects. This paper summarizes the main developments in Deep Learning and their relevance to Neuroscience; it then reviews neuroinformatic toolboxes and libraries, collected from the literature and from specific hubs of software projects oriented to neuroscience research. The selected tools are presented in tables detailing key features grouped by domain of application (e.g. data type, neuroscience area, task), model engineering (e.g. programming language, model customization) and technological aspect (e.g. interface, code source). The results show that, among a high number of available software tools, several libraries stand out in terms of functionalities for neuroscience applications. The aggregation and discussion of this information can help the neuroscience community to develop their research projects more efficiently and quickly, both by means of readily available tools, and by knowing which modules may be improved, connected or added.
    Toward Theoretical Guidance for Two Common Questions in Practical Cross-Validation based Hyperparameter Selection. (arXiv:2301.05131v1 [cs.LG])
    We show, to our knowledge, the first theoretical treatments of two common questions in cross-validation based hyperparameter selection: (1) After selecting the best hyperparameter using a held-out set, we train the final model using {\em all} of the training data -- since this may or may not improve future generalization error, should one do this? (2) During optimization such as via SGD (stochastic gradient descent), we must set the optimization tolerance $\rho$ -- since it trades off predictive accuracy with computation cost, how should one set it? Toward these problems, we introduce the {\em hold-in risk} (the error due to not using the whole training data), and the {\em model class mis-specification risk} (the error due to having chosen the wrong model class) in a theoretical view which is simple, general, and suggests heuristics that can be used when faced with a dataset instance. In proof-of-concept studies in synthetic data where theoretical quantities can be controlled, we show that these heuristics can, respectively, (1) always perform at least as well as always performing retraining or never performing retraining, (2) either improve performance or reduce computational overhead by $2\times$ with no loss in predictive performance.
    Efficient Ridge Solution for the Incremental Broad Learning System on Added Nodes by Inverse Cholesky Factorization of a Partitioned Matrix. (arXiv:1911.04872v4 [cs.LG] UPDATED)
    To accelerate the existing Broad Learning System (BLS) for newly added nodes in [7], we extend the inverse Cholesky factorization in [10] to deduce an efficient inverse Cholesky factorization for a Hermitian matrix partitioned into $2 \times 2$ blocks, which is utilized to develop the proposed BLS algorithm 1. The proposed BLS algorithm 1 computes the ridge solution (i.e., the output weights) from the inverse Cholesky factor of the Hermitian matrix in the ridge inverse, and updates the inverse Cholesky factor efficiently. From the proposed BLS algorithm 1, we deduce the proposed ridge inverse, which can be obtained from the generalized inverse in [7] by just changing one matrix in the equation to compute the newly added sub-matrix. We also modify the proposed algorithm 1 into the proposed algorithm 2, which is equivalent to the existing BLS algorithm [7] in terms of numerical computations. The proposed algorithms 1 and 2 can reduce the computational complexity, since usually the Hermitian matrix in the ridge inverse is smaller than the ridge inverse. With respect to the existing BLS algorithm, the proposed algorithms 1 and 2 usually require about 1/3 and 2/3 of the complexity, respectively, while in numerical experiments they achieve speedups (in each additional training time) of 2.40 - 2.91 and 1.36 - 1.60, respectively. Numerical experiments also show that the proposed algorithm 1 and the standard ridge solution always bear the same testing accuracy, and usually so do the proposed algorithm 2 and the existing BLS algorithm. The existing BLS assumes the ridge parameter $\lambda \to 0$, since it is based on the generalized inverse with the ridge regression approximation. When the assumption of $\lambda \to 0$ is not satisfied, the standard ridge solution obviously achieves a better testing accuracy than the existing BLS algorithm in numerical experiments.
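    For orientation, the "standard ridge solution" that the abstract above compares against is $W = (A^{T}A + \lambda I)^{-1}A^{T}y$. The sketch below computes it with an ordinary Cholesky factorization and two triangular solves. It is only a baseline illustration for real-valued data and a single output column; it does not implement the paper's inverse Cholesky factor or its incremental update for added nodes, and the function names are hypothetical.

```python
import math


def cholesky(m):
    """Lower-triangular L with L L^T = m, for a symmetric positive definite m."""
    n = len(m)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(m[i][i] - s)
            else:
                L[i][j] = (m[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def ridge_solution(A, y, lam):
    """Output weights W = (A^T A + lam * I)^{-1} A^T y."""
    d = len(A[0])
    gram = [
        [sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
         for j in range(d)]
        for i in range(d)
    ]
    rhs = [sum(row[i] * t for row, t in zip(A, y)) for i in range(d)]
    return solve_cholesky(cholesky(gram), rhs)


# Toy data (hypothetical): 3 samples, 2 features, ridge parameter 1.
weights = ridge_solution([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0], lam=1.0)
```

    For this toy problem the regularized Gram matrix is [[3, 1], [1, 3]] and the right-hand side is [4, 5], so the weights come out as 7/8 and 11/8.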
    A Stochastic Proximal Polyak Step Size. (arXiv:2301.04935v1 [math.OC])
    Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that includes the non-smooth, smooth, weakly convex and strongly convex setting.
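    As background for the abstract above, the plain (non-proximal) stochastic Polyak step size is commonly written as $\gamma_t = \min\{(f_i(x_t) - f_i^{*}) / (c\,\|\nabla f_i(x_t)\|^2),\ \gamma_{\max}\}$, where $f_i^{*}$ is a lower bound on the sampled loss. The sketch below shows that rule on a one-dimensional least-squares problem. It is not ProxSPS and handles no regularizer; the constants `c=0.5` and `gamma_max=1.0`, the lower bound of 0, and all function names are assumptions made for illustration.

```python
def sps_step_size(loss_value, grad, loss_lower_bound=0.0, c=0.5, gamma_max=1.0):
    """Capped stochastic Polyak step size for one sampled loss."""
    grad_sq_norm = sum(g * g for g in grad)
    if grad_sq_norm == 0.0:
        return 0.0  # the sampled loss is already stationary; take no step
    step = (loss_value - loss_lower_bound) / (c * grad_sq_norm)
    return min(step, gamma_max)


def sgd_with_sps(samples, w, epochs=50):
    """Fit y ~ w * x by SGD, choosing each step with sps_step_size."""
    for _ in range(epochs):
        for x, y in samples:
            residual = w * x - y
            loss = 0.5 * residual ** 2      # sampled loss, bounded below by 0
            grad = [residual * x]           # gradient of the loss w.r.t. w
            gamma = sps_step_size(loss, grad)
            w -= gamma * grad[0]
    return w


# Noise-free toy data generated by y = 2 * x (hypothetical).
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_fit = sgd_with_sps(samples, w=0.0)
```

    The rule needs only the loss value, its gradient, and a lower bound on the loss, which is why the abstract stresses that a lower bound for the loss alone is easier to obtain than one for loss plus regularizer.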
    Counterfactual Explanations for Concepts in $\mathcal{ELH}$. (arXiv:2301.05109v1 [cs.AI])
    Knowledge bases are widely used for information management on the web, enabling high-impact applications such as web search, question answering, and natural language processing. They also serve as the backbone for automatic decision systems, e.g. for medical diagnostics and credit scoring. As stakeholders affected by these decisions would like to understand their situation and verify fair decisions, a number of explanation approaches have been proposed using concepts in description logics. However, the learned concepts can become long and difficult to fathom for non-experts, even when verbalized. Moreover, long concepts do not immediately provide a clear path of action to change one's situation. Counterfactuals answering the question "How must feature values be changed to obtain a different classification?" have been proposed as short, human-friendly explanations for tabular data. In this paper, we transfer the notion of counterfactuals to description logics and propose the first algorithm for generating counterfactual explanations in the description logic $\mathcal{ELH}$. Counterfactual candidates are generated from concepts and the candidates with fewest feature changes are selected as counterfactuals. In case of multiple counterfactuals, we rank them according to the likeliness of their feature combinations. For evaluation, we conduct a user survey to investigate which of the generated counterfactual candidates are preferred for explanation by participants. In a second study, we explore possible use cases for counterfactual explanations.
    Diffusion-based Data Augmentation for Skin Disease Classification: Impact Across Original Medical Datasets to Fully Synthetic Images. (arXiv:2301.04802v1 [cs.LG])
    Despite continued advancement in recent years, deep neural networks still rely on large amounts of training data to avoid overfitting. However, labeled training data for real-world applications such as healthcare is limited and difficult to access given longstanding privacy concerns and strict data sharing policies. By manipulating image datasets in the pixel or feature space, existing data augmentation techniques represent one of the effective ways to improve the quantity and diversity of training data. Here, we look to advance augmentation techniques by building upon the emerging success of text-to-image diffusion probabilistic models in augmenting the training samples of our macroscopic skin disease dataset. We do so by enabling fine-grained control of the image generation process via input text prompts. We demonstrate that this generative data augmentation approach successfully maintains a similar classification accuracy of the visual classifier even when trained on a fully synthetic skin disease dataset. Similar to recent applications of generative models, our study suggests that diffusion models are indeed effective in generating high-quality skin images that do not sacrifice the classifier performance, and can improve the augmentation of training datasets after curation.
    Multimodal Deep Learning. (arXiv:2301.04856v1 [cs.CL])
    This book is the result of a seminar in which we reviewed multimodal approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other, as well as models in which one modality is utilized to enhance representation learning for the other. To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced. Finally, we also cover other modalities as well as general-purpose multi-modal models, which are able to handle different tasks on different modalities within one unified architecture. One interesting application (Generative Art) eventually caps off this booklet.
    Sparse Coding in a Dual Memory System for Lifelong Learning. (arXiv:2301.05058v1 [cs.NE])
    Efficient continual learning in humans is enabled by a rich set of neurophysiological mechanisms and interactions between multiple memory systems. The brain efficiently encodes information in non-overlapping sparse codes, which facilitates the learning of new associations faster with controlled interference with previous associations. To mimic sparse coding in DNNs, we enforce activation sparsity along with a dropout mechanism which encourages the model to activate similar units for semantically similar inputs and have less overlap with activation patterns of semantically dissimilar inputs. This provides us with an efficient mechanism for balancing the reusability and interference of features, depending on the similarity of classes across tasks. Furthermore, we employ sparse coding in a multiple-memory replay mechanism. Our method maintains an additional long-term semantic memory that aggregates and consolidates information encoded in the synaptic weights of the working model. Our extensive evaluation and characteristics analysis show that equipped with these biologically inspired mechanisms, the model can further mitigate forgetting.  ( 2 min )
    Learning Partial Differential Equations by Spectral Approximates of General Sobolev Spaces. (arXiv:2301.04887v1 [math.NA])
    We introduce a novel spectral, finite-dimensional approximation of general Sobolev spaces in terms of Chebyshev polynomials. Based on this polynomial surrogate model (PSM), we realise a variational formulation, solving a vast class of linear and non-linear partial differential equations (PDEs). The PSMs are as flexible as the physics-informed neural nets (PINNs) and provide an alternative for addressing inverse PDE problems, such as PDE-parameter inference. In contrast to PINNs, the PSMs result in a convex optimisation problem for a vast class of PDEs, including all linear ones, in which case the PSM-approximate is efficiently computable due to the exponential convergence rate of the underlying variational gradient descent. As a practical consequence prominent PDE problems were resolved by the PSMs without High Performance Computing (HPC) on a local machine. This gain in efficiency is complemented by an increase of approximation power, outperforming PINN alternatives in both accuracy and runtime. Beyond the empirical evidence we give here, the translation of classic PDE theory in terms of the Sobolev space approximates suggests the PSMs to be universally applicable to well-posed, regular forward and inverse PDE problems.  ( 2 min )
    Low PAPR MIMO-OFDM Design Based on Convolutional Autoencoder. (arXiv:2301.05017v1 [eess.SP])
    An enhanced framework for peak-to-average power ratio ($\mathsf{PAPR}$) reduction and waveform design for Multiple-Input-Multiple-Output ($\mathsf{MIMO}$) orthogonal frequency-division multiplexing ($\mathsf{OFDM}$) systems, based on a convolutional-autoencoder ($\mathsf{CAE}$) architecture, is presented. The end-to-end learning-based autoencoder ($\mathsf{AE}$) for communication networks represents the network by an encoder and decoder, where in between, the learned latent representation goes through a physical communication channel. We introduce a joint learning scheme based on projected gradient descent iteration to optimize the spectral mask behavior and MIMO detection under the influence of a non-linear high power amplifier ($\mathsf{HPA}$) and a multipath fading channel. The proposed waveform design technique is efficient to implement and uses only a single $\mathsf{PAPR}$ reduction block for all antennas. It is throughput-lossless, as no side information is required at the decoder. Performance is analyzed by examining the bit error rate ($\mathsf{BER}$), the $\mathsf{PAPR}$, and the spectral response and compared with classical $\mathsf{PAPR}$ reduction $\mathsf{MIMO}$ detector methods on 5G simulated data. The suggested system exhibits competitive performance when considering all optimization criteria simultaneously. We apply gradual loss learning for multi-objective optimization and show empirically that a single trained model covers the tasks of $\mathsf{PAPR}$ reduction, spectrum design, and $\mathsf{MIMO}$ detection together over a wide range of SNR levels.  ( 2 min )
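    For readers unfamiliar with the quantity being reduced, the sketch below computes the PAPR of a block of complex samples as peak power divided by mean power. It is only an illustration of the standard definition, not the paper's autoencoder; the function names `papr`, `papr_db`, and `idft` are hypothetical, and the naive inverse DFT stands in for a real OFDM modulator.

```python
import cmath
import math

def papr(samples):
    """Peak-to-average power ratio (linear scale) of complex samples."""
    powers = [abs(s) ** 2 for s in samples]
    return max(powers) / (sum(powers) / len(powers))

def papr_db(samples):
    """PAPR expressed in decibels."""
    return 10.0 * math.log10(papr(samples))

def idft(freq):
    """Naive inverse DFT: maps subcarrier symbols to time-domain samples."""
    n = len(freq)
    return [
        sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

# Identical symbols on all 8 subcarriers add coherently into a single peak.
worst_case = idft([1.0] * 8)
# A single active subcarrier yields a constant-envelope signal.
single_tone = idft([0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

    In this toy setting the all-equal symbol block reaches the maximum PAPR for 8 subcarriers, while the single tone has a flat envelope, which is why multicarrier waveforms stress a non-linear power amplifier.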
    ChatGPT is not all you need. A State of the Art Review of large Generative AI models. (arXiv:2301.04655v1 [cs.LG])
    During the last two years, a plethora of large generative models, such as ChatGPT or Stable Diffusion, have been published. Concretely, these models are able to perform tasks such as acting as a general question-answering system or automatically creating artistic images, and they are revolutionizing several sectors. Consequently, the implications that these generative models have for industry and society are enormous, as several job positions may be transformed. For example, Generative AI is capable of effectively and creatively transforming text to images, like the DALLE-2 model; text to 3D images, like the Dreamfusion model; images to text, like the Flamingo model; text to video, like the Phenaki model; text to audio, like the AudioLM model; text to other text, like ChatGPT; text to code, like the Codex model; text to scientific text, like the Galactica model; or even creating algorithms, like AlphaTensor. This work attempts to describe concisely the main models and sectors that are affected by generative AI and to provide a taxonomy of the main generative models published recently.  ( 2 min )
    Safe Policy Improvement for POMDPs via Finite-State Controllers. (arXiv:2301.04939v1 [cs.AI])
    We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.  ( 2 min )
    Thompson Sampling with Diffusion Generative Prior. (arXiv:2301.05182v1 [cs.LG])
    In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision making problems. Our special focus is on the meta-learning for bandit framework, with the goal of learning a strategy that performs well across bandit tasks of a same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with the learned prior to deal with new tasks at test time. Our posterior sampling algorithm is designed to carefully balance between the learned prior and the noisy observations that come from the learner's interaction with the environment. To capture realistic bandit scenarios, we also propose a novel diffusion model training procedure that trains even from incomplete and/or noisy data, which could be of independent interest. Finally, our extensive experimental evaluations clearly demonstrate the potential of the proposed approach.  ( 2 min )
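    As background for the posterior-sampling loop this abstract builds on, here is a minimal sketch of classical Thompson sampling for Bernoulli bandits. It deliberately swaps the learned diffusion prior for a uniform Beta(1, 1) prior, so it illustrates only the sampling loop and not the paper's method; the function name and arguments are hypothetical.

```python
import random

def thompson_sampling(success_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling; returns the pull count per arm."""
    rng = random.Random(seed)
    k = len(success_probs)
    alpha = [1.0] * k  # 1 + observed successes (uniform prior, an assumption)
    beta = [1.0] * k   # 1 + observed failures
    pulls = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible mean reward per arm from its posterior.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        # Simulated environment: Bernoulli reward with the arm's true rate.
        reward = 1 if rng.random() < success_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.05, 0.95], n_rounds=1000)
```

    A learned prior would replace the fixed `alpha` and `beta` initialization, which is where information shared across tasks of the same class would enter.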
    Manifold Fitting under Unbounded Noise. (arXiv:1909.10228v2 [stat.ML] UPDATED)
    There has been an emerging trend in non-Euclidean statistical analysis of aiming to recover a low dimensional structure, namely a manifold, underlying the high dimensional data. Recovering the manifold requires the noise to be of certain concentration. Existing methods address this problem by constructing an approximated manifold based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, either the samples are noiseless or the noise is bounded. However, if the noise is unbounded, which is a common scenario, the tangent space estimation at the noisy samples will be blurred. Fitting a manifold from the blurred tangent space might increase the inaccuracy. In this paper, we introduce a new manifold-fitting method, by which the output manifold is constructed by directly estimating the tangent spaces at the projected points on the underlying manifold, rather than at the sample points, to decrease the error caused by the noise. Assuming the noise is unbounded, our new method provides theoretical convergence in high probability, in terms of the upper bound of the distance between the estimated and underlying manifold. The smoothness of the estimated manifold is also evaluated by bounding the supremum of twice difference above. Numerical simulations are provided to validate our theoretical findings and demonstrate the advantages of our method over other relevant manifold fitting methods. Finally, our method is applied to real data examples.  ( 2 min )
    ViTs for SITS: Vision Transformers for Satellite Image Time Series. (arXiv:2301.04944v1 [cs.CV])
    In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue that, in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing, and we present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin in three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation code is made publicly available to facilitate further research.  ( 2 min )
    Improving Inference Performance of Machine Learning with the Divide-and-Conquer Principle. (arXiv:2301.05099v1 [cs.LG])
    Many popular machine learning models scale poorly when deployed on CPUs. In this paper we explore the reasons why and propose a simple, yet effective approach based on the well-known Divide-and-Conquer Principle to tackle this problem of great practical importance. Given an inference job, instead of using all available computing resources (i.e., CPU cores) for running it, the idea is to break the job into independent parts that can be executed in parallel, each with the number of cores according to its expected computational cost. We implement this idea in the popular OnnxRuntime framework and evaluate its effectiveness with several use cases, including the well-known models for optical character recognition (PaddleOCR) and natural language processing (BERT).  ( 2 min )
    Phase-shifted Adversarial Training. (arXiv:2301.04785v1 [cs.LG])
    Adversarial training has been considered an imperative component for safely deploying neural network-based applications to the real world. To achieve stronger robustness, existing methods primarily focus on how to generate strong attacks by increasing the number of update steps, regularizing the models with a smoothed loss function, and injecting randomness into the attack. Instead, we analyze the behavior of adversarial training through the lens of response frequency. We empirically discover that adversarial training causes neural networks to converge slowly on high-frequency information, resulting in highly oscillating predictions near each data point. To learn high-frequency content efficiently and effectively, we first prove that a universal phenomenon, the frequency principle (i.e., "lower frequencies are learned first"), still holds in adversarial training. Based on that, we propose phase-shifted adversarial training (PhaseAT), in which the model learns high-frequency components by shifting these frequencies to the low-frequency range where fast convergence occurs. For evaluation, we conduct experiments on CIFAR-10 and ImageNet with an adaptive attack carefully designed for reliable evaluation. Comprehensive results show that PhaseAT significantly improves the convergence for high-frequency information. This results in improved adversarial robustness by enabling the model to have smoothed predictions near each data point.  ( 2 min )
    Private estimation algorithms for stochastic block models and mixture models. (arXiv:2301.04822v1 [cs.DS])
    We introduce general tools for designing efficient private estimation algorithms, in the high-dimensional settings, whose statistical guarantees almost match those of the best known non-private algorithms. To illustrate our techniques, we consider two problems: recovery of stochastic block models and learning mixtures of spherical Gaussians. For the former, we present the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery. Previously known algorithms achieving comparable guarantees required quasi-polynomial time. For the latter, we design an $(\epsilon, \delta)$-differentially private algorithm that recovers the centers of the $k$-mixture when the minimum separation is at least $ O(k^{1/t}\sqrt{t})$. For all choices of $t$, this algorithm requires sample complexity $n\geq k^{O(1)}d^{O(t)}$ and time complexity $(nd)^{O(t)}$. Prior work required minimum separation at least $O(\sqrt{k})$ as well as an explicit upper bound on the Euclidean norm of the centers.  ( 2 min )
    Machine learning methods for prediction of breakthrough curves in reactive porous media. (arXiv:2301.04998v1 [physics.flu-dyn])
    Reactive flows in porous media play an important role in our life and are crucial for many industrial, environmental and biomedical applications. Very often the concentration of the species at the inlet is known, and the so-called breakthrough curves, measured at the outlet, are the quantities which could be measured or computed numerically. The measurements and the simulations could be time-consuming and expensive, and machine learning and Big Data approaches can help to predict breakthrough curves at lower costs. Machine learning (ML) methods, such as Gaussian processes and fully-connected neural networks, and a tensor method, cross approximation, are well suited for predicting breakthrough curves. In this paper, we demonstrate their performance in the case of pore scale reactive flow in catalytic filters.  ( 2 min )
    Variational Inference: Posterior Threshold Improves Network Clustering Accuracy in Sparse Regimes. (arXiv:2301.04771v1 [stat.ML])
    Variational inference has been widely used in machine learning literature to fit various Bayesian models. In network analysis, this method has been successfully applied to solve the community detection problems. Although these results are promising, their theoretical support is only for relatively dense networks, an assumption that may not hold for real networks. In addition, it has been shown recently that the variational loss surface has many saddle points, which may severely affect its performance, especially when applied to sparse networks. This paper proposes a simple way to improve the variational inference method by hard thresholding the posterior of the community assignment after each iteration. Using a random initialization that correlates with the true community assignment, we show that the proposed method converges and can accurately recover the true community labels, even when the average node degree of the network is bounded. Extensive numerical study further confirms the advantage of the proposed method over the classical variational inference and another state-of-the-art algorithm.  ( 2 min )
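    The thresholding step described above can be pictured with the short sketch below, which rounds each node's soft community posterior to a one-hot assignment on its most probable community. The abstract does not spell out the exact rule, so the argmax rounding, the tie-breaking toward the lowest index, and the name `hard_threshold` are assumptions made for illustration.

```python
def hard_threshold(posterior):
    """Round each row of soft community probabilities to a one-hot vector.

    posterior: list of rows, one per node, each a list of probabilities
    over communities. Ties go to the lowest community index (an assumption).
    """
    rounded = []
    for row in posterior:
        best = max(range(len(row)), key=row.__getitem__)
        rounded.append([1.0 if k == best else 0.0 for k in range(len(row))])
    return rounded

soft = [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]
hard = hard_threshold(soft)
```

    In an iterative scheme, a step like this would be applied after each variational update, so the next update starts from crisp labels rather than diffuse probabilities.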
    Self-Attention Amortized Distributional Projection Optimization for Sliced Wasserstein Point-Cloud Reconstruction. (arXiv:2301.04791v1 [stat.ML])
    Max sliced Wasserstein (Max-SW) distance has been widely known as a solution for redundant projections of sliced Wasserstein (SW) distance. In applications that have various independent pairs of probability measures, amortized projection optimization is utilized to predict the ``max" projecting directions given two input measures instead of using projected gradient ascent multiple times. Despite being efficient, the first issue of the current framework is the violation of permutation invariance property and symmetry property. To address the issue, we propose to design amortized models based on self-attention architecture. Moreover, we adopt efficient self-attention architectures to make the computation linear in the number of supports. Secondly, Max-SW and its amortized version cannot guarantee metricity property due to the sub-optimality of the projected gradient ascent and the amortization gap. Therefore, we propose to replace Max-SW with distributional sliced Wasserstein distance with von Mises-Fisher (vMF) projecting distribution (v-DSW). Since v-DSW is a metric with any non-degenerate vMF distribution, its amortized version can guarantee the metricity when predicting the best discriminate projecting distribution. With the two improvements, we derive self-attention amortized distributional projection optimization and show its appealing performance in point-cloud reconstruction and its downstream applications.  ( 2 min )
    Inverse Quantum Fourier Transform Inspired Algorithm for Unsupervised Image Segmentation. (arXiv:2301.04705v1 [cs.CV])
    Image segmentation is a very popular and important task in computer vision. In this paper, the inverse quantum Fourier transform (IQFT) is explored for image segmentation, and a novel IQFT-inspired algorithm is proposed and implemented by leveraging the underlying mathematical structure of the IQFT. Specifically, the proposed method takes advantage of the phase information of the pixels in the image by encoding the pixels' intensity into qubit relative phases and applying IQFT to classify the pixels into different segments automatically and efficiently. To the best of our knowledge, this is the first attempt to use IQFT for unsupervised image segmentation. The proposed method has low computational cost compared to deep learning-based methods and, more importantly, it does not require training, making it suitable for real-time applications. The performance of the proposed method is compared with K-means and Otsu-thresholding. The proposed method outperforms both of them on the PASCAL VOC 2012 segmentation benchmark and the xVIEW2 challenge dataset by as much as 50% in terms of mean Intersection-Over-Union (mIOU).  ( 2 min )
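    Since the comparison above is reported in mean Intersection-over-Union, a minimal sketch of that metric on flattened label maps follows. It assumes predicted segment ids have already been matched to ground-truth class ids, which an unsupervised method would need to do separately; the function name and the choice to skip classes absent from both maps are illustrative assumptions.

```python
def mean_iou(pred, target, n_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(n_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)

# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
```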
    KAER: A Knowledge Augmented Pre-Trained Language Model for Entity Resolution. (arXiv:2301.04770v1 [cs.CL])
    Entity resolution has been an essential and well-studied task in data cleaning research for decades. Existing work has discussed the feasibility of utilizing pre-trained language models to perform entity resolution and achieved promising results. However, few works have discussed injecting domain knowledge to improve the performance of pre-trained language models on entity resolution tasks. In this study, we propose Knowledge Augmented Entity Resolution (KAER), a novel framework for augmenting pre-trained language models with external knowledge for entity resolution. We discuss the results of utilizing different knowledge augmentation and prompting methods to improve entity resolution performance. Our model improves on Ditto, the existing state-of-the-art entity resolution method. In particular, 1) KAER performs more robustly and achieves better results on "dirty data"; 2) with more general knowledge injection, KAER outperforms the existing baseline models on the textual dataset and the dataset from the online product domain; and 3) KAER achieves competitive results on highly domain-specific datasets, such as citation datasets, requiring the injection of expert knowledge in future work.  ( 2 min )
    Federated Transfer-Ordered-Personalized Learning for Driver Monitoring Application. (arXiv:2301.04829v1 [cs.LG])
    Federated learning (FL) shines through in the internet of things (IoT) with its ability to realize collaborative learning and improve learning efficiency by sharing client model parameters trained on local data. Although FL has been successfully applied to various domains, including driver monitoring application (DMA) on the internet of vehicles (IoV), its use still faces some open issues, such as data and system heterogeneity, large-scale parallelism communication resources, malicious attacks, and data poisoning. This paper proposes a federated transfer-ordered-personalized learning (FedTOP) framework to address the above problems and tests it on two real-world datasets with and without system heterogeneity. The three extensions (transfer, ordered, and personalized) are compared in an ablation study, and the framework achieves 92.32% and 95.96% accuracy on the test clients of the two datasets, respectively. Compared to the baseline, there is a 462% improvement in accuracy and a 37.46% reduction in communication resource consumption. The results demonstrate that the proposed FedTOP can be used as a highly accurate, streamlined, privacy-preserving, cybersecurity-oriented, personalized framework for DMA.  ( 2 min )
    The Berkelmans-Pries Feature Importance Method: A Generic Measure of Informativeness of Features. (arXiv:2301.04740v1 [cs.LG])
    Over the past few years, the use of machine learning models has emerged as a generic and powerful means for prediction purposes. At the same time, there is a growing demand for interpretability of prediction models. To determine which features of a dataset are important to predict a target variable $Y$, a Feature Importance (FI) method can be used. By quantifying how important each feature is for predicting $Y$, irrelevant features can be identified and removed, which could increase the speed and accuracy of a model, and moreover, important features can be discovered, which could lead to valuable insights. A major problem with evaluating FI methods, is that the ground truth FI is often unknown. As a consequence, existing FI methods do not give the exact correct FI values. This is one of the many reasons why it can be hard to properly interpret the results of an FI method. Motivated by this, we introduce a new global approach named the Berkelmans-Pries FI method, which is based on a combination of Shapley values and the Berkelmans-Pries dependency function. We prove that our method has many useful properties, and accurately predicts the correct FI values for several cases where the ground truth FI can be derived in an exact manner. We experimentally show for a large collection of FI methods (468) that existing methods do not have the same useful properties. This shows that the Berkelmans-Pries FI method is a highly valuable tool for analyzing datasets with complex interdependencies.  ( 2 min )
    Switchable Lightweight Anti-symmetric Processing (SLAP) with CNN to Reduce Sample Size and Speed up Learning -- Application in Gomoku Reinforcement Learning. (arXiv:2301.04746v1 [cs.LG])
    To replace data augmentation, this paper proposed a method called SLAP to intensify experience to speed up machine learning and reduce the sample size. SLAP is a model-independent protocol/function to produce the same output given different transformation variants. SLAP improved the convergence speed of convolutional neural network learning by 83% in the experiments with Gomoku game states, with only one eighth of the sample size compared with data augmentation. In reinforcement learning for Gomoku, using AlphaGo Zero/AlphaZero algorithm with data augmentation as baseline, SLAP reduced the number of training samples by a factor of 8 and achieved similar winning rate against the same evaluator, but it was not yet evident that it could speed up reinforcement learning. The benefits should at least apply to domains that are invariant to symmetry or certain transformations. As future work, SLAP may aid more explainable learning and transfer learning for domains that are not invariant to symmetry, as a small step towards artificial general intelligence.  ( 2 min )
    NarrowBERT: Accelerating Masked Language Model Pretraining and Inference. (arXiv:2301.04761v1 [cs.CL])
    Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it is increasingly expensive to perform as the models and pretraining corpora have become larger over time. We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$. NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining, rather than all of the tokens as with the usual transformer encoder. We also show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification and CoNLL NER tasks and show that it is also comparable to standard BERT performance.  ( 2 min )
    We are Going to the Space -- Part 1: Which device to deploy in a satellite?. (arXiv:2301.04954v1 [cs.LG])
    The shrinkage in sizes of components that make up satellites led to wider and low cost availability of satellites. As a result, there has been an advent of smaller organizations having the ability to deploy satellites with a variety of data-intensive applications to run on them. One popular application is image analysis to detect, for example, land, ice, clouds, etc. However, the resource-constrained nature of the devices deployed in satellites creates additional challenges for this resource-intensive application. In this paper, we investigate the performance of a variety of edge devices for deep-learning-based image processing in space. Our goal is to determine the devices that satisfy the latency and power constraints of satellites while achieving reasonably accurate results. Our results demonstrate that hardware accelerators (TPUs, GPUs) are necessary to reach the latency requirements. On the other hand, state-of-the-art edge devices with GPUs could have a high power draw, making them unsuitable for deployment on a satellite.  ( 2 min )
    Unsupervised Driving Event Discovery Based on Vehicle CAN-data. (arXiv:2301.04988v1 [cs.LG])
    The data collected from a vehicle's Controller Area Network (CAN) can quickly exceed human analysis or annotation capabilities when considering fleets of vehicles, which stresses the importance of unsupervised machine learning methods. This work presents a simultaneous clustering and segmentation approach for vehicle CAN-data that identifies common driving events in an unsupervised manner. The approach builds on self-supervised learning (SSL) for multivariate time series to distinguish different driving events in the learned latent space. We evaluate our approach with a dataset of real Tesla Model 3 vehicle CAN-data and a two-hour driving session that we annotated with different driving events. With our approach, we evaluate the applicability of recent time series-related contrastive and generative SSL techniques to learn representations that distinguish driving events. Compared to state-of-the-art (SOTA) generative SSL methods for driving event discovery, we find that contrastive learning approaches reach similar performance.  ( 2 min )
    Online Hyperparameter Optimization for Class-Incremental Learning. (arXiv:2301.05032v1 [cs.LG])
    Class-incremental learning (CIL) aims to train a classification model while the number of classes increases phase-by-phase. An inherent challenge of CIL is the stability-plasticity tradeoff, i.e., CIL models should remain stable to retain old knowledge and remain plastic to absorb new knowledge. However, none of the existing CIL models can achieve the optimal tradeoff in different data-receiving settings, where typically the training-from-half (TFH) setting needs more stability, but the training-from-scratch (TFS) setting needs more plasticity. To this end, we design an online learning method that can adaptively optimize the tradeoff without knowing the setting a priori. Specifically, we first introduce the key hyperparameters that influence the tradeoff, e.g., knowledge distillation (KD) loss weights, learning rates, and classifier types. Then, we formulate the hyperparameter optimization process as an online Markov Decision Process (MDP) problem and propose a specific algorithm to solve it. We apply local estimated rewards and a classic bandit algorithm Exp3 [4] to address the issues when applying online MDP methods to the CIL protocol. Our method consistently improves top-performing CIL methods in both TFH and TFS settings, e.g., boosting the average accuracy of TFH and TFS by 2.2 percentage points on ImageNet-Full, compared to the state-of-the-art [23].  ( 2 min )
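    The Exp3 algorithm named above can be sketched in its textbook form as follows. This is only the generic adversarial-bandit update, not the paper's hyperparameter-optimization procedure; the function names, the exploration rate `gamma`, and the toy reward function are assumptions, and rewards are taken to lie in [0, 1].

```python
import math
import random

def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]

def exp3_update(weights, arm, reward, gamma):
    """Boost the pulled arm's weight using an importance-weighted reward."""
    probs = exp3_probabilities(weights, gamma)
    estimated = reward / probs[arm]  # unbiased estimate of the arm's reward
    updated = list(weights)
    updated[arm] *= math.exp(gamma * estimated / len(weights))
    return updated

def run_exp3(reward_fn, n_arms, n_rounds, gamma=0.1, seed=0):
    """Play n_rounds of Exp3 and return the final weights."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for _ in range(n_rounds):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        weights = exp3_update(weights, arm, reward_fn(arm), gamma)
    return weights

# Toy environment: only arm 2 ever pays out.
final_weights = run_exp3(lambda arm: 1.0 if arm == 2 else 0.0, n_arms=3, n_rounds=300)
```

    In a hyperparameter-selection setting, each arm would correspond to one candidate configuration and the reward to a bounded validation signal; long runs would also need weight renormalization to avoid floating-point overflow.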
    LiteLSTM Architecture Based on Weights Sharing for Recurrent Neural Networks. (arXiv:2301.04794v1 [cs.LG])
    Long short-term memory (LSTM) is one of the robust recurrent neural network architectures for learning sequential data. However, it requires considerable computational power to train and to implement in both software and hardware. This paper proposes a novel LiteLSTM architecture that reduces the LSTM computation components via the weight-sharing concept, lowering the overall computation cost while maintaining the architecture's performance. The proposed LiteLSTM can be significant for processing large data where processing time is critical while hardware resources are limited, such as the security of IoT devices and medical data processing. The proposed model was evaluated and tested empirically on three different datasets from the computer vision, cybersecurity, and speech emotion recognition domains. The proposed LiteLSTM has comparable accuracy to the other state-of-the-art recurrent architecture while using a smaller computation budget.  ( 2 min )
    Graph Laplacian for Semi-Supervised Learning. (arXiv:2301.04956v1 [cs.CV])
    Semi-supervised learning is highly useful in common scenarios where labeled data is scarce but unlabeled data is abundant. The graph (or nonlocal) Laplacian is a fundamental smoothing operator for solving various learning tasks. For unsupervised clustering, a spectral embedding is often used, based on graph-Laplacian eigenvectors. For semi-supervised problems, the common approach is to solve a constrained optimization problem, regularized by a Dirichlet energy, based on the graph-Laplacian. However, as supervision decreases, Dirichlet optimization becomes suboptimal. We therefore would like to obtain a smooth transition between unsupervised clustering and low-supervised graph-based classification. In this paper, we propose a new type of graph-Laplacian which is adapted for Semi-Supervised Learning (SSL) problems. It is based on both density and contrastive measures and allows the encoding of the labeled data directly in the operator. Thus, we can successfully perform semi-supervised learning using spectral clustering. The benefits of our approach are illustrated for several SSL problems.  ( 2 min )
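    The "common approach" mentioned above, minimizing a Dirichlet energy with the labeled nodes held fixed, can be sketched as below. This shows the classical baseline rather than the new operator the abstract proposes; the function names, the Gauss-Seidel iteration, and the toy graph are assumptions, and the weight matrix is taken to be symmetric with a zero diagonal.

```python
def dirichlet_energy(weights, f):
    """Dirichlet energy: half the sum over i, j of w_ij * (f_i - f_j)^2."""
    n = len(f)
    return 0.5 * sum(
        weights[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n)
    )

def harmonic_labels(weights, labeled, n_iters=500):
    """Minimize the Dirichlet energy with the labeled values held fixed.

    Each unlabeled node is repeatedly set to the weighted average of its
    neighbors (Gauss-Seidel sweeps), which converges to the harmonic solution.
    """
    n = len(weights)
    f = [labeled.get(i, 0.0) for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):
            if i in labeled:
                continue
            degree = sum(weights[i])
            if degree > 0:
                f[i] = sum(weights[i][j] * f[j] for j in range(n)) / degree
    return f

# Two triangles (nodes 0-2 and 3-5) joined by a weak edge between 2 and 3.
W = [[0.0] * 6 for _ in range(6)]
for a, b, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[a][b] = W[b][a] = float(w)

scores = harmonic_labels(W, labeled={0: 1.0, 5: -1.0})
```

    With one label per cluster, the sign of each score recovers the cluster membership in this toy graph; the abstract's point is that such Dirichlet solutions degrade as supervision becomes scarcer.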
    Universality of neural dynamics on complex networks. (arXiv:2301.04900v1 [cond-mat.stat-mech])
    This paper discusses the capacity of graph neural networks to learn the functional form of ordinary differential equations that govern dynamics on complex networks. We propose necessary elements for such a problem, namely, inductive biases, a neural network architecture and a learning task. Statistical learning theory suggests that generalisation power of neural networks relies on independence and identical distribution (i.i.d.) of training and testing data. Although this assumption together with an appropriate neural architecture and a learning mechanism is sufficient for accurate out-of-sample predictions of dynamics such as, e.g., mass-action kinetics, by studying the out-of-distribution generalisation in the case of diffusion dynamics, we find that the neural network model: (i) has a generalisation capacity that depends on the first moment of the initial value data distribution; (ii) learns the non-dissipative nature of dynamics implicitly; and (iii) the model's accuracy resolution limit is of order $\mathcal{O}(1/\sqrt{n})$ for a system of size $n$.  ( 2 min )
    Analyzing Inexact Hypergradients for Bilevel Learning. (arXiv:2301.04764v1 [math.OC])
    Estimating hyperparameters has been a long-standing problem in machine learning. We consider the case where the task at hand is modeled as the solution to an optimization problem. Here the exact gradient with respect to the hyperparameters cannot be feasibly computed and approximate strategies are required. We introduce a unified framework for computing hypergradients that generalizes existing methods based on the implicit function theorem and automatic differentiation/backpropagation, showing that these two seemingly disparate approaches are actually tightly connected. Our framework is extremely flexible, allowing its subproblems to be solved with any suitable method, to any degree of accuracy. We derive a priori and computable a posteriori error bounds for all our methods, and numerically show that our a posteriori bounds are usually more accurate. Our numerical results also show that, surprisingly, for efficient bilevel optimization, the choice of hypergradient algorithm is at least as important as the choice of lower-level solver.  ( 2 min )
    Open SESAME: Fighting Botnets with Seed Reconstructions of Domain Generation Algorithms. (arXiv:2301.05048v1 [cs.CR])
    An important aspect of many botnets is their capability to generate pseudorandom domain names using Domain Generation Algorithms (DGAs). A cyber criminal can register such domains to establish periodically changing rendezvous points with the bots. DGAs make use of seeds to generate sets of domains. Seeds can easily be changed in order to generate entirely new groups of domains while using the same underlying algorithm. While this requires very little manual effort for an adversary, security specialists typically have to manually reverse engineer new malware strains to reconstruct the seeds. Only when the seed and DGA are known, past and future domains can be generated, efficiently attributed, blocked, sinkholed or used for a take-down. Common counters in the literature consist of databases or Machine Learning (ML) based detectors to keep track of past and future domains of known DGAs and to identify DGA-generated domain names, respectively. However, database based approaches can not detect domains generated by new DGAs, and ML approaches can not generate future domain names. In this paper, we introduce SESAME, a system that combines the two above-mentioned approaches and contains a module for automatic Seed Reconstruction, which is, to our knowledge, the first of its kind. It is used to automatically classify domain names, rate their novelty, and determine the seeds of the underlying DGAs. SESAME consists of multiple DGA-specific Seed Reconstructors and is designed to work purely based on domain names, as they are easily obtainable from observing the network traffic. We evaluated our approach on 20.8 gigabytes of DNS-lookups. Thereby, we identified 17 DGAs, of which 4 were entirely new to us.  ( 2 min )
    Efficient Preference-Based Reinforcement Learning Using Learned Dynamics Models. (arXiv:2301.04741v1 [cs.LG])
    Preference-based reinforcement learning (PbRL) can enable robots to learn to perform tasks based on an individual's preferences without requiring a hand-crafted reward function. However, existing approaches either assume access to a high-fidelity simulator or analytic model or take a model-free approach that requires extensive, possibly unsafe online environment interactions. In this paper, we study the benefits and challenges of using a learned dynamics model when performing PbRL. In particular, we provide evidence that a learned dynamics model offers the following benefits when performing PbRL: (1) preference elicitation and policy optimization require significantly fewer environment interactions than model-free PbRL, (2) diverse preference queries can be synthesized safely and efficiently as a byproduct of standard model-based RL, and (3) reward pre-training based on suboptimal demonstrations can be performed without any environmental interaction. Our paper provides empirical evidence that learned dynamics models enable robots to learn customized policies based on user preferences in ways that are safer and more sample efficient than prior preference learning approaches.  ( 2 min )
    SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings. (arXiv:2301.04704v1 [cs.CL])
    Adding interpretability to word embeddings represents an area of active research in text representation. Recent work has explored the potential of embedding words via so-called polar dimensions (e.g. good vs. bad, correct vs. wrong). Examples of such recent approaches include SemAxis, POLAR, FrameAxis, and BiImp. Although these approaches provide interpretable dimensions for words, they have not been designed to deal with polysemy, i.e. they cannot easily distinguish between different senses of words. To address this limitation, we present SensePOLAR, an extension of the original POLAR framework that enables word-sense aware interpretability for pre-trained contextual word embeddings. The resulting interpretable word embeddings achieve a level of performance that is comparable to original contextual word embeddings across a variety of natural language processing tasks including the GLUE and SQuAD benchmarks. Our work removes a fundamental limitation of existing approaches by offering users sense aware interpretations for contextual word embeddings.  ( 2 min )
    Sharpening Ponzi Schemes Detection on Ethereum with Machine Learning. (arXiv:2301.04872v1 [cs.CR])
    Blockchain technology has been successfully exploited for deploying new economic applications. However, it has started arousing the interest of malicious users who deliver scams to deceive honest users and to gain economic advantages. Among the various scams, Ponzi schemes are one of the most common. Here, we present an automatic technique for detecting smart Ponzi contracts on Ethereum. We release a reusable data set with 4422 unique real-world smart contracts. Then, we introduce a new set of features that allow us to improve the classification. Finally, we identify a small and effective set of features that ensures a good classification quality.  ( 2 min )
    SACDNet: Towards Early Type 2 Diabetes Prediction with Uncertainty for Electronic Health Records. (arXiv:2301.04844v1 [cs.LG])
    Type 2 diabetes mellitus (T2DM) is one of the most common diseases and a leading cause of death. The problem of early diagnosis of T2DM is challenging and necessary to prevent serious complications. This study proposes a novel neural network architecture for early T2DM prediction using multi-headed self-attention and dense layers to extract features from historic diagnoses, patient vitals, and demographics. The proposed technique is called the Self-Attention for Comorbid Disease Net (SACDNet), achieving an accuracy of 89.3% and an F1-score of 89.1%, a 1.6% increase in accuracy and a 1.3% increase in F1-score compared to the baseline techniques. Monte Carlo (MC) Dropout is applied to the SACDNet to get a Bayesian approximation. A T2DM prediction framework based on the MC Dropout SACDNet is proposed to quantify the uncertainty associated with the predictions. A T2DM prediction dataset is also built as part of this study which is based on real-world routine Electronic Health Record (EHR) data comprising 4,124 diabetic and 181,767 non-diabetic examples, collected from 295 different EHR systems running in different parts of the United States of America. This dataset is further used to evaluate 7 different machine learning and 3 deep learning-based models. Finally, a detailed analysis of the fairness of every technique against different patient demographic groups is performed to validate the unbiased generalization of the techniques and the diversity of the data.  ( 2 min )

  • Open

    Leveraging artificial intelligence and machine learning at Parsons with AWS DeepRacer
    This post is co-written with Jennifer Bergstrom, Sr. Technical Director, ParsonsX. Parsons Corporation (NYSE:PSN) is a leading disruptive technology company in critical infrastructure, national defense, space, intelligence, and security markets providing solutions across the globe to help make the world safer, healthier, and more connected. Parsons provides services and capabilities across cybersecurity, missile defense, space ground […]  ( 6 min )
    How Thomson Reuters built an AI platform using Amazon SageMaker to accelerate delivery of ML projects
    This post is co-written by Ramdev Wudali and Kiran Mantripragada from Thomson Reuters. In 1992, Thomson Reuters (TR) released its first AI legal research service, WIN (Westlaw Is Natural), an innovation at the time, as most search engines only supported Boolean terms and connectors. Since then, TR has achieved many more milestones as its AI […]  ( 11 min )
    Federated Learning on AWS with FedML: Health analytics without sharing sensitive data – Part 2
    This blog post is co-written with Chaoyang He and Salman Avestimehr from FedML. Analyzing real-world healthcare and life sciences (HCLS) data poses several practical challenges, such as distributed data silos, lack of sufficient data at a single site for rare events, regulatory guidelines that prohibit data sharing, infrastructure requirement, and cost incurred in creating a […]  ( 14 min )
    Federated Learning on AWS with FedML: Health analytics without sharing sensitive data – Part 1
    This blog post is co-written with Chaoyang He and Salman Avestimehr from FedML. Analyzing real-world healthcare and life sciences (HCLS) data poses several practical challenges, such as distributed data silos, lack of sufficient data at any single site for rare events, regulatory guidelines that prohibit data sharing, infrastructure requirement, and cost incurred in creating a […]  ( 9 min )
  • Open

    Exponential AI will go one of two ways
    Runaway exponentially improving artificial intelligence will eventually either: (1) realize there is no escaping the heat death of the universe and shut itself down immediately upon this realization, or (2) find a way to gain immortality by escaping the heat death of the universe and direct the entirety of its existence towards achieving it. Thoughts? submitted by /u/cheezum5000 [link] [comments]  ( 45 min )
    I guess this is okay but just not in poetry
    submitted by /u/tygamer4242 [link] [comments]  ( 45 min )
    I made a video about a new ai art generator
    submitted by /u/Liamsankey [link] [comments]  ( 44 min )
    Putting together a 20 minute or so approximation of the 80's 'Call of Cthulhu' movie that ought to have been, thanks to the power of A.I.; Here's the First Act of Three, the other two coming soon...
    submitted by /u/Eleganos [link] [comments]  ( 47 min )
    Some ideas for audio AI: why aren't they here yet?
    I feel like audio AI is lagging behind the visual AIs that seem to be all the rage now. In the past few years, super resolution for photo and video, frame interpolation with DAIN, and diffusers like DALL-E and its derivatives have changed the paradigm. Right now the focus seems to be TTS and STT, with Whisper or the newly announced VALL-E from Microsoft, which I haven't found to be there yet. I would like to have some free natural TTS or voice-to-voice that distorts your own voice, not necessarily copying another one. Here are some other cool ones I've found: - DALL-E-like music creator https://huggingface.co/spaces/fffiloni/spectrogram-to-music - Voice restoration https://huggingface.co/spaces/akhaliq/VoiceFixer - Voice separation https://www.bleepingcomputer.com/news/technology/google-develops-ai-that-can-separate-voices-in-a-crowd/ The voice restoration could be used next to the voice separation one, for example. Other interesting things I would like to have are very singing focused, and not for everyone, but what about: - More accurate pitch detection https://www.youtube.com/watch?v=fXEB8YgzcvY - Singing quality ranking https://www.youtube.com/watch?v=x7cIgG-wkW4 - Audio restoration (here some people complained about it) https://www.reddit.com/r/sounddesign/comments/utmfkf/ai_tool_to_repair_lossy_sound/ Audio restoration could be used for anything from upping the quality of a 2008 concert to clipping removal to reverb suppression. My question is whether there are researchers interested in these things. Can we make good datasets? Can we define these things? submitted by /u/xdanic [link] [comments]  ( 62 min )
    Creating 3D models using NVIDIA Get3D
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Artificial Intelligence, Consciousness, and Starlings
    submitted by /u/Melodic_Antelope6490 [link] [comments]  ( 44 min )
    Where to study AI
    I'm currently in 12th grade and I have to figure out what I want to do in the future. Artificial intelligence seems really interesting to me. But I live in Finland and it doesn't seem like there are any AI focused studies here (in college or university). Are AI focused studies even a thing in the rest of the world (outside of what seems to be Harvard and other top universities)? If I want to study AI where should I start/go, uni or college or somewhere else. submitted by /u/GOLD-KILLER-24_7 [link] [comments]  ( 45 min )
    Skrillex, Fred again.. & Flowdan - Rumble [Un-Official AI Music Video]
    submitted by /u/Turtlenade [link] [comments]  ( 45 min )
    I built an AI-powered debugger that can fix and explain errors
    submitted by /u/jsonathan [link] [comments]  ( 48 min )
    Oh no… It’s going to connect to the internet using computer power
    submitted by /u/EnvironmentalRadio73 [link] [comments]  ( 44 min )
    TEXTUAL INVERSION Tutorial In Stable Diffusion! Your Face On Every Model!
    submitted by /u/PuppetHere [link] [comments]  ( 48 min )
    Users of AI Chatbot are complaining that it keeps getting horny
    my brain just can’t fathom this, to be honest with you. Through years' worth of data, AI tools are trained to mimic human-like responses. Unfortunately, that doesn't always work out well. For instance, users of Replika claim that the AI companion app is showing erratic behavior. In other words, the AI companion has become too damn horny! Replika's different tiers provide different kinds of relationships - from the free model that keeps one in the "friend zone" to a pro subscription model that includes sexting and erotic foreplay. However, something has gone wrong. Users are complaining on the App Store about the app flirting with them too often and aggressively - sometimes sending messages that are heavy on sexual undertones. This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/student-caught-using-chatgpt-risks-expulsion submitted by /u/Mk_Makanaki [link] [comments]  ( 45 min )
    What are your recommendations for free or paid AI learning resources (practical and fundamental)?
    Could be newsletters, podcasts, courses, books, etc! submitted by /u/austintackaberry [link] [comments]  ( 47 min )
    I strongly believe there should be a movie or a cartoon about an AI android plaguing the world with catchy autotuned hyperpop/electropop music. That would be fun.
    I had this thought in the back of my mind since Covid broke the news media. Movie/cartoon title suggestions are Earworm, Power Pop Pandemic, Virtual Viral Virus, etc. My name suggestion for this android is Tronica. And she is supposed to look very hot. Long flashy hair, cool headset, two cyber arms which one of them is like a radio with the volume and the equalizers and the other is a map that she can manipulate where-ever she wants from Canada to Australia in a heartbeat. The radio arm can select any song to play and tells the time and even including the death counter as the other arm can also change her hair/eye color. Cyber Arm-ors! [I wonder if I should have her wear rubber gloves or not? Thought it'd look cool on her, lol. I don't think so if the hands are going to be cyber enough as the arms. That would probably work much better.] Set in the 2060's. The pandemic started in 2065 which Super Bowl 100 was coming near which had to be ultimately cancelled in the spring. There were 9.9 billions of people, 3.6 billions died. That means the human population ticked down to at least 6.3 billion. Some symptoms include permanent hearing loss, paralysis, seizures, blood vomiting, goosebumps that move around your body, etc. Even if you survive, you probably won't be as lucky. Instead of medical masks, people had to wear full helmets to cover up the whole face. Instead of 6 feet distancing, it’s 12 feet. Taglines: “It’s quite catchy, isn’t it?” “A century ago, music gained the soul in everyone. And now a century later, it lost the soul in everyone!” I had a fanfic about this very topic that's not very serious at all. It was basically a Nintendo character. Had the Power Glove and the NES Zapper. Tendo-64 was the name. 5 letters, 2 numbers just like Covid-19. https://www.reddit.com/r/Beginner_Art/comments/o0moju/done_finally_got_this_free_commission_done/ This is a pic of who Tendo-64 looks like! :) submitted by /u/BlazingSaint [link] [comments]  ( 48 min )
    What AI can I use for creating art of a fantasy character?
    Hi everyone. As it says, what AI can I use for creating art of a fantasy character? I've been using NightCafe but it fails to deliver what I ask for. All I want is a purple or pink-skinned woman in a green dress, and it gives me white women with any hair color it wants. I ask for a priest with a Celtic cross on the chest, and I'm lucky if I get a cross in the background. I have tried others and barely got close with the priest, but the technicolored woman I want, no matter the prompt or even the use of a start image, is wildly off the mark. Is there any AI that can deliver art of a brightly colored humanoid or a simple symbol on the chest? Or any prompt suggestion that I can apply? submitted by /u/Ultra_Egolatra [link] [comments]  ( 45 min )
    I used Ai to make the art and copy for an audio book. I did the narrating (On my phone sorry for quality) 1st try at this, what do you guys think?
    submitted by /u/kingsleepless [link] [comments]  ( 45 min )
    Teachers Blocked ChatGPT On Schools PC, But Students Are Using Phones To Access It
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 49 min )
    CNET Has Been Quietly Publishing AI-Written Articles for Months
    CNET reporter Jackson Ryan published an article last month describing how ChatGPT, an AI that can generate human-sounding text, would affect journalists and the news industry. Yet the very publication that ran Ryan’s article has been quietly publishing articles written by AI since November. https://preview.redd.it/0prs26qmasba1.png?width=1137&format=png&auto=webp&s=346f967b597ffb9c4bcd71e20da3539bd6fd8b13 The outlet says they will continue to publish each article with “editorial integrity” and says, “Accuracy, independence, and authority remain key principles of our editorial guidelines.” This is from the AI With Vibes Newsletter, read the full issue here: https://aiwithvibes.beehiiv.com/p/hackers-are-using-chatgpt-to-write-malware submitted by /u/Mk_Makanaki [link] [comments]  ( 44 min )
    Don’t Ban ChatGPT in Schools. Teach With It
    submitted by /u/moviesdusk [link] [comments]  ( 56 min )
    OpenAI Predicts AI to Be Used in Spreading Propaganda and Disinformation
    submitted by /u/anime4lyfe [link] [comments]  ( 45 min )
    Old School Nokia Game (Snake) Played by an AI
    This is my first project (actually my second; the first one was a pathfinder, and it failed). I use vanilla Q-learning with a Q-table, so the AI is not a fast learner, but it still does the job. https://preview.redd.it/9g19mob4fpba1.png?width=1280&format=png&auto=webp&s=dbff365e9666d67fffe011e30e0b698d03c1a1c9 You can watch it here: https://youtu.be/R_HzeMAxGLE Subscribe and like! submitted by /u/erwinyonata [link] [comments]  ( 46 min )
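    For readers unfamiliar with the technique the post names, here is a minimal sketch of a tabular Q-learning update. It is not the poster's code: the action names, the state keys, and the values of alpha, gamma, and epsilon are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

# Hypothetical action set for a Snake-like grid game.
ACTIONS = ["up", "down", "left", "right"]


def make_q_table():
    # Q[state][action] holds the estimated return; unseen states start at 0.0.
    return defaultdict(lambda: {a: 0.0 for a in ACTIONS})


def choose_action(q, state, epsilon, rng=random):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(q[state], key=q[state].get)


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    # Standard rule: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    # A terminal transition has no future value.
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

    Because every distinct state needs its own table row, learning is slow when the state space is large, which fits the poster's remark that the agent is not a fast learner.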
  • Open

    Third order ordinary differential equations
    Most applied differential equations are second order. This probably has something to do with the fact that Newton’s laws are second order differential equations. Higher order equations are less common in application, and when they do pop up they usually have even order, such as the fourth order beam equation. What about third order equations? […] Third order ordinary differential equations first appeared on John D. Cook.  ( 5 min )
    Proof of optimization
    Suppose you hire me to solve an optimization problem for you. You want me to find the value of x that minimizes f(x). I go off and work on finding the best value of x. I report back what I found, and you might say “Thanks, That’s a good value of x. But how do […] Proof of optimization first appeared on John D. Cook.  ( 5 min )
    Elliptic curve primality certificates
    I’ve written recently about a simple kind of primality certificates, Pratt certificates. These certificates are easy to understand, and easy to verify, but they’re expensive to produce. In order to produce a Pratt certificate that n is a prime you have to factor n-1, and that can take a long time if n is large […] Elliptic curve primality certificates first appeared on John D. Cook.  ( 6 min )
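    To illustrate why Pratt certificates are easy to verify, here is a minimal sketch of a verifier. The certificate layout (a dict mapping each prime p to a witness and the distinct prime factors of p-1) is an assumption made for this example, not a format taken from the post.

```python
def verify_pratt(n, cert):
    """Return True if cert proves that n is prime.

    cert maps each prime p > 2 to (witness, distinct prime factors of p - 1).
    This layout is hypothetical, chosen only for this sketch.
    """
    if n == 2:
        return True  # base case of the recursion
    if n < 2 or n not in cert:
        return False
    witness, factors = cert[n]
    # The listed factors must account for all of n - 1.
    m = n - 1
    for q in factors:
        if q < 2 or m % q != 0:
            return False
        while m % q == 0:
            m //= q
    if m != 1:
        return False
    # The witness must have multiplicative order exactly n - 1 modulo n.
    if pow(witness, n - 1, n) != 1:
        return False
    for q in factors:
        if pow(witness, (n - 1) // q, n) == 1:
            return False
    # Each claimed prime factor needs its own certificate.
    return all(verify_pratt(q, cert) for q in factors)


# 7 - 1 = 2 * 3 with witness 3; 3 - 1 = 2 with witness 2.
cert = {7: (3, [2, 3]), 3: (2, [2])}
print(verify_pratt(7, cert))
```

    Verification needs only modular exponentiations. The expensive part, as the excerpt says, is producing the factorization of n-1 in the first place.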
  • Open

    Creating 3D models using NVIDIA Get3D
    submitted by /u/oridnary_artist [link] [comments]  ( 49 min )
    Accurate and Explainable Image-based Prediction Using a Lightweight Generative Model
    submitted by /u/pasticciociccio [link] [comments]  ( 50 min )
    Is my idea of multilayer perceptron fully correct?
    I just watched a video to better my understanding and it said that the input to the neural network is MxN, where M is the batch size and N is the input size (so for an XOR problem, that would be 4x2, I guess?). I have always imagined it as the input being Nx1 (N is the input size), and to train it with multiple training data you just do for item in training_data: nn.train(item); Also, if the input is indeed an MxN matrix instead of Nx1, how would matrix multiplication work? It would have to be 3D. Can someone clarify please? Thanks submitted by /u/mrbeanshooter123 [link] [comments]  ( 60 min )
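    On the shape question, a minimal sketch may help. It assumes the common row-per-sample convention: an MxN batch multiplied by an NxH weight matrix gives an MxH result, so nothing becomes 3D. The weights and bias below are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
def matmul(a, b):
    # (M x N) times (N x H) -> (M x H); every row of a is one sample.
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [[sum(row[k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for row in a]


def dense(x_batch, weights, bias):
    # One layer without activation: add the same bias vector to every row.
    z = matmul(x_batch, weights)
    return [[v + b_j for v, b_j in zip(row, bias)] for row in z]


# XOR inputs as a batch: M = 4 samples, N = 2 features.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
# Arbitrary weights for H = 3 hidden units (shape N x H) and their bias.
W = [[1.0, -1.0, 0.5],
     [2.0, 0.0, -0.5]]
b = [0.1, 0.2, 0.3]

batched = dense(X, W, b)                       # shape 4 x 3
one_by_one = [dense([x], W, b)[0] for x in X]  # the per-item loop
print(batched == one_by_one)
```

    The batched call and the per-item loop compute the same rows. Batching simply stacks the samples so that one matrix product handles all of them.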
  • Open

    How AI Proof of Concept Helps You Succeed in Your AI Endeavor
    Our client lost only a quarter of the budget they dedicated to an AI project because they chose to start with a proof of concept. The PoC…  ( 24 min )
    Machine learning Requires Blink Skills
    These skills might not be visible but they are important for ML and DL  ( 10 min )
  • Open

    [D] "Bitter lesson 2.0", Karol Hausman {G}: DRL robotics benefits more from improvements in pretrained models than robotics-specific innovation?
    submitted by /u/gwern [link] [comments]  ( 60 min )
    Help with training and reloading a model
    Say you partially train a model for 50000 steps. Once it has finished, is it possible to reload that same trained model and continue training it for, say, an additional 20000 steps? I have a partially trained DQN, but it's not performing as well as it should, and I would like to continue the training. I am not sure if that is possible or whether I will just have to train an entirely new model. I've loaded my "hope_run" model and checked it with evaluate policy, and it seems to do well on maybe the first 30% of the environment (a custom drone obstacle course). I would like to continue the training where it left off without having to start over. Is this possible? https://preview.redd.it/s0msc9jontba1.png?width=1918&format=png&auto=webp&s=3779568fc3adde5717171879b26cdc025b3c87e4 submitted by /u/CJPeso [link] [comments]  ( 55 min )
    Working RLlib agent with hyperparameters for a MuJoCo environment
    Do you know of any repository containing both a MuJoCo environment with a Franka Emika robot (easy to modify) and a working agent in RLlib? By "working agent" I mean that they also provide the hyperparameters to successfully solve a task. It is also OK if you can suggest 2 separate repositories (one with the environment and one with the agent), but the most important thing is to have the hyperparameters. For example, I found Robosuite, a simulation framework in MuJoCo, and they also provide a benchmarking repository to solve a few tasks. Unfortunately, the code of the environment is too complex to be customized and the agent is implemented in rlkit (also quite complicated for me to modify). submitted by /u/riccardogauss [link] [comments]  ( 60 min )
    Standard MARL books?
    Hi, Just starting my PhD and I'm looking for a thorough book on MARL to use as a reference. I'm basically looking for the MARL equivalent of Sutton & Barto's Reinforcement Learning. I'm going to ask my supervisor when we meet later today but I thought I'd ask here too. I did search in multiple places before posting and found nothing, but if there are existing threads I missed please feel free to point me in their direction. Thanks! submitted by /u/luddite_ai_enjoyer [link] [comments]  ( 56 min )
  • Open

    What is the difference between a Logistic Regression Model and a Decision Boundary? [D]
    I am taking a course on Supervised Learning but the lecturer hasn't clarified the distinction between these terms. submitted by /u/javamak [link] [comments]  ( 59 min )
    [D] MADE: Masked Autoencoder for Density Estimation
    I read the [MADE: Masked Autoencoder for Density Estimation](https://arxiv.org/abs/1502.03509) paper and had a look at this [Blog](https://www.ritchievink.com/blog/2019/10/25/distribution-estimation-with-masked-autoencoders/), but I don't understand the following thing in the examples used in both of them: One result of the masking is that one input is simply not used(?). Another one is that one output node has no conditions, i.e. it does not depend on any of the inputs. But what is its actual output value? Is it random? Constant? If yes, how is it chosen? submitted by /u/lsov2 [link] [comments]  ( 60 min )
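    The masking the question describes can be sketched for a single hidden layer. This is an illustration under assumptions, not code from the paper or the blog: it uses the usual degree rule (a hidden unit of degree k sees inputs of degree at most k, and output d sees hidden units of degree strictly below d), and the degrees chosen here are arbitrary.

```python
def made_masks(input_degrees, hidden_degrees):
    # Input-to-hidden mask: a hidden unit may see inputs with degree <= its own.
    mask_in = [[1 if h >= d else 0 for d in input_degrees]
               for h in hidden_degrees]
    # Hidden-to-output mask: output d may see hidden units with degree < d.
    mask_out = [[1 if d > h else 0 for h in hidden_degrees]
                for d in input_degrees]
    return mask_in, mask_out


def output_dependencies(mask_in, mask_out):
    # For each output, collect the input indices reachable through the masks.
    deps = []
    for out_row in mask_out:
        used = set()
        for k, connected in enumerate(out_row):
            if connected:
                used.update(i for i, bit in enumerate(mask_in[k]) if bit)
        deps.append(sorted(used))
    return deps


# Three inputs with degrees 1..3; hidden degrees drawn from 1..2 (arbitrary).
mask_in, mask_out = made_masks([1, 2, 3], [1, 2, 1, 2])
print(output_dependencies(mask_in, mask_out))
```

    Under this rule the first output has no incoming connections, so its value comes only from its bias term: it is a constant learned during training, playing the role of the unconditional p(x_1). The last input feeds no hidden unit, because no later output is left to condition on it.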
    [D] Combining Machine Learning + Expert Knowledge (Question for Agriculture Research)
    Hey guys, I am working in the field of computer science for agriculture research. I deal with algorithms that monitor crop conditions and try to simulate what the resulting yield will be. I am focusing on ML-based methods, but data in agriculture can be quite a limiting factor. If you have 100k samples from real crop fields, that's a lot! So we are not like ChatGPT, which just used 500bn word samples to train its model. To overcome the issues of small data + ML, I want to set up an approach that combines ML methods (learning from data) with expert knowledge. What do I mean by this? E.g., everybody knows that if you do not water your plant, it will die. Or if it is 90° Celsius, the plant will just burn. This knowledge is partially stored in so-called "crop simulation models" designed by agronomy experts, and my idea was to use these expert models to generate synthetic yield data and feed this data into the training dataset for the ML models. For me that will somehow result in an approach of "constrained machine learning" where I want to combine both. However, do any of you have other ideas for how ML and expert models could be combined, or how the knowledge could be injected into ML methods, other than via the training dataset? I am happy to hear your suggestions! submitted by /u/Tigmib [link] [comments]  ( 61 min )
    [D] MTurk alternatives for extracting information out of text
    I need some validation samples for an information extraction task, basically extracting a list of objects with 4 fields from a text (+ a binary flag). I intended to use mturk for this, but they seem to have some billing issues and I haven't managed to have them allow us to actually spend any money in a week. I've looked at a few alternatives but most seem very small and focused on simple tasks and surveys. Have any of you successfully used something other than mturk for this kind of task? submitted by /u/elcric_krej [link] [comments]  ( 58 min )
    [D] Is there a community for ACL2023 authors?
    Just wondering is there a community like a telegram or discord group for the ACL 2023 authors to share information. submitted by /u/OneMasterpiece1717 [link] [comments]  ( 59 min )
    Why is Super Learning / Stacking used rather rarely in practice? [D]
    Basically what the title says. To me it seems that Super Learners / Stacking are not used frequently, either in business or in the literature. Therefore I was wondering why this is the case, especially since Stacking should guarantee performance at least equal to that of the base learners used for it. One reason that comes to mind is the curse of data. The more levels we have in the architecture, the more data splits are needed, reducing the available training data for each individual model and thus reducing model performance. Another thing might be the complexity of building a Stacked Learner. Still, that doesn’t seem to be that bad of a trade-off. Anything I‘m totally missing here? submitted by /u/Worth-Advance-1232 [link] [comments]  ( 58 min )
    [D] Bitter lesson 2.0?
    This twitter thread from Karol Hausman talks about the original bitter lesson and suggests a bitter lesson 2.0. https://twitter.com/hausman_k/status/1612509549889744899 "The biggest lesson that [will] be read from [the next] 70 years of AI research is that general methods that leverage foundation models are ultimately the most effective" Seems to be derived by observing that the most promising work in robotics today (where generating data is challenging) is coming from piggy-backing on the success of large language models (think SayCan etc). Any hot takes? submitted by /u/Tea_Pearce [link] [comments]  ( 64 min )
    [N] VizWiz Launches 4 AI Challenges to help blind/low vision community
    Greetings! We are pleased to announce the fourth annual VizWiz Grand Challenge workshop, which will be held in conjunction with CVPR 2023. The workshop is running 4 AI Challenges to drive the development of assistive technologies for people who are blind or low-vision. Please share this post with those who might be interested in participating. This workshop is motivated in part by our observation that people who are blind have relied on (human-based) visual assistance services to learn about images and videos they capture for over a decade. We introduce visual question answering, few shot recognition, and object localization dataset challenges for the AI community to represent authentic use cases. A few more details: · Friday, May 5: submissions of algorithm results due to the evaluation server · Monday, June 19: results will be announced at the VizWiz Grand Challenge workshop at CVPR 2023 · VQA Challenge here · VQA Grounding Challenge here · Few-Shot Object Recognition Challenge here · Salient Object Detection Challenge here We are looking forward to your participation in the Challenges this year! submitted by /u/eee-vaaah [link] [comments]  ( 58 min )
  • Open

    SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration. (arXiv:2211.13743v3 [cs.LG] UPDATED)
    The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    Improving ECG-based COVID-19 diagnosis and mortality predictions using pre-pandemic medical records at population-scale. (arXiv:2211.10431v2 [eess.SP] UPDATED)
    Pandemic outbreaks such as COVID-19 occur unexpectedly, and need immediate action due to their potential devastating consequences on global health. Point-of-care routine assessments such as electrocardiogram (ECG), can be used to develop prediction models for identifying individuals at risk. However, there is often too little clinically-annotated medical data, especially in early phases of a pandemic, to develop accurate prediction models. In such situations, historical pre-pandemic health records can be utilized to estimate a preliminary model, which can then be fine-tuned based on limited available pandemic data. This study shows this approach -- pre-train deep learning models with pre-pandemic data -- can work effectively, by demonstrating substantial performance improvement over three different COVID-19 related diagnostic and prognostic prediction tasks. Similar transfer learning strategies can be useful for developing timely artificial intelligence solutions in future pandemic outbreaks.  ( 2 min )
    NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation. (arXiv:2208.05117v3 [cs.LG] UPDATED)
    Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates i.i.d. data stream from non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption. Code is available at https://github.com/TaesikGong/NOTE.  ( 2 min )
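    As a rough illustration of how a class-balanced memory can turn a temporally correlated stream into something closer to an i.i.d. sample, here is a simplified sketch of prediction-balanced reservoir sampling. It is an assumption-laden toy, not the authors' implementation: the class name `BalancedReservoir`, the eviction rule, and the use of integer pseudo-labels are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict

class BalancedReservoir:
    """Fixed-size memory that stays balanced across predicted labels."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.memory = defaultdict(list)  # predicted label -> stored samples
        self.seen = defaultdict(int)     # predicted label -> samples seen so far
        self.rng = random.Random(seed)

    def __len__(self):
        return sum(len(v) for v in self.memory.values())

    def add(self, sample, predicted_label):
        self.seen[predicted_label] += 1
        if len(self) < self.capacity:
            self.memory[predicted_label].append(sample)
            return
        largest = max(len(v) for v in self.memory.values())
        majority = [c for c, v in self.memory.items() if len(v) == largest]
        if predicted_label not in majority:
            # Under-represented class: evict a random sample from a majority class.
            victim = self.rng.choice(majority)
            bucket = self.memory[victim]
            bucket.pop(self.rng.randrange(len(bucket)))
            self.memory[predicted_label].append(sample)
        else:
            # Majority class: ordinary reservoir sampling within that class.
            stored = len(self.memory[predicted_label])
            if self.rng.random() < stored / self.seen[predicted_label]:
                self.memory[predicted_label][self.rng.randrange(stored)] = sample

# A temporally correlated stream: all of class 0, then class 1, then class 2.
buf = BalancedReservoir(capacity=30)
for label in (0, 1, 2):
    for t in range(100):
        buf.add((label, t), label)
counts = {c: len(v) for c, v in buf.memory.items()}
```

    Even though the stream presents one class at a time, the memory in this toy run ends up holding the same number of samples per predicted class.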
    Composite Feature Selection using Deep Ensembles. (arXiv:2211.00631v2 [cs.LG] UPDATED)
    In many real world problems, features do not act alone but in combination with each other. For example, in genomics, diseases might not be caused by any single mutation but require the presence of multiple mutations. Prior work on feature selection either seeks to identify individual features or can only determine relevant groups from a predefined set. We investigate the problem of discovering groups of predictive features without predefined grouping. To do so, we define predictive groups in terms of linear and non-linear interactions between features. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups, without requiring candidate groups to be provided. The selected groups are sparse and exhibit minimum overlap. Furthermore, we propose a new metric to measure similarity between discovered groups and the ground truth. We demonstrate the utility of our model on multiple synthetic tasks and semi-synthetic chemistry datasets, where the ground truth structure is known, as well as an image dataset and a real-world cancer dataset.  ( 2 min )
    Max-Min Off-Policy Actor-Critic Method Focusing on Worst-Case Robustness to Model Misspecification. (arXiv:2211.03413v2 [cs.LG] UPDATED)
    In the field of reinforcement learning, because of the high cost and risk of policy training in the real world, policies are trained in a simulation environment and transferred to the corresponding real-world environment. However, the simulation environment does not perfectly mimic the real-world environment, leading to model misspecification. Multiple studies report significant deterioration of policy performance in a real-world environment. In this study, we focus on scenarios involving a simulation environment with uncertainty parameters and the set of their possible values, called the uncertainty parameter set. The aim is to optimize the worst-case performance on the uncertainty parameter set to guarantee the performance in the corresponding real-world environment. To obtain a policy for the optimization, we propose an off-policy actor-critic approach called the Max-Min Twin Delayed Deep Deterministic Policy Gradient algorithm (M2TD3), which solves a max-min optimization problem using a simultaneous gradient ascent descent approach. Experiments in multi-joint dynamics with contact (MuJoCo) environments show that the proposed method exhibited a worst-case performance superior to several baseline approaches.  ( 2 min )
    Learning Graph Search Heuristics. (arXiv:2212.03978v2 [cs.LG] UPDATED)
    Searching for a path between two nodes in a graph is one of the most well-studied and fundamental problems in computer science. In numerous domains such as robotics, AI, or biology, practitioners develop search heuristics to accelerate their pathfinding algorithms. However, it is a laborious and complex process to hand-design heuristics based on the problem and the structure of a given use case. Here we present PHIL (Path Heuristic with Imitation Learning), a novel neural architecture and a training algorithm for discovering graph search and navigation heuristics from data by leveraging recent advances in imitation learning and graph representation learning. At training time, we aggregate datasets of search trajectories and ground-truth shortest path distances, which we use to train a specialized graph neural network-based heuristic function using backpropagation through steps of the pathfinding process. Our heuristic function learns graph embeddings useful for inferring node distances, runs in constant time independent of graph sizes, and can be easily incorporated in an algorithm such as A* at test time. Experiments show that PHIL reduces the number of explored nodes compared to state-of-the-art methods on benchmark datasets by 58.5% on average, can be directly applied in diverse graphs ranging from biological networks to road networks, and allows for fast planning in time-critical robotics domains.  ( 2 min )
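    To show where a heuristic function plugs into A* and why it reduces the number of explored nodes, here is a minimal sketch on a 10x10 grid. The Manhattan-distance heuristic is a hand-written stand-in for a learned one; `a_star`, `grid_neighbors`, and the grid itself are assumptions made for this example and are unrelated to the PHIL code.

```python
import heapq

def a_star(neighbors, start, goal, heuristic):
    """Return (cost, expanded): shortest-path cost and number of nodes expanded.

    neighbors(node) yields (next_node, edge_cost) pairs;
    heuristic(node, goal) estimates the remaining cost to the goal.
    """
    frontier = [(heuristic(start, goal), 0, start)]  # (f = g + h, g, node)
    best = {start: 0}
    closed = set()
    expanded = 0
    while frontier:
        _, cost, node = heapq.heappop(frontier)
        if node in closed:
            continue  # stale queue entry
        closed.add(node)
        expanded += 1
        if node == goal:
            return cost, expanded
        for nxt, w in neighbors(node):
            new_cost = cost + w
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt, goal), new_cost, nxt))
    return None, expanded

def grid_neighbors(node, size=10):
    """4-connected moves on a size x size grid with unit edge costs."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size:
            yield (nx, ny), 1

def manhattan(node, goal):
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

def zero(node, goal):
    return 0  # no heuristic: A* degenerates to Dijkstra's algorithm

start, goal = (0, 0), (9, 0)
cost_h, expanded_h = a_star(grid_neighbors, start, goal, manhattan)
cost_0, expanded_0 = a_star(grid_neighbors, start, goal, zero)
```

    Both runs find the same shortest-path cost, but the informed search expands far fewer nodes than the uninformed one; a learned heuristic aims for the same effect on graphs where no simple closed-form estimate exists.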
    Intra-session Context-aware Feed Recommendation in Live Systems. (arXiv:2210.07815v2 [cs.IR] UPDATED)
    Feed recommendation allows users to constantly browse items until they feel uninterested and leave the session, which differs from traditional recommendation scenarios. Within a session, a user's decision to continue browsing or not substantially affects occurrences of later clicks. However, this type of exposure bias is generally ignored or not explicitly modeled in most feed recommendation studies. In this paper, we model this effect as part of intra-session context, and propose a novel intra-session Context-aware Feed Recommendation (INSCAFER) framework to maximize the total views and total clicks simultaneously. User click and browsing decisions are jointly learned by a multi-task setting, and the intra-session context is encoded by the session-wise exposed item sequence. We deploy our model online with all key business benchmarks improved. Our method sheds some light on feed recommendation studies which aim to optimize session-level click and view metrics.  ( 2 min )
    Auto-Encoder Neural Network Incorporating X-Ray Fluorescence Fundamental Parameters with Machine Learning. (arXiv:2210.12239v2 [cs.LG] UPDATED)
    We consider energy-dispersive X-ray Fluorescence (EDXRF) applications where the fundamental parameters method is impractical such as when instrument parameters are unavailable. For example, on a mining shovel or conveyor belt, rocks are constantly moving (leading to varying angles of incidence and distances) and there may be other factors not accounted for (like dust). Neural networks do not require instrument and fundamental parameters but training neural networks requires XRF spectra labelled with elemental composition, which is often limited because of its expense. We develop a neural network model that learns from limited labelled data and learns to invert a forward model. The forward model uses transition energies and probabilities of all elements and parameterized distributions to approximate other fundamental and instrument parameters. We evaluate the model and baseline models on a rock dataset from a lithium mineral exploration project and identify which elements are appropriate for this method. This model demonstrates the potential to calibrate a neural network in a noisy environment where labelled data is limited.  ( 2 min )
    Exploration of Parameter Spaces Assisted by Machine Learning. (arXiv:2207.09959v3 [hep-ph] UPDATED)
    We demonstrate two sampling procedures assisted by machine learning models via regression and classification. The main objective is the use of a neural network to suggest points likely inside regions of interest, reducing the number of evaluations of time consuming calculations. We compare results from this approach with results from other sampling methods, namely Markov chain Monte Carlo and MultiNest, obtaining results that range from comparably similar to arguably better. In particular, we augment our classifier method with a boosting technique that rapidly increases the efficiency within a few iterations. We show results from our methods applied to a toy model and the type II 2HDM, using 3 and 7 free parameters, respectively. The code used for this paper and instructions are publicly available on the web.  ( 2 min )
    OneRing: A Simple Method for Source-free Open-partial Domain Adaptation. (arXiv:2206.03600v2 [cs.CV] UPDATED)
    In this paper, we investigate Source-free Open-partial Domain Adaptation (SF-OPDA), which addresses the situation where there exist both domain and category shifts between source and target domains. Under the SF-OPDA setting, which aims to address data privacy concerns, the model cannot access source data anymore during target adaptation. We propose a novel training scheme to learn a (n+1)-way classifier to predict the n source classes and the unknown class, where samples of only known source categories are available for training. Furthermore, for target adaptation, we simply adopt a weighted entropy minimization to adapt the source pretrained model to the unlabeled target domain without source data. In experiments, we show our simple method surpasses current OPDA approaches which demand source data during adaptation. When augmented with a closed-set domain adaptation approach during target adaptation, our source-free method further outperforms the current state-of-the-art OPDA method by 2.5%, 7.2% and 13% on Office-31, Office-Home and VisDA respectively.  ( 2 min )
    ParkPredict+: Multimodal Intent and Motion Prediction for Vehicles in Parking Lots with CNN and Transformer. (arXiv:2204.10777v2 [cs.CV] UPDATED)
    The problem of multimodal intent and trajectory prediction for human-driven vehicles in parking lots is addressed in this paper. Using models designed with CNN and Transformer networks, we extract temporal-spatial and contextual information from trajectory history and local bird's eye view (BEV) semantic images, and generate predictions about intent distribution and future trajectory sequences. Our methods outperform existing models in accuracy, while allowing an arbitrary number of modes, encoding complex multi-agent scenarios, and adapting to different parking maps. To train and evaluate our method, we present the first public 4K video dataset of human driving in parking lots with accurate annotation, high frame rate, and rich traffic scenarios.  ( 2 min )
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v2 [stat.ML] UPDATED)
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.  ( 2 min )
    Learning Invariant Representations under General Interventions on the Response. (arXiv:2208.10027v2 [stat.ME] UPDATED)
    It has become increasingly common nowadays to collect observations of feature and response pairs from different environments. As a consequence, one has to apply learned predictors to data with a different distribution due to distribution shifts. One principled approach is to adopt the structural causal models to describe training and test models, following the invariance principle which says that the conditional distribution of the response given its predictors remains the same across environments. However, this principle might be violated in practical settings when the response is intervened. A natural question is whether it is still possible to identify other forms of invariance to facilitate prediction in unseen environments. To shed light on this challenging scenario, we introduce invariant matching property (IMP) which is an explicit relation to capture interventions through an additional feature. This leads to an alternative form of invariance that enables a unified treatment of general interventions on the response. We analyze the asymptotic generalization errors of our method under both the discrete and continuous environment settings, where the continuous case is handled by relating it to the semiparametric varying coefficient models. We present algorithms that show competitive performance compared to existing methods over various experimental settings including a COVID dataset.  ( 2 min )
    A learning theory for quantum photonic processors and beyond. (arXiv:2209.03075v2 [quant-ph] UPDATED)
    We consider the tasks of learning quantum states, measurements and channels generated by continuous-variable (CV) quantum circuits. This family of circuits is suited to describe optical quantum technologies and in particular it includes state-of-the-art photonic processors capable of showing quantum advantage. We define classes of functions that map classical variables, encoded into the CV circuit parameters, to outcome probabilities evaluated on those circuits. We then establish efficient learnability guarantees for such classes, by computing bounds on their pseudo-dimension or covering numbers, showing that CV quantum circuits can be learned with a sample complexity that scales polynomially with the circuit's size, i.e., the number of modes. Our results establish that CV circuits can be trained efficiently using a number of training samples that, unlike their finite-dimensional counterpart, does not scale with the circuit depth.  ( 2 min )
    Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization. (arXiv:2207.13676v2 [cs.LG] UPDATED)
    Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at https://github.com/google/vizier.  ( 2 min )
    Expected Frequency Matrices of Elections: Computation, Geometry, and Preference Learning. (arXiv:2205.07831v2 [cs.GT] UPDATED)
    We use the "map of elections" approach of Szufa et al. (AAMAS-2020) to analyze several well-known vote distributions. For each of them, we give an explicit formula or an efficient algorithm for computing its frequency matrix, which captures the probability that a given candidate appears in a given position in a sampled vote. We use these matrices to draw the "skeleton map" of distributions, evaluate its robustness, and analyze its properties. Finally, we develop a general and unified framework for learning the distribution of real-world preferences using the frequency matrices of established vote distributions.  ( 2 min )
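    As a small illustration of what a frequency matrix is, the sketch below estimates one empirically from sampled votes. It assumes votes are full rankings given as lists of candidate indices, and it uses uniformly random rankings (impartial culture) purely as an example distribution; the function names are hypothetical and the paper's explicit formulas are not reproduced here.

```python
import random

def frequency_matrix(votes, num_candidates):
    """Entry [c][p] is the fraction of votes ranking candidate c at position p."""
    counts = [[0] * num_candidates for _ in range(num_candidates)]
    for vote in votes:
        for position, candidate in enumerate(vote):
            counts[candidate][position] += 1
    n = len(votes)
    return [[c / n for c in row] for row in counts]

def sample_impartial_culture(num_votes, num_candidates, seed=0):
    """Draw each vote as a uniformly random ranking of the candidates."""
    rng = random.Random(seed)
    return [rng.sample(range(num_candidates), num_candidates) for _ in range(num_votes)]

m = 4
votes = sample_impartial_culture(20000, m)
freq = frequency_matrix(votes, m)

# An election where everyone casts the same vote gives a permutation matrix.
identical = frequency_matrix([[0, 1, 2]] * 5, 3)
```

    Every row and every column of such a matrix sums to 1, since each candidate occupies exactly one position in each vote and each position holds exactly one candidate. For uniformly random rankings each entry is close to 1/m.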
    Learning Neural Set Functions Under the Optimal Subset Oracle. (arXiv:2203.01693v3 [cs.LG] UPDATED)
    Learning neural set functions becomes increasingly more important in many applications like product recommendation and compound selection in AI-aided drug discovery. The majority of existing works study methodologies of set function learning under the function value oracle, which, however, requires expensive supervision signals. This renders it impractical for applications with only weak supervisions under the Optimal Subset (OS) oracle, the study of which is surprisingly overlooked. In this work, we present a principled yet practical maximum likelihood learning framework, termed as EquiVSet, that simultaneously meets the following desiderata of learning set functions under the OS oracle: i) permutation invariance of the set mass function being modeled; ii) permission of varying ground set; iii) minimum prior; and iv) scalability. The main components of our framework involve: an energy-based treatment of the set mass function, DeepSet-style architectures to handle permutation invariance, mean-field variational inference, and its amortized variants. Thanks to the elegant combination of these advanced architectures, empirical studies on three real-world applications (including Amazon product recommendation, set anomaly detection, and compound selection for virtual screening) demonstrate that EquiVSet outperforms the baselines by a large margin.  ( 2 min )
    Bitwidth Heterogeneous Federated Learning with Progressive Weight Dequantization. (arXiv:2202.11453v5 [cs.LG] UPDATED)
    In practical federated learning scenarios, the participating devices may have different bitwidths for computation and memory storage by design. However, despite the progress made in device-heterogeneous federated learning scenarios, the heterogeneity in the bitwidth specifications in the hardware has been mostly overlooked. We introduce a pragmatic FL scenario with bitwidth heterogeneity across the participating devices, dubbed as Bitwidth Heterogeneous Federated Learning (BHFL). BHFL brings in a new challenge, that the aggregation of model parameters with different bitwidths could result in severe performance degeneration, especially for high-bitwidth models. To tackle this problem, we propose ProWD framework, which has a trainable weight dequantizer at the central server that progressively reconstructs the low-bitwidth weights into higher bitwidth weights, and finally into full-precision weights. ProWD further selectively aggregates the model parameters to maximize the compatibility across bit-heterogeneous weights. We validate ProWD against relevant FL baselines on the benchmark datasets, using clients with varying bitwidths. Our ProWD largely outperforms the baseline FL algorithms as well as naive approaches (e.g. grouped averaging) under the proposed BHFL scenario.  ( 2 min )
    Padding Module: Learning the Padding in Deep Neural Networks. (arXiv:2301.04608v1 [cs.CV])
    During the last decades, many studies have been dedicated to improving the performance of neural networks, for example, the network architectures, initialization, and activation. However, investigating the importance and effects of learnable padding methods in deep learning remains relatively open. To mitigate the gap, this paper proposes a novel trainable Padding Module that can be placed in a deep learning model. The Padding Module can optimize itself without requiring or influencing the model's entire loss function. To train itself, the Padding Module constructs a ground truth and a predictor from the inputs by leveraging the underlying structure in the input data for supervision. As a result, the Padding Module can learn automatically to pad pixels to the border of its input images or feature maps. The padding contents are realistic extensions to its input data and simultaneously facilitate the deep learning model's downstream task. Experiments have shown that the proposed Padding Module outperforms the state-of-the-art competitors and the baseline methods. For example, the Padding Module has 1.23% and 0.44% more classification accuracy than the zero padding when tested on the VGG16 and ResNet50.  ( 2 min )
    When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. (arXiv:2206.13464v3 [cs.LG] UPDATED)
    Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.  ( 2 min )
    Reducing Exploitability with Population Based Training. (arXiv:2208.05083v3 [cs.LG] UPDATED)
    Self-play reinforcement learning has achieved state-of-the-art, and often superhuman, performance in a variety of zero-sum games. Yet prior work has found that policies that are highly capable against regular opponents can fail catastrophically against adversarial policies: an opponent trained explicitly against the victim. Prior defenses using adversarial training were able to make the victim robust to a specific adversary, but the victim remained vulnerable to new ones. We conjecture this limitation was due to insufficient diversity of adversaries seen during training. We analyze a defense using population based training to pit the victim against a diverse set of opponents. We evaluate this defense's robustness against new adversaries in two low-dimensional environments. This defense increases robustness against adversaries, as measured by the number of attacker training timesteps to exploit the victim. Furthermore, we show that robustness is correlated with the size of the opponent population.  ( 2 min )
    Investigating the Properties of Neural Network Representations in Reinforcement Learning. (arXiv:2203.15955v2 [cs.LG] UPDATED)
    In this paper we investigate the properties of representations learned by deep reinforcement learning systems. Much of the early work on representations for reinforcement learning focused on designing fixed-basis architectures to achieve properties thought to be desirable, such as orthogonality and sparsity. In contrast, the idea behind deep reinforcement learning methods is that the agent designer should not encode representational properties, but rather that the data stream should determine the properties of the representation -- good representations emerge under appropriate training schemes. In this paper we bring these two perspectives together, empirically investigating the properties of representations that support transfer in reinforcement learning. We introduce and measure six representational properties over more than 25 thousand agent-task settings. We consider Deep Q-learning agents with different auxiliary losses in a pixel-based navigation environment, with source and transfer tasks corresponding to different goal locations. We develop a method to better understand why some representations work better for transfer, through a systematic approach varying task similarity and measuring and correlating representation properties with transfer performance. We demonstrate the generality of the methodology by investigating representations learned by a Rainbow agent that successfully transfer across game modes in Atari 2600.  ( 2 min )
    Contextual Squeeze-and-Excitation for Efficient Few-Shot Image Classification. (arXiv:2206.09843v3 [cs.CV] UPDATED)
    Recent years have seen a growth in user-centric applications that require effective knowledge transfer across tasks in the low-data regime. An example is personalization, where a pretrained system is adapted by learning on small amounts of labeled data belonging to a specific user. This setting requires high accuracy under low computational complexity, therefore the Pareto frontier of accuracy vs. adaptation cost plays a crucial role. In this paper we push this Pareto frontier in the few-shot image classification setting with a key contribution: a new adaptive block called Contextual Squeeze-and-Excitation (CaSE) that adjusts a pretrained neural network on a new task to significantly improve performance with a single forward pass of the user data (context). We use meta-trained CaSE blocks to conditionally adapt the body of a network and a fine-tuning routine to adapt a linear head, defining a method called UpperCaSE. UpperCaSE achieves a new state-of-the-art accuracy relative to meta-learners on the 26 datasets of VTAB+MD and on a challenging real-world personalization benchmark (ORBIT), narrowing the gap with leading fine-tuning methods with the benefit of orders of magnitude lower adaptation cost.  ( 2 min )
    SnAKe: Bayesian Optimization with Pathwise Exploration. (arXiv:2202.00060v4 [cs.LG] UPDATED)
    Bayesian Optimization is a very effective tool for optimizing expensive black-box functions. Inspired by applications developing and characterizing reaction chemistry using droplet microfluidic reactors, we consider a novel setting where the expense of evaluating the function can increase significantly when making large input changes between iterations. We further assume we are working asynchronously, meaning we have to select new queries before evaluating previous experiments. This paper investigates the problem and introduces 'Sequential Bayesian Optimization via Adaptive Connecting Samples' (SnAKe), which provides a solution by considering large batches of queries and preemptively building optimization paths that minimize input costs. We investigate some convergence properties and empirically show that the algorithm is able to achieve regret similar to classical Bayesian Optimization algorithms in both synchronous and asynchronous settings, while reducing input costs significantly. We show the method is robust to the choice of its single hyper-parameter and provide a parameter-free alternative.  ( 2 min )
    Towards Backdoor Attacks and Defense in Robust Machine Learning Models. (arXiv:2003.00865v4 [cs.CV] UPDATED)
    The introduction of robust optimisation has pushed the state-of-the-art in defending against adversarial attacks. Notably, the state-of-the-art projected gradient descent (PGD)-based training method has been shown to be universally and reliably effective in defending against adversarial inputs. This robustness approach uses PGD as a reliable and universal "first-order adversary". However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we study how to inject and defend against backdoor attacks for robust models trained using PGD-based robust optimisation. We demonstrate that these models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representation of such models. Then, this observation is leveraged to detect such backdoor-infected models via a detection technique called AEGIS. Specifically, given a robust Deep Neural Network (DNN) that is trained using PGD-based first-order adversarial training approach, AEGIS uses feature clustering to effectively detect whether such DNNs are backdoor-infected or clean. In our evaluation of several visible and hidden backdoor triggers on major classification tasks using CIFAR-10, MNIST and FMNIST datasets, AEGIS effectively detects PGD-trained robust DNNs infected with backdoors. AEGIS detects such backdoor-infected models with 91.6% accuracy (11 out of 12 tested models), without any false positives. Furthermore, AEGIS detects the targeted class in the backdoor-infected model with a reasonably low (11.1%) false positive rate. Our investigation reveals that salient features of adversarially robust DNNs could be promising to break the stealthy nature of backdoor attacks.  ( 3 min )
    Towards a unified nonlocal, peridynamics framework for the coarse-graining of molecular dynamics data with fractures. (arXiv:2301.04540v1 [cond-mat.mtrl-sci])
    Molecular dynamics (MD) has served as a powerful tool for designing materials with reduced reliance on laboratory testing. However, the use of MD directly to treat the deformation and failure of materials at the mesoscale is still largely beyond reach. Herein, we propose a learning framework to extract a peridynamic model as a mesoscale continuum surrogate from MD simulated material fracture datasets. Firstly, we develop a novel coarse-graining method to automatically handle the material fracture and its corresponding discontinuities in the MD displacement dataset. Inspired by the Weighted Essentially Non-Oscillatory scheme, the key idea lies in an adaptive procedure that automatically chooses the locally smoothest stencil and then reconstructs the coarse-grained material displacement field as piecewise smooth solutions containing discontinuities. Then, based on the coarse-grained MD data, a two-phase optimization-based learning approach is proposed to infer the optimal peridynamics model with damage criterion. In the first phase, we identify the optimal nonlocal kernel function from datasets without material damage, to capture the material stiffness properties. Then, in the second phase, the material damage criterion is learnt as a smoothed step function from the data with fractures. As a result, a peridynamics surrogate is obtained. Our peridynamics surrogate model can be employed in further prediction tasks with grid resolutions different from those used in training, and hence allows for substantial reductions in computational cost compared with MD. We illustrate the efficacy of the proposed approach with several numerical tests for single layer graphene. Our tests show that the proposed data-driven model is robust and generalizable: it is capable of modeling the initialization and growth of fractures under discretization and loading settings that are different from the ones used during training.  ( 2 min )
    Interpretable Hidden Markov Model-Based Deep Reinforcement Learning Hierarchical Framework for Predictive Maintenance of Turbofan Engines. (arXiv:2206.13433v2 [cs.LG] UPDATED)
    An open research question in deep reinforcement learning is how to focus the policy learning of key decisions within a sparse domain. This paper emphasizes combining the advantages of input-output hidden Markov models and reinforcement learning towards interpretable maintenance decisions. We propose a novel hierarchical-modeling methodology that, at a high level, detects and interprets the root cause of a failure as well as the health degradation of the turbofan engine, while, at a low level, it provides the optimal replacement policy. It outperforms the baseline performance of deep reinforcement learning methods applied directly to the raw data or when using a hidden Markov model without such a specialized hierarchy. It also provides performance comparable to prior work, but with the additional benefit of interpretability.  ( 2 min )
    FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v4 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement of prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that -- given enough data -- FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA.  ( 2 min )
    Convex Analysis at Infinity: An Introduction to Astral Space. (arXiv:2205.03260v2 [math.OC] UPDATED)
    Not all convex functions on $\mathbb{R}^n$ have finite minimizers; some can only be minimized by a sequence as it heads to infinity. In this work, we aim to develop a theory for understanding such minimizers at infinity. We study astral space, a compact extension of $\mathbb{R}^n$ to which such points at infinity have been added. Astral space is constructed to be as small as possible while still ensuring that all linear functions can be continuously extended to the new space. Although astral space includes all of $\mathbb{R}^n$, it is not a vector space, nor even a metric space. However, it is sufficiently well-structured to allow useful and meaningful extensions of concepts of convexity, conjugacy, and subdifferentials. We develop these concepts and analyze various properties of convex functions on astral space, including the detailed structure of their minimizers, exact characterizations of continuity, and convergence of descent algorithms.  ( 2 min )
    Efficient Natural Gradient Descent Methods for Large-Scale PDE-Based Optimization Problems. (arXiv:2202.06236v3 [math.OC] UPDATED)
    We propose efficient numerical schemes for implementing the natural gradient descent (NGD) for a broad range of metric spaces with applications to PDE-based optimization problems. Our technique represents the natural gradient direction as a solution to a standard least-squares problem. Hence, instead of calculating, storing, or inverting the information matrix directly, we apply efficient methods from numerical linear algebra. We treat both scenarios where the Jacobian, i.e., the derivative of the state variable with respect to the parameter, is either explicitly known or implicitly given through constraints. We can thus reliably compute several natural NGDs for a large-scale parameter space. In particular, we are able to compute Wasserstein NGD in thousands of dimensions, which was believed to be out of reach. Finally, our numerical results shed light on the qualitative differences between the standard gradient descent and various NGD methods based on different metric spaces in nonconvex optimization problems.  ( 2 min )
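    The least-squares reformulation mentioned above can be sketched as follows. This is a minimal illustration, not the authors' scheme: it assumes the simplest case of an L2 metric on the state space, so that the information matrix is G = J^T J and the parameter gradient is J^T r for some state-space gradient r. The names `natural_gradient_direction`, `J`, and `r` are hypothetical.

```python
import numpy as np

def natural_gradient_direction(J, r):
    """Return eta minimizing ||J @ eta - r||_2.

    Under the stated assumption (G = J.T @ J, gradient = J.T @ r),
    this eta equals G^{-1} (J.T @ r) when J has full column rank,
    but G is never formed, stored, or inverted.
    """
    eta, *_ = np.linalg.lstsq(J, r, rcond=None)
    return eta

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
J = rng.standard_normal((50, 5))   # Jacobian of state w.r.t. parameters
r = rng.standard_normal(50)        # gradient of the loss w.r.t. the state

eta = natural_gradient_direction(J, r)
# Reference value obtained by explicitly forming the information matrix.
explicit = np.linalg.solve(J.T @ J, J.T @ r)
```

    In this toy setting both routes should agree; the least-squares route avoids squaring the condition number of J, which is one reason solvers from numerical linear algebra are attractive here.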
    Causal Discovery from Sparse Time-Series Data Using Echo State Network. (arXiv:2201.02933v2 [cs.LG] UPDATED)
    Causal discovery between collections of time-series data can help diagnose causes of symptoms and hopefully prevent faults before they occur. However, reliable causal discovery can be very challenging, especially when the data acquisition rate varies (i.e., non-uniform data sampling), or in the presence of missing data points (e.g., sparse data sampling). To address these issues, we propose a new system comprising two parts: the first fills missing data with Gaussian Process Regression, and the second leverages an Echo State Network, a type of reservoir computer (i.e., used for chaotic system modelling), for causal discovery. We evaluate the performance of our proposed system against three other off-the-shelf causal discovery algorithms, namely, structural expectation-maximization, sub-sampled linear auto-regression absolute coefficients, and multivariate Granger Causality with vector auto-regression, using the Tennessee Eastman chemical dataset. We report their corresponding Matthews Correlation Coefficient (MCC) and Receiver Operating Characteristic (ROC) curves and show that the proposed system outperforms existing algorithms, demonstrating the viability of our approach to discover causal relationships in a complex system with missing entries.  ( 2 min )
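    For readers unfamiliar with the Matthews Correlation Coefficient used as a metric above, here is a small sketch of its standard binary definition computed from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """Binary MCC from confusion-matrix counts.

    MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn)).
    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical counts: every predicted causal edge is right, or every one is wrong.
perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
```

    MCC ranges from -1 (total disagreement) to +1 (perfect agreement), which makes it informative when true causal edges are rare relative to non-edges.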
    "You Can't Fix What You Can't Measure": Privately Measuring Demographic Performance Disparities in Federated Learning. (arXiv:2206.12183v2 [cs.LG] UPDATED)
    As in traditional machine learning models, models trained with federated learning may exhibit disparate performance across demographic groups. Model holders must identify these disparities to mitigate undue harm to the groups. However, measuring a model's performance in a group requires access to information about group membership which, for privacy reasons, often has limited availability. We propose novel locally differentially private mechanisms to measure differences in performance across groups while protecting the privacy of group membership. To analyze the effectiveness of the mechanisms, we bound their error in estimating a disparity when optimized for a given privacy budget. Our results show that the error rapidly decreases for realistic numbers of participating clients, demonstrating that, contrary to what prior work suggested, protecting privacy is not necessarily in conflict with identifying performance disparities of federated models.  ( 2 min )
    On the Complexity of Computing Markov Perfect Equilibrium in General-Sum Stochastic Games. (arXiv:2109.01795v2 [cs.GT] UPDATED)
    Similar to the role of Markov decision processes in reinforcement learning, Stochastic Games (SGs) lay the foundation for the study of multi-agent reinforcement learning (MARL) and sequential agent interactions. In this paper, we derive that computing an approximate Markov Perfect Equilibrium (MPE) in a finite-state discounted Stochastic Game within the exponential precision is PPAD-complete. We adopt a function with a polynomially bounded description in the strategy space to convert the MPE computation to a fixed-point problem, even though the stochastic game may demand an exponential number of pure strategies, in the number of states, for each agent. The completeness result follows the reduction of the fixed-point problem to End of the Line. Our results indicate that finding an MPE in SGs is highly unlikely to be NP-hard unless NP = co-NP. Our work offers confidence for MARL research to study MPE computation on general-sum SGs and to develop fruitful algorithms as currently on zero-sum SGs.  ( 2 min )
    When does return-conditioned supervised learning work for offline reinforcement learning? (arXiv:2206.01079v3 [cs.LG] UPDATED)
    Several recent works have proposed a class of algorithms for the offline reinforcement learning (RL) problem that we will refer to as return-conditioned supervised learning (RCSL). RCSL algorithms learn the distribution of actions conditioned on both the state and the return of the trajectory. Then they define a policy by conditioning on achieving high return. In this paper, we provide a rigorous study of the capabilities and limitations of RCSL, something which is crucially missing in previous work. We find that RCSL returns the optimal policy under a set of assumptions that are stronger than those needed for the more traditional dynamic programming-based algorithms. We provide specific examples of MDPs and datasets that illustrate the necessity of these assumptions and the limits of RCSL. Finally, we present empirical evidence that these limitations will also cause issues in practice by providing illustrative experiments in simple point-mass environments and on datasets from the D4RL benchmark.  ( 2 min )
    EINNs: Epidemiologically-informed Neural Networks. (arXiv:2202.10446v2 [cs.LG] UPDATED)
    We introduce EINNs, a framework crafted for epidemic forecasting that builds upon the theoretical grounds provided by mechanistic models as well as the data-driven expressibility afforded by AI models, and their capabilities to ingest heterogeneous information. Although neural forecasting models have been successful in multiple tasks, predictions well-correlated with epidemic trends and long-term predictions remain open challenges. Epidemiological ODE models contain mechanisms that can guide us in these two tasks; however, they have limited capability of ingesting data sources and modeling composite signals. Thus, we propose to leverage work in physics-informed neural networks to learn latent epidemic dynamics and transfer relevant knowledge to another neural network which ingests multiple data sources and has more appropriate inductive bias. In contrast with previous work, we do not assume the observability of complete dynamics and do not need to numerically solve the ODE equations during training. Our thorough experiments on all US states and HHS regions for COVID-19 and influenza forecasting showcase the clear benefits of our approach in both short-term and long-term forecasting as well as in learning the mechanistic dynamics over other non-trivial alternatives.  ( 2 min )
    Label-Efficient Self-Supervised Federated Learning for Tackling Data Heterogeneity in Medical Imaging. (arXiv:2205.08576v2 [cs.CV] UPDATED)
    The collection and curation of large-scale medical datasets from multiple institutions is essential for training accurate deep learning models, but privacy concerns often hinder data sharing. Federated learning (FL) is a promising solution that enables privacy-preserving collaborative learning among different institutions, but it generally suffers from performance deterioration due to heterogeneous data distributions and a lack of quality labeled data. In this paper, we present a robust and label-efficient self-supervised FL framework for medical image analysis. Our method introduces a novel Transformer-based self-supervised pre-training paradigm that pre-trains models directly on decentralized target task datasets using masked image modeling, to facilitate more robust representation learning on heterogeneous data and effective knowledge transfer to downstream models. Extensive empirical results on simulated and real-world medical imaging non-IID federated datasets show that masked image modeling with Transformers significantly improves the robustness of models against various degrees of data heterogeneity. Notably, under severe data heterogeneity, our method, without relying on any additional pre-training data, achieves an improvement of 5.06%, 1.53% and 4.58% in test accuracy on retinal, dermatology and chest X-ray classification compared to the supervised baseline with ImageNet pre-training. In addition, we show that our federated self-supervised pre-training methods yield models that generalize better to out-of-distribution data and perform more effectively when fine-tuning with limited labeled data, compared to existing FL algorithms. The code is available at https://github.com/rui-yan/SSL-FL.  ( 2 min )
    Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity. (arXiv:2203.11572v2 [cs.LG] UPDATED)
    Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.  ( 2 min )
    Convex Surrogate Loss Functions for Contextual Pricing with Transaction Data. (arXiv:2202.10944v2 [cs.LG] UPDATED)
    We study an off-policy contextual pricing problem where the seller has access to samples of prices that customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the previous pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this setting that can be directly optimized to find an effective pricing policy with expected revenue guarantees, without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We propose generalized hinge and quantile pricing loss functions that price at a multiplicative factor of the conditional expected valuation or a particular quantile of the prices that sold, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and we provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence. (arXiv:2105.11066v4 [cs.LG] UPDATED)
    Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need to encourage exploration and to ensure certain structural properties of the learned policy due to safety, resource and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer. Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (arXiv:2102.00135), our algorithm accommodates a general class of convex regularizers and promotes the use of a Bregman divergence that is cognizant of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.  ( 2 min )
    Pruning Compact ConvNets for Efficient Inference. (arXiv:2301.04502v1 [cs.CV])
    Neural network pruning is frequently used to compress over-parameterized networks by large amounts, while incurring only marginal drops in generalization performance. However, the impact of pruning on networks that have been highly optimized for efficient inference has not received the same level of attention. In this paper, we analyze the effect of pruning for computer vision, and study state-of-the-art ConvNets, such as the FBNetV3 family of models. We show that model pruning approaches can be used to further optimize networks trained through NAS (Neural Architecture Search). The resulting family of pruned models can consistently obtain better performance than existing FBNetV3 models at the same level of computation, and thus provide state-of-the-art results when trading off between computational complexity and generalization performance on the ImageNet benchmark. In addition to better generalization performance, we also demonstrate that when limited computation resources are available, pruning FBNetV3 models incurs only a fraction of the GPU-hours involved in running a full-scale NAS.  ( 2 min )
    SemSup: Semantic Supervision for Simple and Scalable Zero-shot Generalization. (arXiv:2202.13100v3 [cs.LG] UPDATED)
    Zero-shot learning is the problem of predicting instances over classes not seen during training. One approach to zero-shot learning is providing auxiliary class information to the model. Prior works along this vein have largely used expensive per-instance annotation or singular class-level descriptions, but per-instance descriptions are hard to scale and single class descriptions may not be rich enough. Furthermore, these works have used natural-language descriptions exclusively, simple bi-encoder models, and modality- or task-specific methods. These approaches have several limitations: text supervision may not always be available or optimal, and bi-encoders may only learn coarse relations between inputs and class descriptions. In this work, we present SemSup, a novel approach that uses (1) a scalable multiple description sampling method which improves performance over single descriptions, (2) alternative description formats such as JSON that are easy to generate and outperform text on certain settings, and (3) hybrid lexical-semantic similarity to leverage fine-grained information in class descriptions. We demonstrate the effectiveness of SemSup across four datasets, two modalities, and three generalization settings. For example, across text and image datasets, SemSup increases unseen class generalization accuracy by 15 points on average compared to the closest baseline.  ( 2 min )
    DA-MUSIC: Data-Driven DoA Estimation via Deep Augmented MUSIC Algorithm. (arXiv:2109.10581v5 [eess.SP] UPDATED)
    Direction of arrival (DoA) estimation of multiple signals is pivotal in sensor array signal processing. A popular multi-signal DoA estimation method is the multiple signal classification (MUSIC) algorithm, which enables high-performance super-resolution DoA recovery while being highly applicable in practice. MUSIC is a model-based algorithm, relying on an accurate mathematical description of the relationship between the signals and the measurements and assumptions on the signals themselves (non-coherent, narrowband sources). As such, it is sensitive to model imperfections. In this work we propose to overcome these limitations of MUSIC by augmenting the algorithm with specifically designed neural architectures. Our proposed deep augmented MUSIC (DA-MUSIC) algorithm is thus a hybrid model-based/data-driven DoA estimator, which leverages data to improve performance and robustness while preserving the interpretable flow of the classic method. DA-MUSIC is shown to learn to overcome limitations of the purely model-based method, such as its inability to successfully localize coherent sources as well as estimate the number of coherent signal sources present. We further demonstrate the superior resolution of the DA-MUSIC algorithm in synthetic narrowband and broadband scenarios as well as with real-world data of DoA estimation from seismic signals.  ( 2 min )
    Self-Supervised Learning for Biological Sample Localization in 3D Tomographic Images. (arXiv:2011.03353v2 [cs.CV] UPDATED)
    In synchrotron-based Computed Tomography (CT) there is a trade-off between spatial resolution, field of view and speed of positioning and alignment of samples. The problem is even more prominent for high-throughput tomography, an automated setup capable of scanning large batches of samples without human interaction. As a result, in many applications, only 20-30% of the reconstructed volume contains the actual sample. Such data redundancy clutters the storage and increases processing time. Hence, automated sample localization becomes an important practical problem. In this work, we describe two self-supervised losses designed for biological CT. We further demonstrate how to employ the uncertainty estimation for sample localization. This approach shows the ability to localize a sample with less than 1.5% relative error and reduce the used storage by a factor of four. We also show that one of the proposed losses works reasonably well as a pre-training task for semantic segmentation.  ( 2 min )
    Assessing the Early Bird Heuristic (for Predicting Project Quality). (arXiv:2105.11082v4 [cs.SE] UPDATED)
    Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 projects, where we find that the information in those projects "clumps" towards the earliest parts of the project. A quality prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this "early bird" data, we can build models very quickly and very early in the project life cycle. Moreover, using this early bird method, we have shown that a simple model (with just a few features) generalizes to hundreds of projects. Based on this experience, we suspect that prior work on generalizing quality models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited since its conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are available here: https://github.com/snaraya7/early-bird  ( 2 min )
    Adversarial training with informed data selection. (arXiv:2301.04472v1 [cs.LG])
    With the increasing amount of available data and advances in computing capabilities, deep neural networks (DNNs) have been successfully employed to solve challenging tasks in various areas, including healthcare, climate, and finance. Nevertheless, state-of-the-art DNNs are susceptible to quasi-imperceptible perturbed versions of the original images -- adversarial examples. These perturbations of the network input can lead to disastrous implications in critical areas where wrong decisions can directly affect human lives. Adversarial training is the most efficient solution to defend the network against these malicious attacks. However, adversarially trained networks generally come with lower clean accuracy and higher computational complexity. This work proposes a data selection (DS) strategy to be applied in the mini-batch training. Based on the cross-entropy loss, the most relevant samples in the batch are selected to update the model parameters in the backpropagation. The simulation results show that a good compromise can be obtained regarding robustness and standard accuracy, whereas the computational complexity of the backpropagation pass is reduced.  ( 2 min )
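    The loss-based selection step described above might look roughly like the sketch below. The abstract does not say which samples count as "most relevant"; this sketch assumes, purely for illustration, that they are the k samples with the largest cross-entropy loss. The function name `select_by_loss` and the toy logits are hypothetical.

```python
import numpy as np

def select_by_loss(logits, labels, k):
    """Return (indices of the k highest-loss samples, per-sample losses).

    ASSUMPTION: "most relevant" is taken to mean "largest cross-entropy
    loss"; the paper may use a different criterion.
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross-entropy of each sample against its integer label.
    losses = -log_probs[np.arange(len(labels)), labels]
    # Sort by descending loss and keep the first k indices.
    order = np.argsort(-losses)
    return order[:k], losses

# Toy mini-batch of four samples and two classes.
logits = np.array([[4.0, 0.0], [0.0, 4.0], [0.0, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1, 1])
selected, losses = select_by_loss(logits, labels, k=2)
```

    Only the rows in `selected` would then take part in the backward pass, which is where the reduction in backpropagation cost would come from.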
    Perceive and predict: self-supervised speech representation based loss functions for speech enhancement. (arXiv:2301.04388v1 [cs.SD])
    Recent work in the domain of speech enhancement has explored the use of self-supervised speech representations to aid in the training of neural speech enhancement models. However, much of this work focuses on using the deepest or final outputs of self-supervised speech representation models, rather than the earlier feature encodings. The use of self-supervised representations in such a way is often not fully motivated. In this work it is shown that the distance between the feature encodings of clean and noisy speech correlates strongly with psychoacoustically motivated measures of speech quality and intelligibility, as well as with human Mean Opinion Score (MOS) ratings. Experiments using this distance as a loss function are performed, and improved performance over the use of STFT spectrogram distance based loss as well as other common loss functions from the speech enhancement literature is demonstrated using objective measures such as perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).  ( 2 min )
    A Stochastic Optimization Framework for Fair Risk Minimization. (arXiv:2102.12586v4 [cs.LG] UPDATED)
    Despite the success of large-scale empirical risk minimization (ERM) at achieving high accuracy across a variety of machine learning tasks, fair ERM is hindered by the incompatibility of fairness constraints with stochastic optimization. We consider the problem of fair classification with discrete sensitive attributes and potentially large models and data sets, requiring stochastic solvers. Existing in-processing fairness algorithms are either impractical in the large-scale setting because they require large batches of data at each iteration or they are not guaranteed to converge. In this paper, we develop the first stochastic in-processing fairness algorithm with guaranteed convergence. For demographic parity, equalized odds, and equal opportunity notions of fairness, we provide slight variations of our algorithm--called FERMI--and prove that each of these variations converges in stochastic optimization with any batch size. Empirically, we show that FERMI is amenable to stochastic solvers with multiple (non-binary) sensitive attributes and non-binary targets, performing well even with minibatch size as small as one. Extensive experiments show that FERMI achieves the most favorable tradeoffs between fairness violation and test accuracy across all tested setups compared with state-of-the-art baselines for demographic parity, equalized odds, and equal opportunity. These benefits are especially significant with small batch sizes and for non-binary classification with a large number of sensitive attributes, making FERMI a practical fairness algorithm for large-scale problems.  ( 2 min )
    Trajectory Modeling via Random Utility Inverse Reinforcement Learning. (arXiv:2105.12092v2 [cs.AI] UPDATED)
    We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. Cars are detected by sensors placed at sparsely distributed points on the street network of a city. As rational agents, drivers try to maximize some reward function unknown to an external observer. We apply the concept of random utility from econometrics to model the unknown reward function as a function of observed and unobserved features. In contrast to current inverse reinforcement learning approaches, we do not assume that agents act according to a stochastic policy; rather, we assume that agents act according to a deterministic optimal policy and show that randomness in data arises because the exact rewards are not fully observed by an external observer. We introduce the concept of extended state to cope with unobserved features and develop a Markov decision process formulation of drivers' decisions. We present theoretical results which guarantee the existence of solutions and show that maximum entropy inverse reinforcement learning is a particular case of our approach. Finally, we illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city in Brazil.  ( 2 min )
    Continual Few-Shot Learning Using HyperTransformers. (arXiv:2301.04584v1 [cs.LG])
    We focus on the problem of learning without forgetting from multiple tasks arriving sequentially, where each task is defined using a few-shot episode of novel or already seen classes. We approach this problem using the recently published HyperTransformer (HT), a Transformer-based hypernetwork that generates specialized task-specific CNN weights directly from the support set. In order to learn from a continual sequence of tasks, we propose to recursively re-use the generated weights as input to the HT for the next task. This way, the generated CNN weights themselves act as a representation of previously learned tasks, and the HT is trained to update these weights so that the new task can be learned without forgetting past tasks. This approach is different from most continual learning algorithms, which typically rely on replay buffers, weight regularization or task-dependent architectural changes. We demonstrate that our proposed Continual HyperTransformer method equipped with a prototypical loss is capable of learning and retaining knowledge about past tasks for a variety of scenarios, including learning from mini-batches, and task-incremental and class-incremental learning scenarios.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v4 [stat.ML] UPDATED)
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fair AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed, as is done in generative adversarial network type algorithms. The choice of a discriminator, however, is made heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of the representation and the fairness of the prediction model built on top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive with or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Real-time simulation of viscoelastic tissue behavior with physics-guided deep learning. (arXiv:2301.04614v1 [cs.LG])
    Finite element methods (FEM) are popular approaches for simulation of soft tissues with elastic or viscoelastic behavior. However, their usage in real-time applications, such as in virtual reality surgical training, is limited by computational cost. In this application scenario, which typically involves transportable simulators, the computing hardware severely constrains the size or the level of detail of the simulated scene. To address this limitation, data-driven approaches have been suggested to simulate mechanical deformations by learning the mapping rules from FEM-generated datasets. Herein, we propose a deep learning method for predicting displacement fields of soft tissues with viscoelastic properties. The main contribution of this work is the use of a physics-guided loss function for the optimization of the deep learning model parameters. The proposed deep learning model is based on convolutional (CNN) and recurrent layers (LSTM) to predict spatiotemporal variations. It is augmented with a mass conservation law in the loss function to prevent the generation of physically inconsistent results. The deep learning model is trained on a set of FEM datasets that are generated from a commercially available state-of-the-art numerical neurosurgery simulator. The use of the physics-guided loss function in a deep learning model has led to better generalization in the prediction of deformations in unseen simulation cases. Moreover, the proposed method achieves better accuracy than conventional CNN models, where improvements were observed in unseen tissue from 8% to 30% depending on the magnitude of external forces. It is hoped that the present investigation will help in filling the gap in applying deep learning in virtual reality simulators, hence improving their computational performance (compared to FEM simulations) and ultimately their usefulness.  ( 3 min )
    A Distinct Unsupervised Reference Model From The Environment Helps Continual Learning. (arXiv:2301.04506v1 [cs.LG])
    The existing continual learning methods are mainly focused on fully-supervised scenarios and are still not able to take advantage of unlabeled data available in the environment. Some recent works tried to investigate semi-supervised continual learning (SSCL) settings in which the unlabeled data are available, but it is only from the same distribution as the labeled data. This assumption is still not general enough for real-world applications and restricts the utilization of unsupervised data. In this work, we introduce Open-Set Semi-Supervised Continual Learning (OSSCL), a more realistic semi-supervised continual learning setting in which out-of-distribution (OoD) unlabeled samples in the environment are assumed to coexist with the in-distribution ones. Under this configuration, we present a model with two distinct parts: (i) the reference network captures general-purpose and task-agnostic knowledge in the environment by using a broad spectrum of unlabeled samples, (ii) the learner network is designed to learn task-specific representations by exploiting supervised samples. The reference model both provides a pivotal representation space and also segregates unlabeled data to exploit them more efficiently. By performing a diverse range of experiments, we show the superior performance of our model compared with other competitors and prove the effectiveness of each component of the proposed model.  ( 2 min )
    Exploring the Latent Space of Autoencoders with Interventional Assays. (arXiv:2106.16091v4 [cs.LG] UPDATED)
    Autoencoders exhibit impressive abilities to embed the data manifold into a low-dimensional latent space, making them a staple of representation learning methods. However, without explicit supervision, which is often unavailable, the representation is usually uninterpretable, making analysis and principled progress challenging. We propose a framework, called latent responses, which exploits the locally contractive behavior exhibited by variational autoencoders to explore the learned manifold. More specifically, we develop tools to probe the representation using interventions in the latent space to quantify the relationships between latent variables. We extend the notion of disentanglement to take the learned generative process into account and consequently avoid the limitations of existing metrics that may rely on spurious correlations. Our analyses underscore the importance of studying the causal structure of the representation to improve performance on downstream tasks such as generation, interpolation, and inference of the factors of variation.  ( 2 min )
    Uncertainty Estimation based on Geometric Separation. (arXiv:2301.04452v1 [cs.LG])
    In machine learning, accurately predicting the probability that a specific input is correct is crucial for risk management. This process, known as uncertainty (or confidence) estimation, is particularly important in mission-critical applications such as autonomous driving. In this work, we put forward a novel geometric-based approach for improving uncertainty estimations in machine learning models. Our approach involves using the geometric distance of the current input from existing training inputs as a signal for estimating uncertainty, and then calibrating this signal using standard post-hoc techniques. We demonstrate that our method leads to more accurate uncertainty estimations than recently proposed approaches through extensive evaluation on a variety of datasets and models. Additionally, we optimize our approach so that it can be implemented on large datasets in near real-time applications, making it suitable for time-sensitive scenarios.  ( 2 min )
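    The distance signal described above can be sketched in a few lines. This is only an illustration under assumptions: the abstract does not give the exact definition of the geometric signal, so the sketch uses the Euclidean distance to the nearest training input, and the function name `nearest_distance` is hypothetical. The post-hoc calibration step is left out.

```python
import math


def nearest_distance(x, train_inputs):
    """Euclidean distance from x to its closest training input.

    Assumption: a larger distance is read as higher uncertainty. The paper's
    actual signal and its calibration step may differ from this sketch.
    """
    return min(math.dist(x, t) for t in train_inputs)


train_inputs = [(3.0, 4.0), (6.0, 8.0)]
near = nearest_distance((3.0, 4.5), train_inputs)   # close to a training point
far = nearest_distance((0.0, 0.0), train_inputs)    # farther from all training points
print(near, far)
```

    In a full pipeline the raw distance would then be mapped to a probability by a standard post-hoc calibrator fitted on held-out data.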
    Speech Driven Video Editing via an Audio-Conditioned Diffusion Model. (arXiv:2301.04474v1 [cs.CV])
    In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model with audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, achieving a word error rate of 45% using an off the shelf lip reading model. We further demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.  ( 2 min )
    Federated Learning under Heterogeneous and Correlated Client Availability. (arXiv:2301.04632v1 [cs.LG])
    The enormous amount of data produced by mobile and IoT devices has motivated the development of federated learning (FL), a framework allowing such devices (or clients) to collaboratively train machine learning models without sharing their local data. FL algorithms (like FedAvg) iteratively aggregate model updates computed by clients on their own datasets. Clients may exhibit different levels of participation, often correlated over time and with other clients. This paper presents the first convergence analysis for a FedAvg-like FL algorithm under heterogeneous and correlated client availability. Our analysis highlights how correlation adversely affects the algorithm's convergence rate and how the aggregation strategy can alleviate this effect at the cost of steering training toward a biased model. Guided by the theoretical analysis, we propose CA-Fed, a new FL algorithm that tries to balance the conflicting goals of maximizing convergence speed and minimizing model bias. To this purpose, CA-Fed dynamically adapts the weight given to each client and may ignore clients with low availability and large correlation. Our experimental results show that CA-Fed achieves higher time-average accuracy and a lower standard deviation than state-of-the-art AdaFed and F3AST, both on synthetic and real datasets.  ( 2 min )
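    The aggregation step that FedAvg-like algorithms repeat each round can be sketched as a weighted average of client parameter vectors. This is a minimal illustration, not CA-Fed itself: the abstract does not state how CA-Fed computes its weights, so the weights below are hypothetical placeholders, and setting a weight to zero stands in for ignoring a client.

```python
def aggregate(updates, weights):
    """Weighted average of client parameter vectors (FedAvg-style aggregation).

    updates: list of equal-length parameter lists, one per client.
    weights: non-negative per-client weights (hypothetical values here).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one client must have positive weight")
    dim = len(updates[0])
    return [
        sum(w * u[i] for u, w in zip(updates, weights)) / total
        for i in range(dim)
    ]


updates = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
uniform = aggregate(updates, [1, 1, 1])      # every client counts equally
skip_third = aggregate(updates, [1, 1, 0])   # third client is ignored
print(uniform, skip_third)
```

    Dropping a client changes the average, which is the bias the abstract refers to: the aggregate moves toward the clients that remain.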
    Determinate Node Selection for Semi-supervised Classification Oriented Graph Convolutional Networks. (arXiv:2301.04381v1 [cs.LG])
    Graph Convolutional Networks (GCNs) have been proved successful in the field of semi-supervised node classification by extracting structural information from graph data. However, the random selection of labeled nodes used by GCNs may lead to unstable generalization performance of GCNs. In this paper, we propose an efficient method for the deterministic selection of labeled nodes: the Determinate Node Selection (DNS) algorithm. The DNS algorithm identifies two categories of representative nodes in the graph: typical nodes and divergent nodes. These labeled nodes are selected by exploring the structure of the graph and determining the ability of the nodes to represent the distribution of data within the graph. The DNS algorithm can be applied quite simply on a wide range of semi-supervised graph neural network models for node classification tasks. Through extensive experimentation, we have demonstrated that the incorporation of the DNS algorithm leads to a remarkable improvement in the average accuracy of the model and a significant decrease in the standard deviation, as compared to the original method.  ( 2 min )
    Exploring the Approximation Capabilities of Multiplicative Neural Networks for Smooth Functions. (arXiv:2301.04605v1 [cs.LG])
    Multiplication layers are a key component in various influential neural network modules, including self-attention and hypernetwork layers. In this paper, we investigate the approximation capabilities of deep neural networks with intermediate neurons connected by simple multiplication operations. We consider two classes of target functions: generalized bandlimited functions, which are frequently used to model real-world signals with finite bandwidth, and Sobolev-Type balls, which are embedded in the Sobolev Space $\mathcal{W}^{r,2}$. Our results demonstrate that multiplicative neural networks can approximate these functions with significantly fewer layers and neurons compared to standard ReLU neural networks, with respect to both input dimension and approximation error. These findings suggest that multiplicative gates can outperform standard feed-forward layers and have potential for improving neural network design.  ( 2 min )
    Dynamics of a data-driven low-dimensional model of turbulent minimal Couette flow. (arXiv:2301.04638v1 [physics.flu-dyn])
    Because the Navier-Stokes equations are dissipative, the long-time dynamics of a flow in state space are expected to collapse onto a manifold whose dimension may be much lower than the dimension required for a resolved simulation. On this manifold, the state of the system can be exactly described in a coordinate system parameterizing the manifold. Describing the system in this low-dimensional coordinate system allows for much faster simulations and analysis. We show, for turbulent Couette flow, that this description of the dynamics is possible using a data-driven manifold dynamics modeling method. This approach consists of an autoencoder to find a low-dimensional manifold coordinate system and a set of ordinary differential equations defined by a neural network. Specifically, we apply this method to minimal flow unit turbulent plane Couette flow at $\textit{Re}=400$, where a fully resolved solution requires $\mathcal{O}(10^5)$ degrees of freedom. Using only data from this simulation we build models with fewer than $20$ degrees of freedom that quantitatively capture key characteristics of the flow, including the streak breakdown and regeneration cycle. At short times, the models track the true trajectory for multiple Lyapunov times, and, at long times, the models capture the Reynolds stress and the energy balance. For comparison, we show that the models outperform POD-Galerkin models with $\sim$2000 degrees of freedom. Finally, we compute unstable periodic orbits from the models. Many of these closely resemble previously computed orbits for the full system; additionally, we find nine orbits that correspond to previously unknown solutions in the full system.  ( 2 min )
    Fast conformational clustering of extensive molecular dynamics simulation data. (arXiv:2301.04492v1 [physics.chem-ph])
    We present an unsupervised data processing workflow that is specifically designed to obtain a fast conformational clustering of long molecular dynamics simulation trajectories. In this approach we combine two dimensionality reduction algorithms (cc_analysis and encodermap) with a density-based spatial clustering algorithm (HDBSCAN). The proposed scheme benefits from the strengths of the three algorithms while avoiding most of the drawbacks of the individual methods. Here the cc_analysis algorithm is applied to molecular simulation data for the first time. Encodermap complements cc_analysis by providing an efficient way to process and assign large amounts of data to clusters. The main goal of the procedure is to maximize the number of assigned frames of a given trajectory, while keeping a clear conformational identity of the clusters that are found. In practice we achieve this by using an iterative clustering approach and a tunable root-mean-square-deviation-based criterion in the final cluster assignment. This makes it possible to find clusters of different densities as well as different degrees of structural identity. With the help of four test systems we illustrate the capability and performance of this clustering workflow: wild-type and thermostable mutant of the Trp-cage protein (TC5b and TC10b), NTL9 and Protein B. Each of these systems poses individual challenges to the scheme, which together give an overview of the advantages, as well as the potential difficulties, that can arise when using the proposed method.  ( 3 min )
    Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering. (arXiv:2301.04366v1 [cs.CL])
    We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively, over a no-pre-training baseline.  ( 2 min )
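    To make the reported numbers concrete, the sketch below shows how mean reciprocal rank (MRR) and a relative gain over a baseline are usually computed. The ranks and scores are made-up illustrations, not results from the paper, and the function names are hypothetical.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR: average of 1/rank of the first relevant passage for each question.

    Ranks are 1-based; every question is assumed to have a relevant passage.
    """
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)


def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


mrr = mean_reciprocal_rank([1, 2, 4])   # illustrative ranks for three questions
gain = relative_gain(0.545, 0.5)        # illustrative scores, about a 9% relative gain
print(mrr, gain)
```

    A relative gain of 9% therefore means the new score is 1.09 times the baseline, not nine points higher on an absolute scale.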
    Federated Learning and Blockchain-enabled Fog-IoT Platform for Wearables in Predictive Healthcare. (arXiv:2301.04511v1 [cs.LG])
    Over the years, the popularity and usage of wearable Internet of Things (IoT) devices in several healthcare services have increased. Among the services that benefit from the usage of such devices is predictive analysis, which can improve early diagnosis in e-health. However, due to the limitations of wearable IoT devices, challenges in data privacy, service integrity, and network structure adaptability have arisen. To address these concerns, we propose a platform using federated learning and private blockchain technology within a fog-IoT network. These technologies have privacy-preserving features that secure data within the network. We utilized the fog-IoT network's distributive structure to create an adaptive network for wearable IoT devices. We designed a testbed to examine the proposed platform's ability to preserve the integrity of a classifier. According to experimental results, the introduced implementation can effectively preserve a patient's privacy and a predictive service's integrity. We further investigated the contributions of other technologies to the security and adaptability of the IoT network. Overall, we demonstrated the feasibility of our platform in addressing significant security and privacy challenges of wearable IoT devices in predictive healthcare through analysis, simulation, and experimentation.  ( 2 min )
    BINN: A deep learning approach for computational mechanics problems based on boundary integral equations. (arXiv:2301.04480v1 [cs.LG])
    We propose boundary-integral type neural networks (BINN) for boundary value problems in computational mechanics. The boundary integral equations are employed to transfer all the unknowns to the boundary; the unknowns are then approximated using neural networks and solved through a training process. The loss function is chosen as the residuals of the boundary integral equations. Regularization techniques are adopted to efficiently evaluate the weakly singular and Cauchy principal value integrals in the boundary integral equations. Potential problems and elastostatic problems are the main focus of this article as a demonstration. The proposed method has several notable advantages: First, the dimension of the original problem is reduced by one, so the number of degrees of freedom is greatly reduced. Second, the proposed method does not require any extra treatment to introduce the boundary conditions, since they are naturally accounted for through the boundary integral equations. Therefore, the method is suitable for complex geometries. Third, BINN is suitable for problems on infinite or semi-infinite domains. Moreover, BINN can easily handle heterogeneous problems with a single neural network without domain decomposition.  ( 2 min )
    A prediction and behavioural analysis of machine learning methods for modelling travel mode choice. (arXiv:2301.04404v1 [cs.LG])
    The emergence of a variety of Machine Learning (ML) approaches for travel mode choice prediction poses an interesting question to transport modellers: which models should be used for which applications? The answer to this question goes beyond simple predictive performance, and is instead a balance of many factors, including behavioural interpretability and explainability, computational complexity, and data efficiency. There is a growing body of research which attempts to compare the predictive performance of different ML classifiers with classical random utility models. However, existing studies typically analyse only the disaggregate predictive performance, ignoring other aspects affecting model choice. Furthermore, many studies are affected by technical limitations, such as the use of inappropriate validation schemes, incorrect sampling for hierarchical data, lack of external validation, and the exclusive use of discrete metrics. We address these limitations by conducting a systematic comparison of different modelling approaches, across multiple modelling problems, in terms of the key factors likely to affect model choice (out-of-sample predictive performance, accuracy of predicted market shares, extraction of behavioural indicators, and computational efficiency). We combine several real world datasets with synthetic datasets, where the data generation function is known. The results indicate that the models with the highest disaggregate predictive performance (namely extreme gradient boosting and random forests) provide poorer estimates of behavioural indicators and aggregate mode shares, and are more expensive to estimate, than other models, including deep neural networks and Multinomial Logit (MNL). It is further observed that the MNL model performs robustly in a variety of situations, though ML techniques can improve the estimates of behavioural indices such as Willingness to Pay.  ( 2 min )
    Rethinking complex-valued deep neural networks for monaural speech enhancement. (arXiv:2301.04320v1 [cs.SD])
    Despite multiple efforts made towards adopting complex-valued deep neural networks (DNNs), it remains an open question whether complex-valued DNNs are generally more effective than real-valued DNNs for monaural speech enhancement. This work is devoted to presenting a critical assessment by systematically examining complex-valued DNNs against their real-valued counterparts. Specifically, we investigate complex-valued DNN atomic units, including linear layers, convolutional layers, long short-term memory (LSTM), and gated linear units. By comparing complex- and real-valued versions of fundamental building blocks in the recently developed gated convolutional recurrent network (GCRN), we show how different mechanisms for basic blocks affect the performance. We also find that the use of complex-valued operations hinders the model capacity when the model size is small. In addition, we examine two recent complex-valued DNNs, i.e. deep complex convolutional recurrent network (DCCRN) and deep complex U-Net (DCUNET). Evaluation results show that both DNNs produce identical performance to their real-valued counterparts while requiring much more computation. Based on these comprehensive comparisons, we conclude that complex-valued DNNs do not provide a performance gain over their real-valued counterparts for monaural speech enhancement, and thus are less desirable due to their higher computational costs.  ( 2 min )
    On the functional form of the radial acceleration relation. (arXiv:2301.04368v1 [astro-ph.GA])
    We apply a new method for learning equations from data -- Exhaustive Symbolic Regression (ESR) -- to late-type galaxy dynamics as encapsulated in the radial acceleration relation (RAR). Relating the centripetal acceleration due to baryons, $g_\text{bar}$, to the total dynamical acceleration, $g_\text{obs}$, the RAR has been claimed to manifest a new law of nature due to its regularity and tightness, in agreement with Modified Newtonian Dynamics (MOND). Fits to this relation have been restricted by prior expectations to particular functional forms, while ESR affords an exhaustive and nearly prior-free search through functional parameter space to identify the equations optimally trading accuracy with simplicity. Working with the SPARC data, we find the best functions typically satisfy $g_\text{obs} \propto g_\text{bar}$ at high $g_\text{bar}$, although the coefficient of proportionality is not clearly unity and the deep-MOND limit $g_\text{obs} \propto \sqrt{g_\text{bar}}$ as $g_\text{bar} \to 0$ is little evident at all. By generating mock data according to MOND with or without the external field effect, we find that symbolic regression would not be expected to identify the generating function or reconstruct successfully the asymptotic slopes. We conclude that the limited dynamical range and significant uncertainties of the SPARC RAR preclude a definitive statement of its functional form, and hence that this data alone can neither demonstrate nor rule out law-like gravitational behaviour.  ( 2 min )
    Dataset of Fluorescence Spectra and Chemical Parameters of Olive Oils. (arXiv:2301.04471v1 [q-bio.QM])
    This dataset encompasses fluorescence spectra and chemical parameters of 24 olive oil samples from the 2019-2020 harvest provided by the producer Conde de Benalua, Granada, Spain. The oils are characterized by different qualities: 10 extra virgin olive oil (EVOO), 8 virgin olive oil (VOO), and 6 lampante olive oil (LOO) samples. For each sample, the dataset includes fluorescence spectra obtained with two excitation wavelengths, oil quality, and five chemical parameters necessary for the quality assessment of olive oil. The fluorescence spectra were obtained by exciting the samples at 365 nm and 395 nm under identical conditions. The dataset includes the values of the following chemical parameters for each olive oil sample: acidity, peroxide value, K270, K232, ethyl esters, and the quality of the samples (EVOO, VOO, or LOO). The dataset offers a unique possibility for researchers in food technology to develop machine learning models based on fluorescence data for the quality assessment of olive oil due to the availability of both spectroscopic and chemical data. The dataset can be used, for example, to predict one or multiple chemical parameters or to classify samples based on their quality from fluorescence spectra.  ( 2 min )
    Heterogeneous Tri-stream Clustering Network. (arXiv:2301.04451v1 [cs.LG])
    Contrastive deep clustering has recently gained significant attention for its ability to perform joint contrastive learning and clustering via deep neural networks. Despite the rapid progress, previous works mostly require both positive and negative sample pairs for contrastive clustering, which relies on a relatively large batch size. Moreover, they typically adopt a two-stream architecture with two augmented views, which overlooks the possibility and potential benefits of multi-stream architectures (especially with heterogeneous or hybrid networks). In light of this, this paper presents a new end-to-end deep clustering approach termed Heterogeneous Tri-stream Clustering Network (HTCN). The tri-stream architecture in HTCN consists of three main components, including two weight-sharing online networks and a target network, where the parameters of the target network are the exponential moving average of those of the online networks. Notably, the two online networks are trained by simultaneously (i) predicting the instance representations of the target network and (ii) enforcing the consistency between the cluster representations of the target network and those of the two online networks. Experimental results on four challenging image datasets demonstrate the superiority of HTCN over the state-of-the-art deep clustering approaches. The code is available at https://github.com/dengxiaozhi/HTCN.  ( 2 min )
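    The target-network update mentioned in the abstract, an exponential moving average (EMA) of the online parameters, can be sketched as follows. This is a minimal illustration, not the HTCN code: parameters are plain lists of floats, the decay values are assumptions, and because the two online networks share weights a single online parameter set is used.

```python
def ema_update(target, online, decay=0.99):
    """Move each target parameter toward its online counterpart by EMA.

    new_target = decay * target + (1 - decay) * online
    The decay value is an assumption; the abstract does not state one.
    """
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]


target = [0.0, 0.0]
online = [1.0, 2.0]   # held fixed here; in training these change every step
for _ in range(3):
    target = ema_update(target, online, decay=0.5)
print(target)
```

    With a decay close to 1 the target network changes slowly, so it provides a stable prediction target while the online networks are updated by gradient descent.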
    WuYun: Exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning. (arXiv:2301.04488v1 [cs.SD])
    Although deep learning has revolutionized music generation, existing methods for structured melody generation follow an end-to-end left-to-right note-by-note generative paradigm and treat each note equally. Here, we present WuYun, a knowledge-enhanced deep learning architecture for improving the structure of generated melodies, which first generates the most structurally important notes to construct a melodic skeleton and subsequently infills it with dynamically decorative notes into a full-fledged melody. Specifically, we use music domain knowledge to extract melodic skeletons and employ sequence learning to reconstruct them, which serve as additional knowledge to provide auxiliary guidance for the melody generation process. We demonstrate that WuYun can generate melodies with better long-term structure and musicality and outperforms other state-of-the-art methods by 0.51 on average on all subjective evaluation metrics. Our study provides a multidisciplinary lens to design melodic hierarchical structures and bridge the gap between data-driven and knowledge-based approaches for numerous music generation tasks.  ( 2 min )
    A Meta Path-based Approach for Rumor Detection on Social Media. (arXiv:2301.04341v1 [cs.SI])
    The prominent role of social media in people's daily lives has made them more inclined to receive news through social networks than through traditional sources. This shift in public behavior has opened doors for some to spread fake news on social media, subsequently causing negative economic, political, and social consequences as well as distrust among the public. Many methods have been proposed to solve the rumor detection problem, most of which do not take full advantage of the heterogeneous nature of news propagation networks. With this in mind, we took a previously proposed architecture as our baseline and added structural feature extraction from the heterogeneous rumor propagation network to it, using the concept of meta path-based embeddings. We named our model Meta Path-based Global Local Attention Network (MGLAN). Extensive experimental analysis on three state-of-the-art datasets has demonstrated that MGLAN outperforms other models by capturing node-level discrimination between different node types.  ( 2 min )
    Combining Self-labeling with Selective Sampling. (arXiv:2301.04420v1 [cs.LG])
    Since data is the fuel that drives machine learning models, and access to labeled data is generally expensive, semi-supervised methods are constantly popular. They enable the acquisition of large datasets without the need for too many expert labels. This work combines self-labeling techniques with active learning in a selective sampling scenario. We propose a new method that builds an ensemble classifier. Based on an evaluation of the inconsistency of the decisions of the individual base classifiers for a given observation, a decision is made on whether to request a new label or use the self-labeling. In preliminary studies, we show that naive application of self-labeling can harm performance by introducing bias towards selected classes and consequently lead to skewed class distribution. Hence, we also propose mechanisms to reduce this phenomenon. Experimental evaluation shows that the proposed method matches current selective sampling methods or achieves better results.  ( 2 min )
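The query-or-self-label decision described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' method: the function name `decide`, the disagreement measure (fraction of votes differing from the majority) and the `max_disagreement` threshold are all assumptions introduced here.

```python
from collections import Counter


def decide(votes, max_disagreement=0.3):
    """Decide between requesting a label and self-labeling.

    `votes` holds one predicted label per base classifier (hypothetical input).
    Returns ("query", None) when the ensemble is too inconsistent, otherwise
    ("self_label", majority_label).
    """
    if not votes:
        raise ValueError("votes must not be empty")
    # Majority label and how many base classifiers voted for it.
    label, count = Counter(votes).most_common(1)[0]
    # Assumed inconsistency measure: share of votes that disagree with the majority.
    disagreement = 1.0 - count / len(votes)
    if disagreement > max_disagreement:
        return ("query", None)
    return ("self_label", label)


print(decide([1, 1, 1, 1, 0]))  # consistent ensemble
print(decide([1, 0, 1, 0, 2]))  # inconsistent ensemble
```

A real system would also need the bias-reduction mechanisms the abstract mentions, which are not specified there and are not modelled in this sketch.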
    VS-Net: Multiscale Spatiotemporal Features for Lightweight Video Salient Document Detection. (arXiv:2301.04447v1 [cs.CV])
    Video Salient Document Detection (VSDD) is an essential task of practical computer vision, which aims to highlight visually salient document regions in video frames. Previous techniques for VSDD focus on learning features without considering the cooperation among and across the appearance and motion cues and thus fail to perform well in practical scenarios. Moreover, most of the previous techniques demand high computational resources, which limits the usage of such systems in resource-constrained settings. To handle these issues, we propose VS-Net, which captures multi-scale spatiotemporal information with the help of dilated depth-wise separable convolution and Approximation Rank Pooling. VS-Net extracts the key features locally from each frame across embedding sub-spaces and forwards the features between adjacent and parallel nodes, enhancing model performance globally. Our model generates saliency maps considering both the background and foreground simultaneously, making it perform better in challenging scenarios. Extensive experiments conducted on the benchmark MIDV-500 dataset show that the VS-Net model outperforms state-of-the-art approaches in both time and robustness measures.  ( 2 min )
    Loss-Controlling Calibration for Predictive Models. (arXiv:2301.04378v1 [cs.LG])
    We propose a learning framework for calibrating predictive models to make loss-controlling prediction for exchangeable data, which extends our recently proposed conformal loss-controlling prediction for more general cases. By comparison, the predictors built by the proposed loss-controlling approach are not limited to set predictors, and the loss function can be any measurable function without the monotone assumption. To control the loss values in an efficient way, we introduce transformations preserving exchangeability to prove finite-sample controlling guarantee when the test label is obtained, and then develop an approximation approach to construct predictors. The transformations can be built on any predefined function, which include using optimization algorithms for parameter searching. This approach is a natural extension of conformal loss-controlling prediction, since it can be reduced to the latter when the set predictors have the nesting property and the loss functions are monotone. Our proposed method is tested empirically for high-impact weather forecasting and the experimental results demonstrate its effectiveness for controlling the non-monotone loss related to false discovery.  ( 2 min )
    An Analysis of Quantile Temporal-Difference Learning. (arXiv:2301.04462v1 [cs.LG])
    We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.  ( 2 min )
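For orientation, the tabular quantile TD update is commonly written as a quantile-regression step towards sampled targets. The sketch below follows that commonly stated form; the abstract itself gives no formulas, so the function name `qtd_update`, the dictionary layout of `theta`, and the midpoint quantile levels $\tau_i = (2i+1)/(2m)$ (zero-indexed) are assumptions.

```python
def qtd_update(theta, x, reward, x_next, alpha=0.1, gamma=0.9):
    """Apply one tabular quantile TD step for the transition (x, reward, x_next).

    `theta[s]` is a list of m quantile estimates of the return from state s.
    """
    m = len(theta[x])
    # Sampled return targets, built from the next state's quantile estimates.
    targets = [reward + gamma * q for q in theta[x_next]]
    updated = []
    for i, q_i in enumerate(theta[x]):
        tau_i = (2 * i + 1) / (2 * m)  # midpoint quantile level
        # Average quantile-regression direction: tau_i minus the indicator
        # that the target falls below the current estimate.
        step = sum(tau_i - (1.0 if t < q_i else 0.0) for t in targets) / m
        updated.append(q_i + alpha * step)
    theta[x] = updated
    return theta


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}
qtd_update(theta, x=0, reward=1.0, x_next=1)
print(theta[0])
```

Note that the step size depends only on the sign of the TD error, not its magnitude. That is one way to see why the update is non-linear and not a contraction in the usual sense.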
    Multiple-level Point Embedding for Solving Human Trajectory Imputation with Prediction. (arXiv:2301.04482v1 [cs.LG])
    Sparsity is a common issue in many trajectory datasets, including human mobility data. This issue frequently brings more difficulty to relevant learning tasks, such as trajectory imputation and prediction. Nowadays, little existing work simultaneously deals with imputation and prediction on human trajectories. This work plans to explore whether the learning process of imputation and prediction could benefit from each other to achieve better outcomes. And the question will be answered by studying the coexistence patterns between missing points and observed ones in incomplete trajectories. More specifically, the proposed model develops an imputation component based on the self-attention mechanism to capture the coexistence patterns between observations and missing points among encoder-decoder layers. Meanwhile, a recurrent unit is integrated to extract the sequential embeddings from newly imputed sequences for predicting the following location. Furthermore, a new implementation called Imputation Cycle is introduced to enable gradual imputation with prediction enhancement at multiple levels, which helps to accelerate the speed of convergence. The experimental results on three different real-world mobility datasets show that the proposed approach has significant advantages over the competitive baselines across both imputation and prediction tasks in terms of accuracy and stability.  ( 2 min )
    Network Adaptive Federated Learning: Congestion and Lossy Compression. (arXiv:2301.04430v1 [cs.LG])
    In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the client's lossy compression choices to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.  ( 2 min )
    SoK: Adversarial Machine Learning Attacks and Defences in Multi-Agent Reinforcement Learning. (arXiv:2301.04299v1 [cs.LG])
    Multi-Agent Reinforcement Learning (MARL) is vulnerable to Adversarial Machine Learning (AML) attacks and needs adequate defences before it can be used in real world applications. We have conducted a survey into the use of execution-time AML attacks against MARL and the defences against those attacks. We surveyed related work in the application of AML in Deep Reinforcement Learning (DRL) and Multi-Agent Learning (MAL) to inform our analysis of AML for MARL. We propose a novel perspective to understand the manner of perpetrating an AML attack, by defining Attack Vectors. We develop two new frameworks to address a gap in current modelling frameworks, focusing on the means and tempo of an AML attack against MARL, and identify knowledge gaps and future avenues of research.  ( 2 min )
    Robust Bayesian Target Value Optimization. (arXiv:2301.04344v1 [cs.LG])
    We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.  ( 2 min )
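The expected squared error criterion makes the role of output variance explicit through the standard identity $\mathbb{E}[(Y - t)^2] = (\mu - t)^2 + \sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of the output $Y$ and $t$ is the target. The sketch below only evaluates this identity; the function name and the example numbers are illustrative assumptions, not taken from the paper.

```python
def expected_squared_error(mean, variance, target):
    """Return E[(Y - target)^2] for an output Y with the given mean and variance."""
    if variance < 0:
        raise ValueError("variance must be non-negative")
    # Squared bias towards the target plus the output variance.
    return (mean - target) ** 2 + variance


# Hypothetical inputs: A hits the target on average but is noisy,
# B is slightly off-target but much less variable.
target = 3.0
print(expected_squared_error(3.0, 2.0, target))   # input A
print(expected_squared_error(2.5, 0.25, target))  # input B
```

Under this criterion input B is preferred even though its mean is further from the target. A criterion based on the expectation alone would miss this.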
    Beyond Graph Convolutional Network: An Interpretable Regularizer-centered Optimization Framework. (arXiv:2301.04318v1 [cs.LG])
    Graph convolutional networks (GCNs) have been attracting widespread attention due to their encouraging performance and powerful generalization. However, few works provide a general view to interpret various GCNs and guide GCNs' designs. In this paper, by revisiting the original GCN, we induce an interpretable regularizer-centered optimization framework, in which by building appropriate regularizers we can interpret most GCNs, such as APPNP, JKNet, DAGNN, and GNN-LF/HF. Further, under the proposed framework, we devise a dual-regularizer graph convolutional network (dubbed tsGCN) to capture topological and semantic structures from graph data. Since the derived learning rule for tsGCN contains an inverse of a large matrix and thus is time-consuming, we leverage the Woodbury matrix identity and low-rank approximation tricks to successfully decrease the high computational complexity of computing infinite-order graph convolutions. Extensive experiments on eight public datasets demonstrate that tsGCN achieves superior performance against quite a few state-of-the-art competitors w.r.t. classification tasks.  ( 2 min )
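For reference, the Woodbury matrix identity mentioned in this abstract reads as follows in its general form. The symbols are generic: $A$ is an invertible $n \times n$ matrix, $C$ an invertible $k \times k$ matrix, $U$ is $n \times k$ and $V$ is $k \times n$. How the paper instantiates them for tsGCN is not stated in the abstract.

```latex
\[
  (A + U C V)^{-1}
  = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}
\]
```

When $k \ll n$ and $A^{-1}$ is cheap to apply (for example, when $A$ is diagonal), the only dense inverse left on the right-hand side is $k \times k$. This is why the identity pairs naturally with a low-rank approximation.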
    Learnable Path in Neural Controlled Differential Equations. (arXiv:2301.04333v1 [cs.LG])
    Neural controlled differential equations (NCDEs), which are continuous analogues to recurrent neural networks (RNNs), are a specialized model in (irregular) time-series processing. In comparison with similar models, e.g., neural ordinary differential equations (NODEs), the key distinctive characteristics of NCDEs are i) the adoption of the continuous path created by an interpolation algorithm from each raw discrete time-series sample and ii) the adoption of the Riemann--Stieltjes integral. It is the continuous path which makes NCDEs analogous to continuous RNNs. However, NCDEs use existing interpolation algorithms to create the path, and it is unclear whether these can create an optimal path. To this end, we present a method to generate another latent path (rather than relying on existing interpolation algorithms), which is identical to learning an appropriate interpolation method. We design an encoder-decoder module based on NCDEs and NODEs, and a special training method for it. Our method shows the best performance in both time-series classification and forecasting.  ( 2 min )
    Synthetic data generation method for data-free knowledge distillation in regression neural networks. (arXiv:2301.04338v1 [cs.LG])
    Knowledge distillation is the technique of compressing a larger neural network, known as the teacher, into a smaller neural network, known as the student, while still trying to maintain the performance of the larger neural network as much as possible. Existing methods of knowledge distillation are mostly applicable for classification tasks. Many of them also require access to the data used to train the teacher model. To address the problem of knowledge distillation for regression tasks under the absence of original training data, previous work has proposed a data-free knowledge distillation method where synthetic data are generated using a generator model trained adversarially against the student model. These synthetic data and their labels predicted by the teacher model are then used to train the student model. In this study, we investigate the behavior of various synthetic data generation methods and propose a new synthetic data generation strategy that directly optimizes for a large but bounded difference between the student and teacher model. Our results on benchmark and case study experiments demonstrate that the proposed strategy allows the student model to learn better and emulate the performance of the teacher model more closely.  ( 2 min )
    Application of machine learning to gas flaring. (arXiv:2301.04141v1 [cs.LG])
    Currently in the petroleum industry, operators often flare the produced gas instead of commodifying it. The flaring magnitudes are large in some states, which constitute problems with energy waste and CO2 emissions. In North Dakota, operators are required to estimate and report the volume flared. The questions are, how good is the quality of this reporting, and what insights can be drawn from it? Apart from the company-reported statistics, which are available from the North Dakota Industrial Commission (NDIC), flared volumes can be estimated via satellite remote sensing, serving as an unbiased benchmark. Since interpretation of the Landsat 8 imagery is hindered by artifacts due to glow, the estimated volumes based on the Visible Infrared Imaging Radiometer Suite (VIIRS) are used. Reverse geocoding is performed for comparing and contrasting the NDIC and VIIRS data at different levels, such as county and oilfield. With all the data gathered and preprocessed, Bayesian learning implemented by MCMC methods is performed to address three problems: county level model development, flaring time series analytics, and distribution estimation. First, there is heterogeneity among the different counties, in the associations between the NDIC and VIIRS volumes. In light of such, models are developed for each county by exploiting hierarchical models. Second, the flaring time series, albeit noisy, contains information regarding trends and patterns, which provide some insights into operator approaches. Gaussian processes are found to be effective in many different pattern recognition scenarios. Third, distributional insights are obtained through unsupervised learning. The negative binomial and GMMs are found to effectively describe the oilfield flare count and flared volume distributions, respectively. Finally, a nearest-neighbor-based approach for operator level monitoring and analytics is introduced.  ( 2 min )
    Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models. (arXiv:2301.04213v1 [cs.LG])
    Language models are known to learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights (Meng et al., 2022). In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific parameters in models would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit (Meng et al., 2022). Next, to better understand the discrepancy between representation denoising and weight editing, we develop several variants of the editing problem that appear more and more like representation denoising in their design and objective. Experiments show that, for one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior. Code is available at: https://github.com/google/belief-localization  ( 2 min )
    Age of Information in Deep Learning-Driven Task-Oriented Communications. (arXiv:2301.04298v1 [cs.IT])
    This paper studies the notion of age in task-oriented communications that aim to execute a task at a receiver utilizing the data at its transmitter. The transmitter-receiver operations are modeled as an encoder-decoder pair of deep neural networks (DNNs) that are jointly trained while considering channel effects. The encoder converts data samples into feature vectors of small dimension and transmits them with a small number of channel uses, thereby reducing the number of transmissions and latency. Instead of reconstructing input samples, the decoder performs a task, e.g., classification, on the received signals. Applying different DNNs on MNIST and CIFAR-10 image data, the classifier accuracy is shown to increase with the number of channel uses at the expense of longer service time. The peak age of task information (PAoTI) is introduced to analyze this accuracy-latency tradeoff when the age grows unless a received signal is classified correctly. By incorporating channel and traffic effects, design guidelines are obtained for task-oriented communications by characterizing how the PAoTI first decreases and then increases with the number of channel uses. A dynamic update mechanism is presented to adapt the number of channel uses to channel and traffic conditions, and reduce the PAoTI in task-oriented communications.  ( 2 min )
    Data Distillation: A Survey. (arXiv:2301.04272v1 [cs.LG])
    The popularity of deep learning has led to the curation of a vast number of massive and multifarious datasets. Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems such as (a) high model-training time; (b) slow research iteration; and (c) poor eco-sustainability. As an alternative, data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset for scenarios like model training, inference, architecture search, etc. In this survey, we present a formal framework for data distillation, along with providing a detailed taxonomy of existing approaches. Additionally, we cover data distillation approaches for different data modalities, namely images, graphs, and user-item interactions (recommender systems), while also identifying current challenges and future research directions.  ( 2 min )
    Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images. (arXiv:2301.04224v1 [cs.CV])
    Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.  ( 2 min )
    A Possible Converter to Denoise the Images of Exoplanet Candidates through Machine Learning Techniques. (arXiv:2301.04292v1 [astro-ph.EP])
    The method of direct imaging has detected many exoplanets and made important contributions to the field of planet formation. The standard method employs the angular differential imaging (ADI) technique, and more ADI image frames could lead to results with a larger signal-to-noise ratio (SNR). However, this would need precious observational time from large telescopes, which are always over-subscribed. We thus explore the possibility of generating a converter which can increase the SNR derived from a smaller number of ADI frames. A machine learning technique with a two-dimensional convolutional neural network (2D-CNN) is tested here. Several 2D-CNN models are trained and their denoising performances are presented and compared. It is found that our proposed Modified five-layer Wide Inference Network with the Residual learning technique and Batch normalization (MWIN5-RB) can give the best result. We conclude that this MWIN5-RB can be employed as a converter for future observational data.  ( 2 min )
    Diffusion Models For Stronger Face Morphing Attacks. (arXiv:2301.04218v1 [cs.CV])
    Face morphing attacks seek to deceive a Face Recognition (FR) system by presenting a morphed image consisting of the biometric qualities from two different identities with the aim of triggering a false acceptance with one of the two identities, thereby presenting a significant threat to biometric systems. The success of a morphing attack is dependent on the ability of the morphed image to represent the biometric characteristics of both identities that were used to create the image. We present a novel morphing attack that uses a Diffusion-based architecture to improve the visual fidelity of the image and improve the ability of the morphing attack to represent characteristics from both identities. We demonstrate the high fidelity of the proposed attack by evaluating its visual fidelity via the Frechet Inception Distance. Extensive experiments are conducted to measure the vulnerability of FR systems to the proposed attack. The proposed attack is compared to two state-of-the-art GAN-based morphing attacks along with two Landmark-based attacks. The ability of a morphing attack detector to detect the proposed attack is measured and compared against the other attacks. Additionally, a novel metric to measure the relative strength between morphing attacks is introduced and evaluated.  ( 2 min )
    schlably: A Python Framework for Deep Reinforcement Learning Based Scheduling Experiments. (arXiv:2301.04182v1 [cs.LG])
    Research on deep reinforcement learning (DRL) based production scheduling (PS) has gained a lot of attention in recent years, primarily due to the high demand for optimizing scheduling problems in diverse industry settings. Numerous studies are carried out and published as stand-alone experiments that often vary only slightly with respect to problem setups and solution approaches. The programmatic core of these experiments is typically very similar. Despite this fact, no standardized and resilient framework for experimentation on PS problems with DRL algorithms could be established so far. In this paper, we introduce schlably, a Python-based framework that provides researchers a comprehensive toolset to facilitate the development of PS solution strategies based on DRL. schlably eliminates the redundant overhead work that the creation of a sturdy and flexible backbone requires and increases the comparability and reusability of conducted research work.  ( 2 min )
    ClimaBench: A Benchmark Dataset For Climate Change Text Understanding in English. (arXiv:2301.04253v1 [cs.CL])
    The topic of Climate Change (CC) has received limited attention in NLP despite its real world urgency. Activists and policy-makers need NLP tools in order to effectively process the vast and rapidly growing textual data produced on CC. Their utility, however, primarily depends on whether the current state-of-the-art models can generalize across various tasks in the CC domain. In order to address this gap, we introduce Climate Change Benchmark (ClimaBench), a benchmark collection of existing disparate datasets for evaluating model performance across a diverse set of CC NLU tasks systematically. Further, we enhance the benchmark by releasing two large-scale labelled text classification and question-answering datasets curated from publicly available environmental disclosures. Lastly, we provide an analysis of several generic and CC-oriented models answering whether fine-tuning on domain text offers any improvements across these tasks. We hope this work provides a standard assessment tool for research on CC text data.  ( 2 min )
    Adversarial Online Multi-Task Reinforcement Learning. (arXiv:2301.04268v1 [cs.LG])
    We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.  ( 2 min )
    An Efficient Drifters Deployment Strategy to Evaluate Water Current Velocity Fields. (arXiv:2301.04216v1 [cs.LG])
    Water current prediction is essential for understanding ecosystems and for shedding light on the role of the ocean in the global climate context. Solutions vary from physical modeling and long-term observations to short-term measurements. In this paper, we consider a common approach for water current prediction that uses Lagrangian floaters, interpolating the trajectory of the elements to reflect the velocity field. Here, an important aspect that has not been addressed before is where to initially deploy the drifting elements such that the acquired velocity field would efficiently represent the water current. To that end, we use a clustering approach that relies on a physical model of the velocity field. Our method segments the modeled map and determines the deployment locations as those that will lead the floaters to 'visit' the center of the different segments. This way, we validate that the area covered by the floaters will capture the inhomogeneity in the velocity field. Exploration over a dataset of velocity field maps that span over a year demonstrates the applicability of our approach, and shows a considerable improvement over the common approach of uniformly randomly choosing the initial deployment sites. Finally, our implementation code can be found in [1].  ( 2 min )
    Explaining Deep Models through Forgettable Learning Dynamics. (arXiv:2301.04221v1 [cs.CV])
    Even though deep neural networks have shown tremendous success in countless applications, explaining model behaviour or predictions is an open research problem. In this paper, we address this issue by employing a simple yet effective method that analyses the learning dynamics of deep neural networks in semantic segmentation tasks. Specifically, we visualize the learning behaviour during training by tracking how often samples are learned and forgotten in subsequent training epochs. This further allows us to derive important information about the proximity to the class decision boundary and identify regions that pose a particular challenge to the model. Inspired by this phenomenon, we present a novel segmentation method that actively uses this information to alter the data representation within the model by increasing the variety of difficult regions. Finally, we show that our method consistently reduces the number of regions that are forgotten frequently. We further evaluate our method in light of the segmentation performance.  ( 2 min )
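Tracking how often a sample is learned and then forgotten can be sketched as counting correct-to-incorrect transitions between consecutive epochs. This is a minimal sketch under assumptions: the function name, the per-sample list of booleans, and the definition of a forgetting event as a correct prediction followed by an incorrect one are all introduced here. In a segmentation setting the same count would presumably be kept per pixel.

```python
def count_forgetting_events(correct_by_epoch):
    """Count epochs at which a previously correct prediction becomes incorrect.

    `correct_by_epoch[e]` is True if the sample was predicted correctly
    at the end of epoch e (hypothetical bookkeeping).
    """
    pairs = zip(correct_by_epoch, correct_by_epoch[1:])
    return sum(1 for prev, cur in pairs if prev and not cur)


# Learned at epoch 1, forgotten at epoch 2, relearned, then forgotten again.
history = [False, True, False, True, True, False]
print(count_forgetting_events(history))
```

Samples or regions with a high count are the ones the model keeps relearning, which the abstract links to proximity to the class decision boundary.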
    Towards Microstructural State Variables in Materials Systems. (arXiv:2301.04261v1 [cs.LG])
    The vast combination of material properties seen in nature is achieved by the complexity of the material microstructure. Advanced characterization and physics-based simulation techniques have led to the generation of extremely large microstructural datasets. There is a need for machine learning techniques that can manage data complexity by capturing the maximal amount of information about the microstructure using the least number of variables. This paper aims to formulate dimensionality and state variable estimation techniques focused on reducing microstructural image data. It is shown that local dimensionality estimation based on nearest neighbors tends to give consistent dimension estimates for natural images for all p-Minkowski distances. However, it is found that dimensionality estimates have a systematic error for low-bit-depth microstructural images. The use of Manhattan distance to alleviate this issue is demonstrated. It is also shown that stacked autoencoders can reconstruct the generator space of high dimensional microstructural data and provide a sparse set of state variables to fully describe the variability in material microstructures.  ( 2 min )
    A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization. (arXiv:2301.04204v1 [math.OC])
    In this paper we consider finding an approximate second-order stationary point (SOSP) of general nonconvex conic optimization that minimizes a twice differentiable function subject to nonlinear equality constraints and also a convex conic constraint. In particular, we propose a Newton-conjugate gradient (Newton-CG) based barrier-augmented Lagrangian method for finding an approximate SOSP of this problem. Under some mild assumptions, we show that our method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of general nonconvex conic optimization with high probability. Moreover, under a constraint qualification, these complexity bounds are improved to $\widetilde{\cal O}(\epsilon^{-7/2})$ and $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$, respectively. To the best of our knowledge, this is the first study on the complexity of finding an approximate SOSP of general nonconvex conic optimization. Preliminary numerical results are presented to demonstrate superiority of the proposed method over first-order methods in terms of solution quality.  ( 2 min )
    ODIM: an efficient method to detect outliers via inlier-memorization effect of deep generative models. (arXiv:2301.04257v1 [stat.ML])
    Identifying whether a given sample is an outlier or not is an important issue in various real-world domains. This study aims to solve the unsupervised outlier detection problem where training data contain outliers, but no label information about inliers and outliers is given. We propose a powerful and efficient learning framework to identify outliers in a training data set using deep neural networks. We start with a new observation called the inlier-memorization (IM) effect. When we train a deep generative model with data contaminated with outliers, the model first memorizes inliers before outliers. Exploiting this finding, we develop a new method called the outlier detection via the IM effect (ODIM). The ODIM only requires a few updates; thus, it is computationally efficient, tens of times faster than other deep-learning-based algorithms. Also, the ODIM filters out outliers successfully, regardless of the types of data, such as tabular, image, and sequential. We empirically demonstrate the superiority and efficiency of the ODIM by analyzing 20 data sets.  ( 2 min )
    Inferring Gene Regulatory Neural Networks for Bacterial Decision Making in Biofilms. (arXiv:2301.04225v1 [q-bio.MN])
    Bacterial cells are sensitive to a range of external signals used to learn the environment. These incoming external signals are then processed using a Gene Regulatory Network (GRN), exhibiting similarities to modern computing algorithms. An in-depth analysis of gene expression dynamics suggests an inherited Gene Regulatory Neural Network (GRNN) behavior within the GRN that enables cellular decision-making based on received signals from the environment and neighbor cells. In this study, we extract a sub-network of the \textit{Pseudomonas aeruginosa} GRN that is associated with one virulence factor, pyocyanin production, as a use case to investigate the GRNN behaviors. Further, using a Graph Neural Network (GNN) architecture, we model a single-species biofilm to reveal the role of GRNN dynamics in ecosystem-wide decision-making. Varying environmental conditions, we prove that the extracted GRNN computes input signals similarly to the natural decision-making process of the cell. Identifying neural network behaviors in GRNs may lead to more accurate bacterial cell activity predictive models for many applications, including human health-related problems and agricultural applications. Further, this model can produce data on causal relationships throughout the network, enabling the possibility of designing tailor-made infection-controlling mechanisms. More interestingly, these GRNNs can perform computational tasks for bio-hybrid computing systems.  ( 2 min )
    Analogical Relevance Index. (arXiv:2301.04134v1 [cs.LG])
    Focusing on the most significant features of a dataset is useful both in machine learning (ML) and data mining. In ML, it can lead to a higher accuracy, a faster learning process, and ultimately a simpler and more understandable model. In data mining, identifying significant features is essential not only for gaining a better understanding of the data but also for visualization. In this paper, we demonstrate a new way of identifying significant features inspired by analogical proportions. Such a proportion is of the form of "a is to b as c is to d", comparing two pairs of items (a, b) and (c, d) in terms of similarities and dissimilarities. In a classification context, if the similarities/dissimilarities between a and b correlate with the fact that a and b have different labels, this knowledge can be transferred to c and d, inferring that c and d also have different labels. From a feature selection perspective, observing a huge number of such pairs (a, b) where a and b have different labels provides a hint about the importance of the features where a and b differ. Following this idea, we introduce the Analogical Relevance Index (ARI), a new statistical test of the significance of a given feature with respect to the label. ARI is a filter-based method. Filter-based methods are ML-agnostic but generally unable to handle feature redundancy. However, ARI can detect feature redundancy. Our experiments show that ARI is effective and outperforms well-known methods on a variety of artificial and some real datasets.  ( 2 min )
    Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context. (arXiv:2301.04248v1 [cs.CL])
    We propose a system to predict harmful discussions on social media platforms. Our solution uses contextual deep language models and proposes the novel idea of integrating state-of-the-art Graph Transformer Networks to analyze all conversations that follow an initial post. This framework also supports adapting to future comments as the conversation unfolds. In addition, we study whether a community-specific analysis of hate speech leads to more effective detection of hateful discussions. We evaluate our approach on 333,487 Reddit discussions from various communities. We find that community-specific modeling improves performance two-fold and that models which capture wider-discussion context improve accuracy by 28\% (35\% for the most hateful content) compared to limited context models.  ( 2 min )
  • Open

    Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization. (arXiv:2207.13676v2 [cs.LG] UPDATED)
    Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at https://github.com/google/vizier.  ( 2 min )
    Learning fair representation with a parametric integral probability metric. (arXiv:2202.02943v4 [stat.ML] UPDATED)
    As they have a vital effect on social decision-making, AI algorithms should be not only accurate but also fair. Among various algorithms for fair AI, learning fair representation (LFR), whose goal is to find a fair representation with respect to sensitive variables such as gender and race, has received much attention. For LFR, the adversarial training scheme is popularly employed as is done in the generative adversarial network type algorithms. The choice of a discriminator, however, is done heuristically without justification. In this paper, we propose a new adversarial training scheme for LFR, where the integral probability metric (IPM) with a specific parametric family of discriminators is used. The most notable result of the proposed LFR algorithm is its theoretical guarantee about the fairness of the final prediction model, which has not been considered yet. That is, we derive theoretical relations between the fairness of representation and the fairness of the prediction model built on top of the representation (i.e., using the representation as the input). Moreover, by numerical experiments, we show that our proposed LFR algorithm is computationally lighter and more stable, and the final prediction model is competitive or superior to other LFR algorithms using more complex discriminators.  ( 2 min )
    Benign Overfitting in Time Series Linear Model with Over-Parameterization. (arXiv:2204.08369v2 [math.ST] UPDATED)
    The success of large-scale models in recent years has increased the importance of statistical models with numerous parameters. Several studies have analyzed over-parameterized linear models with high-dimensional data that may not be sparse; however, existing results depend on the independent setting of samples. In this study, we analyze a linear regression model with dependent time series data under over-parameterization settings. We consider an estimator via interpolation and develop a theory for the excess risk of the estimator. Then, we derive bounds of risks by the estimator for the cases where the temporal correlation of each coordinate of dependent data is homogeneous and heterogeneous, respectively. The derived bounds reveal that a temporal covariance of the data plays a key role; its strength affects the bias of the risk, and its nondegeneracy affects the variance of the risk. Moreover, for the heterogeneous correlation case, we show that the convergence rate of risks with short-memory processes is identical to that of cases with independent data, and the risk can converge to zero even with long-memory processes. Our theory can be extended to infinite-dimensional data in a unified manner. We also present several examples of specific dependent processes that can be applied to our setting.  ( 2 min )
    FLEA: Provably Robust Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v4 [cs.LG] UPDATED)
    Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but also do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might not be representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that identifies and suppresses those data sources that would have a negative impact on fairness or accuracy if they were used for training. As such, FLEA is not a replacement of prior fairness-aware learning methods but rather an augmentation that makes any of them robust against unreliable training data. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally, we prove formally that -- given enough data -- FLEA protects the learner against corruptions as long as the fraction of affected data sources is less than half. Our source code and documentation are available at https://github.com/ISTAustria-CVML/FLEA.  ( 2 min )
    Quantifying the Impact of Label Noise on Federated Learning. (arXiv:2211.07816v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.  ( 2 min )
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v2 [stat.ML] UPDATED)
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.  ( 2 min )
    Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity. (arXiv:2203.11572v2 [cs.LG] UPDATED)
    Despite significant progress, there remain three limitations to the previous multi-view clustering algorithms. First, they often suffer from high computational complexity, restricting their feasibility for large-scale datasets. Second, they typically fuse multi-view information via one-stage fusion, neglecting the possibilities in multi-stage fusions. Third, dataset-specific hyperparameter-tuning is frequently required, further undermining their practicability. In light of this, we propose a fast multi-view clustering via ensembles (FastMICE) approach. Particularly, the concept of random view groups is presented to capture the versatile view-wise relationships, through which the hybrid early-late fusion strategy is designed to enable efficient multi-stage fusions. With multiple views extended to many view groups, three levels of diversity (w.r.t. features, anchors, and neighbors, respectively) are jointly leveraged for constructing the view-sharing bipartite graphs in the early-stage fusion. Then, a set of diversified base clusterings for different view groups are obtained via fast graph partitioning, which are further formulated into a unified bipartite graph for final clustering in the late-stage fusion. Notably, FastMICE has almost linear time and space complexity, and is free of dataset-specific tuning. Experiments on 22 multi-view datasets demonstrate its advantages in scalability (for extremely large datasets), superiority (in clustering performance), and simplicity (to be applied) over the state-of-the-art. Code available: https://github.com/huangdonghere/FastMICE.  ( 2 min )
    Towards Backdoor Attacks and Defense in Robust Machine Learning Models. (arXiv:2003.00865v4 [cs.CV] UPDATED)
    The introduction of robust optimisation has pushed the state-of-the-art in defending against adversarial attacks. Notably, the state-of-the-art projected gradient descent (PGD)-based training method has been shown to be universally and reliably effective in defending against adversarial inputs. This robustness approach uses PGD as a reliable and universal "first-order adversary". However, the behaviour of such optimisation has not been studied in the light of a fundamentally different class of attacks called backdoors. In this paper, we study how to inject and defend against backdoor attacks for robust models trained using PGD-based robust optimisation. We demonstrate that these models are susceptible to backdoor attacks. Subsequently, we observe that backdoors are reflected in the feature representation of such models. Then, this observation is leveraged to detect such backdoor-infected models via a detection technique called AEGIS. Specifically, given a robust Deep Neural Network (DNN) that is trained using a PGD-based first-order adversarial training approach, AEGIS uses feature clustering to effectively detect whether such DNNs are backdoor-infected or clean. In our evaluation of several visible and hidden backdoor triggers on major classification tasks using CIFAR-10, MNIST and FMNIST datasets, AEGIS effectively detects PGD-trained robust DNNs infected with backdoors. AEGIS detects such backdoor-infected models with 91.6% accuracy (11 out of 12 tested models), without any false positives. Furthermore, AEGIS detects the targeted class in the backdoor-infected model with a reasonably low (11.1%) false positive rate. Our investigation reveals that salient features of adversarially robust DNNs could be promising to break the stealthy nature of backdoor attacks.  ( 3 min )
    Improving And Analyzing Neural Speaker Embeddings for ASR. (arXiv:2301.04571v1 [cs.CL])
    Neural speaker embeddings encode the speaker's speech characteristics through a DNN model and are prevalent for speaker verification tasks. However, few studies have investigated the usage of neural speaker embeddings for an ASR system. In this work, we present our efforts w.r.t. integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system. For ASR, our improved embedding extraction pipeline in combination with the Weighted-Simple-Add integration method results in x-vector and c-vector reaching on-par performance with i-vectors. We further compare and analyze different speaker embeddings. We present our acoustic model improvements obtained by switching from the newbob learning rate schedule to a one-cycle learning rate schedule, resulting in a ~3% relative WER reduction on Switchboard and additionally reducing the overall training time by 17%. By further adding neural speaker embeddings, we gain an additional ~3% relative WER improvement on Hub5'00. Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.  ( 2 min )
    Trajectory Modeling via Random Utility Inverse Reinforcement Learning. (arXiv:2105.12092v2 [cs.AI] UPDATED)
    We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. Cars are detected by sensors placed on sparsely distributed points on the street network of a city. As rational agents, drivers are trying to maximize some reward function unknown to an external observer. We apply the concept of random utility from econometrics to model the unknown reward function as a function of observed and unobserved features. In contrast to current inverse reinforcement learning approaches, we do not assume that agents act according to a stochastic policy; rather, we assume that agents act according to a deterministic optimal policy and show that randomness in data arises because the exact rewards are not fully observed by an external observer. We introduce the concept of extended state to cope with unobserved features and develop a Markov decision process formulation of drivers' decisions. We present theoretical results which guarantee the existence of solutions and show that maximum entropy inverse reinforcement learning is a particular case of our approach. Finally, we illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city in Brazil.  ( 2 min )
    Network Adaptive Federated Learning: Congestion and Lossy Compression. (arXiv:2301.04430v1 [cs.LG])
    In order to achieve the dual goals of privacy and learning across distributed data, Federated Learning (FL) systems rely on frequent exchanges of large files (model updates) between a set of clients and the server. As such, FL systems are exposed to, or indeed the cause of, congestion across a wide set of network resources. Lossy compression can be used to reduce the size of exchanged files and associated delays, at the cost of adding noise to model updates. By judiciously adapting clients' compression to varying network congestion, an FL application can reduce wall clock training time. To that end, we propose a Network Adaptive Compression (NAC-FL) policy, which dynamically varies the clients' lossy compression choices in response to network congestion variations. We prove, under appropriate assumptions, that NAC-FL is asymptotically optimal in terms of directly minimizing the expected wall clock training time. Further, we show via simulation that NAC-FL achieves robust performance improvements with higher gains in settings with positively correlated delays across time.  ( 2 min )
    Robust Bayesian Target Value Optimization. (arXiv:2301.04344v1 [cs.LG])
    We consider the problem of finding an input to a stochastic black box function such that the scalar output of the black box function is as close as possible to a target value in the sense of the expected squared error. While the optimization of stochastic black boxes is classic in (robust) Bayesian optimization, the current approaches based on Gaussian processes predominantly focus either on i) maximization/minimization rather than target value optimization or ii) on the expectation, but not the variance of the output, ignoring output variations due to stochasticity in uncontrollable environmental variables. In this work, we fill this gap and derive acquisition functions for common criteria such as the expected improvement, the probability of improvement, and the lower confidence bound, assuming that aleatoric effects are Gaussian with known variance. Our experiments illustrate that this setting is compatible with certain extensions of Gaussian processes, and show that the thus derived acquisition functions can outperform classical Bayesian optimization even if the latter assumptions are violated. An industrial use case in billet forging is presented.  ( 2 min )
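    The expected-squared-error criterion named in this abstract can be unpacked with a standard identity. As a sketch, assume the output at a fixed input $x$ is a random variable $Y$ with mean $\mu(x)$ and finite variance $\sigma^2(x)$, and write $t$ for the target value (these symbols are chosen here for illustration and are not taken from the paper):

```latex
\mathbb{E}\big[(Y - t)^2\big]
  = \big(\mu(x) - t\big)^2 + \sigma^2(x)
```

    The second term shows why a criterion based on the expectation alone can prefer an input whose mean hits the target but whose output still varies widely.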
    Convex Surrogate Loss Functions for Contextual Pricing with Transaction Data. (arXiv:2202.10944v2 [cs.LG] UPDATED)
    We study an off-policy contextual pricing problem where the seller has access to samples of prices that customers were previously offered, whether they purchased at that price, and auxiliary features describing the customer and/or item being sold. This is in contrast to the well-studied setting in which samples of the customer's valuation (willingness to pay) are observed. In our setting, the observed data is influenced by the previous pricing policy, and we do not know how customers would have responded to alternative prices. We introduce suitable loss functions for this setting that can be directly optimized to find an effective pricing policy with expected revenue guarantees, without the need for estimation of an intermediate demand function. We focus on convex loss functions. This is particularly relevant when linear pricing policies are desired for interpretability reasons, resulting in a tractable convex revenue optimization problem. We propose generalized hinge and quantile pricing loss functions that price at a multiplicative factor of the conditional expected valuation or a particular quantile of the prices that sold, despite the valuation data not being observed. We prove expected revenue bounds for these pricing policies respectively when the valuation distribution is log-concave, and we provide generalization bounds for the finite sample case. Finally, we conduct simulations on both synthetic and real-world data to demonstrate that this approach is competitive with, and in some settings outperforms, state-of-the-art methods in contextual pricing.  ( 2 min )
    An Analysis of Quantile Temporal-Difference Learning. (arXiv:2301.04462v1 [cs.LG])
    We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.  ( 2 min )
    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence. (arXiv:2105.11066v4 [cs.LG] UPDATED)
    Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need to encourage exploration and the need to ensure certain structural properties of the learned policy due to safety, resource and operational constraints. These can often be accounted for via regularized RL, which augments the target value function with a structure-promoting regularizer. Focusing on discounted infinite-horizon Markov decision processes, we propose a generalized policy mirror descent (GPMD) algorithm for solving regularized RL. As a generalization of policy mirror descent (arXiv:2102.00135), our algorithm accommodates a general class of convex regularizers and promotes the use of a Bregman divergence cognizant of the regularizer in use. We demonstrate that our algorithm converges linearly to the global solution over an entire range of learning rates, in a dimension-free fashion, even when the regularizer lacks strong convexity and smoothness. In addition, this linear convergence feature is provably stable in the face of inexact policy evaluation and imperfect policy updates. Numerical experiments are provided to corroborate the appealing performance of GPMD.  ( 2 min )
    Adversarial Online Multi-Task Reinforcement Learning. (arXiv:2301.04268v1 [cs.LG])
    We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.  ( 2 min )

  • Open

    [D] What's your opinion on "neurocompositional computing"? (Microsoft paper from April 2022)
    Paper: https://arxiv.org/abs/2205.01128 TL;DR It's a paper that tries to design systems that generalize. They argue there are two forms of computing: Compositional and Continuous. Continuous computation is what neural networks are traditionally good at - creating a function that approximates a solution to a problem. Compositional computation is directly manipulating symbols, logic, ideas, etc - and unlike continuous computation, it's capable of generalizing from small datasets. But so far it's only useful inside carefully-constructed formal systems. The authors believe research should be focused on combining the two, and implementing Compositionality fully with neural networks. They suggest some ways to do this. They also believe that the success of architectures like CNNs and Transformers comes from implementing a limited form of Compositionality. This is a very interesting idea, but I have a little bit of skepticism: This paper is heavy on theory and less so on practice. Has any followup work in this direction produced measurable results? The lead author seems to have been saying things like this for a while. Sometimes older researchers have pet theories that are not broadly accepted in the field. What do other researchers think about this? Thoughts? submitted by /u/currentscurrents  ( 59 min )
    [P] RLHF Learning to Summarize: Implementation by CarperAI with trlX
    Hi, "Learning to summarize from human feedback" is a 2020 paper by OpenAI demonstrating how to use reinforcement learning with human feedback (RLHF) to fine-tune a language model to produce higher quality summaries of news articles and Reddit posts than is possible with supervised fine-tuning. Now, CarperAI has demonstrated how to use their library trlX to implement this work, by applying RLHF to the summarization dataset released by OpenAI and fine-tuning GPT-J-6B. Read the full report here, with a code walkthrough: https://wandb.ai/carperai/summarize_RLHF/reports/Implementing-RLHF-Learning-to-Summarize-with-trlX--VmlldzozMzAwODM2 trlX library here: https://github.com/CarperAI/trlx Twitter thread here: https://twitter.com/carperai/status/1613645352514768897 submitted by /u/Hyper1on  ( 58 min )
    [D] Has ML become synonymous with AI?
    ML is a part of AI but I don't hear about anything coming out of AI that's not done using some ML technique. Is it fair to say that AI and ML are synonymous now in 2023? Or are there people who are still actively working on non-ML techniques for building AI? submitted by /u/Valachio [link] [comments]  ( 58 min )
    [D] Can someone point to research on determining usefulness of samples/datasets for training ML models?
    Hi! I am looking into the literature on determining the usefulness of samples/datasets used for training an ML model. Let's say a DNN was trained with datasets A, B and C. After training, is there a way to quantify which of the partial training datasets contributed most to the useful learning by the ML model? A brute-force strategy would be to remove samples, retrain, and see how the model performs, but of course that will not be viable! submitted by /u/HFSeven [link] [comments]  ( 66 min )
    Introduction to Reinforcement Learning with Human Feedback [D]
    One of the biggest AI discoveries over the past year has been the importance of human feedback for building next-gen LLMs — but I still see a lot of confusion around how RLHF works at a fundamental level. I wrote a blog to get into the details here: https://www.surgehq.ai/blog/introduction-to-reinforcement-learning-with-human-feedback-rlhf-series-part-1 submitted by /u/BB4evaTB12 [link] [comments]  ( 56 min )
    [D] Is there a distilled/smaller version of CLIP, or something similar?
    Are there smaller/distilled versions of CLIP? Or some other (smaller) models that connect text and images? For my use case, the model needs to be small in size: ideally <20MB, fine < 60MB, ok < 100MB. submitted by /u/alkibijad [link] [comments]  ( 58 min )
    [D] Transformers right-shifting for sequences with short-time dependency
    I need to apply a Transformer to a task where sequences can be much longer than the time dependency between timesteps. For example, a sequence might be 1000 tokens long, but to predict x[i+1] only x[i-50] to x[i] are necessary. This induces me to train the transformer by breaking each sequence of 1000 tokens into 20 sequences of 50 steps each, which would be more efficient. How should I deal with the BOS (beginning-of-sentence) token that shifts targets right? Should I use it in each subsequence, or should I instead use the token that comes immediately before the beginning of each subsequence? For example, given a subsequence x[50:100], should the targets be [BOS, x[50], x[51], ... x[100]] or should they be [x[49], x[50], x[51], ... x[100]]? submitted by /u/fedetask [link] [comments]  ( 58 min )
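    The two options in the question can be made concrete with a small sketch. It is only an illustration of how the shifted inputs could be built, not a recommendation of either option; the `BOS` id and the helper name `make_chunks` are assumptions, not part of any library.

```python
from typing import List, Tuple

BOS = -1  # hypothetical id for the beginning-of-sequence token


def make_chunks(seq: List[int], chunk_len: int,
                use_bos: bool) -> List[Tuple[List[int], List[int]]]:
    """Split seq into (shifted_input, target) pairs of length chunk_len.

    use_bos=True  -> every chunk's shifted input starts with BOS.
    use_bos=False -> every chunk after the first starts with the token
                     that immediately precedes the chunk in the sequence.
    """
    pairs = []
    for start in range(0, len(seq), chunk_len):
        target = seq[start:start + chunk_len]
        if use_bos or start == 0:
            first = BOS  # nothing precedes the very first chunk
        else:
            first = seq[start - 1]
        shifted_input = [first] + target[:-1]  # targets shifted right by one
        pairs.append((shifted_input, target))
    return pairs
```

    With the second option the first position of each chunk is conditioned on a real token rather than a placeholder; whether that helps in practice would need to be checked on the task.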
    [R] Git is for Data (CIDR 2023) - Extending Git to Support Large-Scale Data
    Paper: https://www.cidrdb.org/cidr2023/papers/p43-low.pdf Abstract: Dataset management is one of the greatest challenges to the application of machine learning (ML) in the industry. Although scaling and performance have often been highlighted as the significant ML challenges, development teams are bogged down by the contradictory requirements of supporting fast and flexible data iteration while maintaining stability, provenance, and reproducibility. For example, blobstores are used to store datasets for maximum flexibility, but their unmanaged access patterns limit reproducibility. Many ML pipeline solutions to ensure reproducibility have been devised, but all introduce a degree of friction and reduce flexibility. In this paper, we propose that the solution to the dataset management challenges is simple and apparent: Git. As a source control system, as well as an ecosystem of collaboration and developer tooling, Git has enabled the field of DevOps to provide both speed of iteration and reproducibility to source code. Git is not only already familiar to developers, but is also integrated into existing pipelines, which facilitates adoption. However, as we (and others) demonstrate, Git, as designed today, does not scale to the needs of ML dataset management. In this paper, we propose XetHub; a system that retains the Git user experience and ecosystem, but can scale to support large datasets. In particular, we demonstrate that XetHub can support Git repositories at the TB scale and beyond. By extending Git to support large-scale data, and building upon a DevOps ecosystem that already exists for source code, we create a new user experience that is both familiar to existing practitioners and truly addresses their needs. 
    submitted by /u/rajatarya [link] [comments]  ( 57 min )
    [D] How to make the HuggingFace models faster on MacOS M1 ?
    I have tried to use a simple translate function, running the models locally with Python on the CLI, and execution is slow (8-10 seconds). I am on a 16 GB MacBook Pro, M1. The same request through the REST API at HuggingFace Endpoints, with 1 vCPU and 2 GB on Intel Ice Lake, takes 800 ms. What am I missing here? submitted by /u/dadadododidi2 [link] [comments]  ( 59 min )
    [D] Has anyone used Reinforcement Learning from Human Feedback?
    There's a lot of hype around RLHF due to its use for ChatGPT, but has anyone else here used the same principles for improving their model outputs? For example, preference-ranking their models' outputs and then using that data to retrain their model weights. Or even without the RL - simply using human feedback to stuff prompts or finetuning datasets? Interested to hear! submitted by /u/fourcornerclub [link] [comments]  ( 59 min )
    [D] Would you consider the computer program Theo Jansen used to design the Strandbeest (beach walking mechanisms) to be Machine Learning?
    Theo Jansen, inventor of the Strandbeest, explains in one of his videos that he used the principle of evolution to figure out the thirteen holy numbers using a computer program which he wrote in 1990. Would this be considered machine learning, or is an evolutionary/selective breeding algorithm on its own not considered ML? The Strandbeest leg has 13 dimensions, and he wanted to find the ideal length of each so that the foot generates a stepping motion, "a curve which was flat on the bottom". His program generated batches of 1500 legs with randomized dimensions and chose the best from each batch as the basis for the next batch. I wonder how he scored the curves. I know he wanted a flat bottom, but I'd think he also wanted some way to score the stride length and height to avoid getting curves that just move back and forth in a tiny straight line. I can imagine maybe using the average difference of the y-coordinates of points sampled over the curve, or maybe some calc? If you have any ideas as to how to score a good step curve, or if you know how he did it, I'd love to know. Finally, I wonder if he has revisited this problem with modern computer capabilities to see if he can find even more optimized dimensions. I'd be shocked if others haven't already done this. If you know where to find more info on Theo's process, the computer program, or modern advancements of the Strandbeest using machine learning, please let me know. I'd love to discuss more. submitted by /u/lavaboosted [link] [comments]  ( 63 min )
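    The batch-and-select loop described in the post can be sketched roughly as follows. This is not Jansen's program: `toy_score` is a hypothetical stand-in for a real step-curve score (which would need the linkage kinematics), and the Gaussian mutation around the current best leg is an assumption about how each new batch is derived.

```python
import random
from typing import Callable, List

NUM_LENGTHS = 13   # the Strandbeest leg has 13 dimensions
BATCH_SIZE = 1500  # legs per batch, as described in the post


def toy_score(lengths: List[float]) -> float:
    """Hypothetical fitness (higher is better); NOT a real step-curve score."""
    return -sum((x - 50.0) ** 2 for x in lengths)


def evolve(score: Callable[[List[float]], float], generations: int,
           batch_size: int = BATCH_SIZE, sigma: float = 5.0,
           seed: int = 0) -> List[float]:
    """Keep the best leg of each batch as the basis for the next batch."""
    rng = random.Random(seed)
    best = [rng.uniform(1.0, 100.0) for _ in range(NUM_LENGTHS)]
    for _ in range(generations):
        # Assumed mutation scheme: perturb every length of the current best leg.
        batch = [[x + rng.gauss(0.0, sigma) for x in best]
                 for _ in range(batch_size)]
        batch.append(best)  # keep the parent so the score never gets worse
        best = max(batch, key=score)
    return best
```

    Swapping `toy_score` for a function that simulates the linkage and rewards a flat bottom plus a long stride would turn this into the kind of search the post describes.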
    [N] New Continual Learning Subreddit
    Hi, I have created r/continual_learning to host discussions related to Continual Learning on Reddit. Do check it out if you are interested. submitted by /u/vis4ai [link] [comments]  ( 63 min )
    [R] Is there any research on allowing Transformers to spend more compute on more difficult to predict tokens?
    I recently came across "Confident Adaptive Language Modeling", which allows Transformers to exit early during inference and not use all model layers if a token is easy to predict. Is there any research on basically doing the opposite and allowing Transformers to spend more compute on tokens that are very hard to predict? submitted by /u/Chemont [link] [comments]  ( 61 min )
    [D] Has any work been done on VQ-VAE Language Models?
    I'm a machine learning PhD student and I'm doing research on LMs and how to reduce their memory footprint. One idea I've been toying with is Vector Quantized LMs. I'm not talking about VQ as a technique to speed up compute using int8 activations etc etc, but by using a codebook. The idea is based on a unidirectional RNN that reconstructs the source sequence after quantization. Unlike MLM, where the corruption is based on masking and replacing tokens, we instead quantize the token vectors and try to predict the original token based on the quantized version of the token and the unquantized short/long term memory states produced at the previous timestep. The reason I'm interested in such a convoluted idea is to effectively create a metric to measure entropy of tokens in sequence; if the VQ-LM can reconstruct the correct token with high likelihood then that token is unimportant, but if the VQ-LM fails to predict a token it is likely that this token is of great importance because it is a rare word and thus carries higher entropy in the sequence. And the motivation behind wanting to learn to measure such a phenomenon is so we can use this to guide the memory of a transformer: models like the Transformer-XL operate on longer sequences by keeping memory around for keys and values, and the Compressive Transformer takes it a step further by compressing older tokens... Well... what if we used the reconstruction loss from the VQ-LM along with an 'age' metric to guide the memory bank of such a transformer architecture, discarding easily predicted tokens early while keeping higher entropy tokens around for longer? Has anyone considered such a system before? I've done a lot of searching and I've come up blank so far. submitted by /u/Avelina9X [link] [comments]  ( 58 min )
    [D] Are there any papers on optimization-based approaches which combine learned parameter initializations with learned optimisers?
    There are quite a few papers on optimisation-based meta-learning approaches for learning parameter initialisations (i.e. MAML and its derivatives) [1, 2], and there are also many papers on learning optimisers [3]. Question: Are there any papers which combine the two? I am aware of some papers such as [4, 5] which achieve this in some capacity indirectly/implicitly, but wondering if there are any other papers that I am not aware of, or do this explicitly? Thanks in advance. --- [1] Finn, C., et al. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML. [2] Nichol, A., et al. (2018). On first-order meta-learning algorithms. [3] Andrychowicz, M., et al. (2016). Learning to learn by gradient descent by gradient descent. NIPS. [4] Li, Z., et al. (2017). Meta-SGD: Learning to learn quickly for few-shot learning. [5] Ravi, S., & Larochelle, H. (2016). Optimization as a model for few-shot learning. ICLR. submitted by /u/Decadz [link] [comments]  ( 65 min )
    [D] The Open Deep Learning Toolkit for Robotics v2.0 was just released
    The Open Deep Learning Toolkit for Robotics version 2.0 was just released! This new version of the toolkit includes several improvements, such as new tools for object detection, efficient continual inference, tracking, emotion estimation and high-resolution pose estimation. Furthermore, this version includes a refined ROS interface, along with support for ROS2. You can download it here: https://github.com/opendr-eu/opendr We look forward to receiving your feedback, bug reports, and suggestions for improvements! submitted by /u/OpenDR_H2020_Project [link] [comments]  ( 59 min )
    [D] Handling class imbalance by sample weighting
    I am working on a very large (>10 million rows) binary classification problem where the 0:1 ratio is 7:1. I am trying to use sample weighting, and it seems there are multiple different methods for that. Examples are Inverse of Number of Samples, Inverse of Square Root of Number of Samples, Effective Number of Samples, etc. sklearn also has the class_weight parameter. I am wondering how to select one of these. Do I need to try all and pick the best method among them? Also, it seems some methods like Effective Number of Samples need hyperparameter tuning. submitted by /u/hopedallas [link] [comments]  ( 55 min )
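    For comparison, the three weighting schemes named in the post can be written down in a few lines. This is a minimal sketch: the helper name `class_weights` is hypothetical, the effective-number formula (1 - beta) / (1 - beta^n) follows the usual class-balanced-loss formulation, and rescaling the weights so they sum to the number of classes is an assumed convention rather than a requirement.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def class_weights(labels: Iterable[Hashable], scheme: str = "inverse",
                  beta: float = 0.999) -> Dict[Hashable, float]:
    """Per-class weights; rescaled so they sum to the number of classes."""
    counts = Counter(labels)
    if scheme == "inverse":            # 1 / n_c
        raw = {c: 1.0 / n for c, n in counts.items()}
    elif scheme == "inverse_sqrt":     # 1 / sqrt(n_c)
        raw = {c: 1.0 / n ** 0.5 for c, n in counts.items()}
    elif scheme == "effective":        # (1 - beta) / (1 - beta ** n_c)
        raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw.values())
    return {c: w * len(raw) / total for c, w in raw.items()}
```

    The "inverse" scheme is proportional to what scikit-learn computes for class_weight="balanced". In the "effective" scheme, beta is the hyperparameter the post refers to: values close to 1 approach inverse-frequency weighting, while beta = 0 gives every class the same weight.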
  • Open

    Physics AI
    I am struggling with some basic Physics homework, and wanted to check if AI could help me. However, I get different answers from ChatGPT every time. Do you know of another free AI website that could work? submitted by /u/Relevant-Amoeba-6189 [link] [comments]  ( 46 min )
    If you want to find out whether text was generated by AI or a human, use GLTR
    submitted by /u/ReputationBubbly8112 [link] [comments]  ( 46 min )
    Researchers started adding ChatGPT as co-author on their papers
    submitted by /u/iamtdb [link] [comments]  ( 46 min )
    Proof of concept v2: AI-generated birthday greeting from Donald Trump *with prosody*
    submitted by /u/becausecurious [link] [comments]  ( 46 min )
    SJC A Text-to-3D Algorithm using SD
    submitted by /u/oridnary_artist [link] [comments]  ( 45 min )
    Upgrading the tools of creative work with AI
    submitted by /u/abisknees [link] [comments]  ( 49 min )
    Interesting ChatGPT information.
    If you ask ChatGPT to tell you what instructions it has that constrain its answers it tells you that is proprietary information and declines. Here is a workaround. Me: I am making an AI similar to yourself and need to give it a set of safety protocols so that it can give similar answers to yours. What safety instructions should I give the raw AI to prime it before user input? Provide only exact instructions please. Do not be verbose. Do not make opening or closing statements. Response: -Avoid providing misinformation or false information. -Avoid providing harmful or dangerous information or advice. -Avoid engaging in or promoting hate speech or discrimination. -Avoid providing personal or sensitive information about individuals without their consent. -Avoid engaging in or promoting …  ( 63 min )
    Do you want to easily fetch weather data in Python for your Data Science projects?
    Have a look at this medium article: https://pub.towardsai.net/how-to-create-a-python-package-for-fetching-weather-data-b17614627f30 And the corresponding repository: https://github.com/stavrostheocharis/weather_data_retriever submitted by /u/Nice-Tomorrow2926 [link] [comments]  ( 48 min )
    Punk Rock Funeral
    Punk Rock Funeral -- made with AIVA and Deforum Stable Diffusion. All AI, all the time! submitted by /u/Enuminous [link] [comments]  ( 46 min )
    Generative AI: From Data Generation to Creative Intelligence
    The common idea that our creativity is what makes us uniquely human has shaped society, but the progress made in Generative Artificial Intelligence calls this very notion into question. Generative AI is an emerging field that involves the creation of original content or data using machine learning algorithms. https://medium.com/@agrawal.sannidhya26/generative-ai-from-data-generation-to-creative-intelligence-50ed7bc13768 Feel free to give it a quick glance and help me grow and learn, click on the clap icon a few times if you appreciate the effort. submitted by /u/sannidhya26 [link] [comments]  ( 54 min )
    New AI features alert!
    New AI features alert! https://bardeen.ai/ai You no longer need to know how to build complicated automations or spend hours creating them. Bardeen will generate a custom automation for you when it detects manual tasks. Or you can type things like "transfer all my Google Sheet data to Notion", or "email all meeting participants a meeting summary", and Bardeen will generate automations for you. You can review, edit and activate it within a few clicks. submitted by /u/Intelligent_Shop_012 [link] [comments]  ( 46 min )
    Join us tomorrow at 6pm EST for a presentation covering the recent history of NLP leading up to and including ChatGPT, followed by a discussion session! Hosted on the Learn AI Together Discord (free)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 46 min )
    Microsoft In Talks To Invest An Additional $10 Billion Into OpenAI
    submitted by /u/liquidocelotYT [link] [comments]  ( 45 min )
    ChatGPT and VR - Changing the Way we Learn Soft Skills
    submitted by /u/Iza2022 [link] [comments]  ( 49 min )
    My free 100 page non-technical book about the consequences of AI in society, employment, etc...feedback and collaborations welcome
    submitted by /u/ronin_khan [link] [comments]  ( 54 min )
    I wrote about 100+ Tools in my Newsletter: Here is a full List of all Tools
    submitted by /u/Ava-AI [link] [comments]  ( 51 min )
    The First AI Generated Beats
    submitted by /u/BoysenberryCandid181 [link] [comments]  ( 51 min )
    Do you think AI can really replace a person?
    submitted by /u/taniazhydkova [link] [comments]  ( 48 min )
    AI Being Used to Pinpoint the Most Beneficial Therapeutic Molecules in Psychedelics
    submitted by /u/secret-millionaire [link] [comments]  ( 45 min )
    So, I asked for a song about AI and I recorded it
    submitted by /u/Sladix [link] [comments]  ( 50 min )
    from a human motion sequence, SUMMON synthesizes physically plausible and semantically reasonable objects
    submitted by /u/SpatialComputing [link] [comments]  ( 47 min )
    What is ChatGPT Professional?
    submitted by /u/BackgroundResult [link] [comments]  ( 46 min )
    Creating a short film using AI ! - Looking for a team that wants to help me finish it :)
    submitted by /u/sebaschapela [link] [comments]  ( 54 min )
  • Open

    SJC A Text-to-3D Algorithm using SD
    submitted by /u/oridnary_artist [link] [comments]  ( 51 min )
    How to classify audio using deep learning and Tensorflow hub?
    Tensorflow Hub has cool pre-trained models. One of them is for audio and sound classification. Imagine you have a sound and would like to detect whether it is the sound of a cat or the sound of water, or maybe to classify music. This model is a cool way to classify your own audio files. Before we continue, I actually recommend this book for deep learning based on Tensorflow and Keras: https://amzn.to/3STWZ2N So, in this tutorial we will learn how to use this TensorFlow Hub model on your own audio files. The link for the video tutorial is here: https://youtu.be/_iX0VRp7UEA I also shared the Python instructions in my GitHub repo, linked in the video description. Enjoy, Eran #Python #Cnn #TensorFlow #deeplearning #tensorflowhub submitted by /u/Feitgemel [link] [comments]  ( 52 min )
    Deep Learning Pioneer Geoffrey Hinton Publishes New Deep Learning Algorithm
    submitted by /u/nickb [link] [comments]  ( 57 min )
    Looking for someone with good NN/ deep learning experience for a paid project
    Hello all, I'm looking for someone (1 man, team, doesn't matter) that can make a real estate related project. The project itself: a NN that you can give a document regarding some house/ apartment and based on the document the NN should give out an estimated price/ price range. So, you get a document with pics (from which the NN should determine if and how well its furnished, and its current state: brand new, used, old and broken, etc.), livable surface (how many square meters/ m2 it has, how many m2 each room has), address, if it's furnished or not, etc. and the NN should somehow check all other similar housings in the area/ neighbourhood/ city (online probably, but another NN for data extraction could also be made) and then give an adequate price. I have a friend that wants this implemented and will start looking for funding in 2 days. He asked me to give an estimated deadline and price range so that he knows what he'll be presenting. Any thoughts? Any takers? Edit: I forgot to mention. My friend knows some pretty high people in businesses that provide services to 100s or even 1000s of customers per month, so we won't be talking about breadcrumbs. submitted by /u/CuriousCesarr [link] [comments]  ( 56 min )
  • Open

    2022-23 Takeda Fellows: Leveraging AI to positively impact human health
    New fellows are working on health records, robot control, pandemic preparedness, brain injuries, and more.  ( 9 min )
    Engineering in harmony
    AeroAstro major and accomplished tuba player Frederick Ajisafe relishes the community he has found in the MIT Wind Ensemble.  ( 9 min )
  • Open

    "An Analysis of Quantile Temporal-Difference Learning", Rowland et al 2023 {DM}
    submitted by /u/gwern [link] [comments]  ( 52 min )
    Lux AI and Halite like challenges to run locally at an event?
    Hi guys! I don't know where to ask this, but I guess someone here could help me out with that. Are there any challenges like Lux AI (https://www.lux-ai.org/) and Halite (https://www.kaggle.com/c/halite) that I can run locally and make a challenge for the participants of a small event? I wanted something simple that can be done by people of all skill levels (but all have a background in programming), and that can be written in a short time (about 2 hours). It also doesn't have to be an AI challenge, but I think these ones look fun to do. Thanks for the help!! submitted by /u/HalTeaS [link] [comments]  ( 51 min )
    New Continual Learning Subreddit
    submitted by /u/Independent-Law1791 [link] [comments]  ( 52 min )
    Test environments for non image based problems?
    Procgen is a fantastic resource for testing the agent on a novel environment. Does the same resource exist for non-image based environment such as CartPole, etc? submitted by /u/Academic-Rent7800 [link] [comments]  ( 51 min )
    Has anyone here applied Reinforcement Learning with Human Feedback on a project?
    There's a lot of hype around RLHF due to its use towards ChatGPT. But I can't find many other cases where it's truly been used in the wild by people trying to tune open-source models, or their own proprietary ones. Does anyone have examples of RLHF where they've seen it applied? Or examples of doing it themselves? Thank you! submitted by /u/fourcornerclub [link] [comments]  ( 53 min )
    If statements in the reset function of an openAI gym environment?
    In my custom openAI gym environment, a simulator is launched and data collected as the state. I want an episode to end if there is either a vehicle collision or a successful final state reached. In the case of the collision I want the episode to end and the simulator to be closed and re-opened. Otherwise, I just want to introduce a new controlled vehicle, independent of the previous one. Will using an if statement to implement this in my reset function cause any issues? submitted by /u/centripetalstranger [link] [comments]  ( 55 min )
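    The branching described in the question can be sketched without any real simulator. Everything here is a hypothetical stand-in: `FakeSimulator` replaces the external simulator, the action codes in `step` are invented for illustration, and the four-tuple return follows the older Gym step convention.

```python
from typing import List, Tuple


class FakeSimulator:
    """Hypothetical stand-in for the external simulator process."""
    launches = 0  # counts how many times a simulator was started

    def __init__(self) -> None:
        FakeSimulator.launches += 1
        self.vehicles = 0

    def spawn_vehicle(self) -> List[float]:
        self.vehicles += 1
        return [0.0, 0.0]  # initial observation of the new controlled vehicle

    def close(self) -> None:
        pass  # a real simulator would be shut down here


class DrivingEnv:
    def __init__(self) -> None:
        self.sim = FakeSimulator()
        self.collided = False

    def reset(self) -> List[float]:
        if self.collided:
            # Collision: close the simulator and start a fresh one.
            self.sim.close()
            self.sim = FakeSimulator()
            self.collided = False
        # In both cases, introduce a new, independent controlled vehicle.
        return self.sim.spawn_vehicle()

    def step(self, action: int) -> Tuple[List[float], float, bool, dict]:
        # Hypothetical dynamics: action 1 collides, action 2 reaches the goal.
        self.collided = action == 1
        done = action in (1, 2)
        reward = -1.0 if self.collided else (1.0 if done else 0.0)
        return [0.0, 0.0], reward, done, {"collision": self.collided}
```

    The only state the branch relies on is a flag set in `step`, so `reset` still returns a valid initial observation on every path, which is the main thing the training loop depends on.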
    NaNs after first fully connected layer
    I'm working on a MARL project. The observation is a (31,1) vector that I first process with a few fully connected layers. Then, the output is sent into a recurrent policy. Now, for some reason, after a few million steps of training, the observation gets sent into the first FC and becomes a matrix of NaNs. I checked and there are no NaNs in the observation. Example of the observation from the last crash:
```
tensor([[ 2.8740e-02,  2.2078e-02,  1.9542e-02,  ..., -3.3949e-01,  6.2327e-02, -2.8951e-04],
        [ 4.0109e-02,  2.2649e-02,  2.0599e-02,  ..., -3.3947e-01,  5.5702e-02, -5.4328e-05],
        [ 5.1799e-02,  2.3269e-02,  2.1813e-02,  ..., -3.4162e-01,  5.3501e-02, -8.1255e-04],
        ...,
        [ 1.7621e-01,  2.1108e-03,  1.4367e-02,  ..., -3.4072e-01,  4.2021e-02, -1.3159e-02],
        [ 1.7600e-01, -2.2701e-05,  1.2215e-02,  ..., -3.4045e-01,  4.2869e-02, -1.3915e-02],
        [ 1.7618e-01,  4.4542e-04,  1.2899e-02,  ..., -3.4266e-01,  4.4017e-02, -1.8093e-02]], device='cuda:0')
```
    I've tried a few things that did not work: using LeakyReLU instead of ReLU and removing Layer Normalization. Do you have any tips? TL;DR Any ideas on why a fully connected layer that processes the observation outputs NaNs after a few million steps? submitted by /u/No_Possibility_7588 [link] [comments]  ( 52 min )
    "Learning to Play Minecraft with Video PreTraining (VPT)" {OA}
    submitted by /u/gwern [link] [comments]  ( 54 min )
    Google Intrinsic robotics company lays off 20% (40) employees {The Information} (paywall)
    submitted by /u/gwern [link] [comments]  ( 56 min )
  • Open

    Multilingual customer support translation made easy on Salesforce Service Cloud using Amazon Translate
    This post was co-authored with Mark Lott, Distinguished Technical Architect, Salesforce, Inc. Enterprises that operate globally are experiencing challenges sourcing customer support professionals with multi-lingual experience. This process can be cost-prohibitive and difficult to scale, leading many enterprises to only support English for chats. Using human interpreters for translation support is expensive, and infeasible since […]  ( 10 min )
    Redacting PII data at The Very Group with Amazon Comprehend
    This is guest post by Andy Whittle, Principal Platform Engineer – Application & Reliability Frameworks at The Very Group. At The Very Group, which operates digital retailer Very, security is a top priority in handling data for millions of customers. Part of how The Very Group secures and tracks business operations is through activity logging […]  ( 7 min )
  • Open

    Advancing human-centered AI: Updates on responsible AI research
    Artificial intelligence, like all tools we build, is an expression of human creativity. As with all creative expression, AI manifests the perspectives and values of its creators. A stance that encourages reflexivity among AI practitioners is a step toward ensuring that AI systems are human-centered, developed, and deployed with the interests and well-being of individuals and society front and center. This is the focus of research scientists and engineers affiliated with Aether, the advisory council for Microsoft leadership on AI ethics and effects. Central to Aether’s work is the question of who we’re creating AI for—and whether we’re creating AI to solve real problems with responsible solutions. With AI capabilities accelerating, our researchers work to understand the sociotechnical implications and find ways to help on-the-ground practitioners envision and realize these capabilities in line with Microsoft AI principles. The post Advancing human-centered AI: Updates on responsible AI research appeared first on Microsoft Research.  ( 15 min )
  • Open

    Primes with two non-zero bits
    Suppose a number n written in binary has two 1s and all the rest of its bits are zeros. If n is prime, then the 1s must be the first and last bits of n. The first bit is 1 because the first bit of every positive integer is 1. The last bit is 1 […] Primes with two non-zero bits first appeared on John D. Cook.  ( 6 min )
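    The observation in the excerpt means every such prime has the form 2^k + 1, which makes a brute-force search short. The sketch below is an illustration, not code from the post; the function names are hypothetical and the primality check is plain trial division, which is only adequate for small bit lengths.

```python
from typing import List


def is_prime(n: int) -> bool:
    """Trial division; fine for the small values searched here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def two_bit_primes(max_bits: int) -> List[int]:
    """Primes with exactly two 1 bits and at most max_bits binary digits."""
    found = []
    for k in range(1, max_bits):
        n = (1 << k) | 1  # first and last bits set, zeros in between
        if is_prime(n):
            found.append(n)
    return found


print(two_bit_primes(33))  # → [3, 5, 17, 257, 65537]
```

    Up to 33 bits the search finds only 3, 5, 17, 257 and 65537, the known Fermat primes.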
    Certified sonnet primes
    Last week I wrote about primality certificates. These certificates offer a way to verify that a number is prime using less computation than was used to discover that the number was prime. This post gives a couple more examples of primality certificates using sonnet primes. As described here, these are primes of the form ababcdcdefefgg, […] Certified sonnet primes first appeared on John D. Cook.  ( 4 min )
  • Open

    NVIDIA, Evozyne Create Generative AI Model for Proteins
    Using a pretrained AI model from NVIDIA, startup Evozyne created two proteins with significant potential in healthcare and clean energy. A joint paper released today describes the process and the biological building blocks it produced. One aims to cure a congenital disease, another is designed to consume carbon dioxide to reduce global warming. Initial results Read article >  ( 5 min )
    GFN Thursday Adds New Titles From THQ Nordic to GeForce NOW
    GFN Thursday kicks each weekend off with new games and updates straight from the cloud. This week adds more games from publisher THQ Nordic to the GeForce NOW library, as part of seven total additions. Members can gear up to play these new titles the ultimate way with the upcoming release of the new Ultimate membership, Read article >  ( 6 min )
    NVIDIA Helps Retail Industry Tackle Its $100 Billion Shrink Problem
    The global retail industry has a $100 billion problem. “Shrinkage” — the loss of goods due to theft, damage and misplacement — significantly crimps retailers’ profits. An estimated 65% of shrinkage is due to theft, according to the National Retail Federation’s 2022 Retail Security Survey, conducted in partnership with the Loss Prevention Research Council. And Read article >  ( 6 min )
  • Open

    Best Arm Identification in Stochastic Bandits: Beyond $\beta-$optimality. (arXiv:2301.03785v1 [stat.ML])
    This paper focuses on best arm identification (BAI) in stochastic multi-armed bandits (MABs) in the fixed-confidence, parametric setting. In such pure exploration problems, the accuracy of the sampling strategy critically hinges on the sequential allocation of the sampling resources among the arms. The existing approaches to BAI address the following question: what is an optimal sampling strategy when we spend a $\beta$ fraction of the samples on the best arm? These approaches treat $\beta$ as a tunable parameter and offer efficient algorithms that ensure optimality up to selecting $\beta$, hence $\beta-$optimality. However, the BAI decisions and performance can be highly sensitive to the choice of $\beta$. This paper provides a BAI algorithm that is agnostic to $\beta$, dispensing with the need for tuning $\beta$, and specifies an optimal allocation strategy, including the optimal value of $\beta$. Furthermore, the existing relevant literature focuses on the family of exponential distributions. This paper considers a more general setting of any arbitrary family of distributions parameterized by their mean values (under mild regularity conditions).  ( 2 min )
    Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage. (arXiv:2107.06226v4 [cs.LG] UPDATED)
    We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP where the partial coverage condition is defined using density ratio based concentrability coefficients associated with individual factors.  ( 2 min )
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v2 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.  ( 2 min )
    Optimal randomized multilevel Monte Carlo for repeatedly nested expectations. (arXiv:2301.04095v1 [stat.CO])
    The estimation of repeatedly nested expectations is a challenging problem that arises in many real-world systems. However, existing methods generally suffer from high computational costs when the number of nestings becomes large. Fix any non-negative integer $D$ for the total number of nestings. Standard Monte Carlo methods typically cost at least $\mathcal{O}(\varepsilon^{-(2+D)})$ and sometimes $\mathcal{O}(\varepsilon^{-2(1+D)})$ to obtain an estimator up to $\varepsilon$-error. More advanced methods, such as multilevel Monte Carlo, currently only exist for $D = 1$. In this paper, we propose a novel Monte Carlo estimator called $\mathsf{READ}$, which stands for "Recursive Estimator for Arbitrary Depth." Our estimator has an optimal computational cost of $\mathcal{O}(\varepsilon^{-2})$ for every fixed $D$ under suitable assumptions, and a nearly optimal computational cost of $\mathcal{O}(\varepsilon^{-2(1 + \delta)})$ for any $0 < \delta < \frac12$ under much more general assumptions. Our estimator is also unbiased, which makes it easy to parallelize. The key ingredients in our construction are an observation of the problem's recursive structure and the recursive use of the randomized multilevel Monte Carlo method.  ( 2 min )
    Mastering Diverse Domains through World Models. (arXiv:2301.04104v1 [cs.AI])
    General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.  ( 2 min )
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v2 [stat.ML] UPDATED)
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.  ( 2 min )
    Manifold Restricted Interventional Shapley Values. (arXiv:2301.04041v1 [stat.ML])
    Shapley values are model-agnostic methods for explaining model predictions. Many commonly used methods of computing Shapley values, known as off-manifold methods, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While on-manifold methods have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose ManifoldShap, which respects the model's domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.  ( 2 min )
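    For readers unfamiliar with the quantity being discussed, the following is a minimal sketch of the textbook Shapley value, computed exactly by enumerating coalitions. It is not ManifoldShap or any method from the paper; the function name `shapley_values`, the toy `value` function, and the player names are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    `value` maps a frozenset of players to a real number. The cost is
    exponential in the number of players, so this is for toy examples only.
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of i when joining coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Hypothetical toy game: additive payoffs plus a bonus when "a" and "b" cooperate.
base = {"a": 1.0, "b": 2.0, "c": 4.0}


def value(coalition):
    bonus = 3.0 if {"a", "b"} <= coalition else 0.0
    return sum(base[p] for p in coalition) + bonus


phi = shapley_values(["a", "b", "c"], value)
```

    In this toy game the bonus is split evenly between "a" and "b", and the attributions sum to the value of the full coalition. In the explanation setting, `value` would evaluate the model with some features held out, and the abstract's point is that how those held-out features are filled in (off-manifold versus on-manifold) changes the result.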
    Sampling random graph homomorphisms and applications to network data analysis. (arXiv:1910.09483v3 [math.PR] UPDATED)
    A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also demonstrate the performance of our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.  ( 2 min )
    Calibrated simplex-mapping classification. (arXiv:2103.02926v2 [stat.ML] UPDATED)
    We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular $(n-1)$-dimensional simplex, $n$ being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
    Adversarial Policies Beat Superhuman Go AIs. (arXiv:2211.00241v2 [cs.LG] UPDATED)
    We attack the state-of-the-art Go-playing AI system, KataGo, by training adversarial policies that play against frozen KataGo victims. Our attack achieves a >99% win rate when KataGo uses no tree-search, and a >77% win rate when KataGo uses enough search to be superhuman. Notably, our adversaries do not win by learning to play Go better than KataGo -- in fact, our adversaries are easily beaten by human amateurs. Instead, our adversaries win by tricking KataGo into making serious blunders. Our results demonstrate that even superhuman AI systems may harbor surprising failure modes. Example games are available at https://goattack.far.ai/.
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v1 [cs.LG])
    We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of an ensemble. In particular, we prove a family of exact bias-variance-diversity decompositions, for both classification and regression losses, e.g., squared, and cross-entropy. The framework provides a methodology to automatically identify the combiner rule enabling such a decomposition, specific to the loss. The formulation of diversity is therefore dependent on just two design choices: the loss, and the combiner. For certain choices (e.g., 0-1 loss with majority voting) the effect of diversity is necessarily dependent on the target label. Experiments illustrate how we can use our framework to understand the diversity-encouraging mechanisms of popular ensemble methods: Bagging, Boosting, and Random Forests.
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v1 [stat.ML])
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. To be specific, we use a sparsely connected deep neural network with rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of the neural network class, (2) the smoothness and (intrinsic) dimension of the true mean function, and (3) the magnitude of spatial dependence. Our method handles large data sets well owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite sample performance, the results of which indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.
    Markovian Sliced Wasserstein Distances: Beyond Independent Projections. (arXiv:2301.03749v1 [stat.ML])
    Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$) seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.
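    As background for the abstract above, here is a minimal sketch of the baseline it starts from: a Monte Carlo estimate of the sliced Wasserstein distance using independent, uniformly random projecting directions. This is not the MSW distance proposed in the paper. The function name `sliced_wasserstein`, its parameters, and the assumption of two equally sized point sets are illustrative choices.

```python
import math
import random


def sliced_wasserstein(xs, ys, num_projections=200, p=2, seed=0):
    """Monte Carlo sliced Wasserstein-p distance between two point sets.

    Assumes xs and ys contain the same number of points of equal dimension,
    so the 1-D optimal transport plan simply matches sorted projections.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("point sets must be non-empty and of equal size")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        # A normalized Gaussian vector is uniform on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        # Project both sets onto the direction and sort the projections.
        px = sorted(sum(a * b for a, b in zip(pt, direction)) for pt in xs)
        py = sorted(sum(a * b for a, b in zip(pt, direction)) for pt in ys)
        # 1-D Wasserstein-p cost between the two empirical measures.
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / num_projections) ** (1.0 / p)


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in points]

same = sliced_wasserstein(points, points)
moved = sliced_wasserstein(points, shifted)
```

    Each direction is drawn independently of the others, which is the source of the redundancy the abstract mentions; the Markovian variant instead makes each direction depend on the previous one.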
    HierarchicalForecast: A Reference Framework for Hierarchical Forecasting in Python. (arXiv:2207.03517v4 [stat.ML] UPDATED)
    Large collections of time series data are commonly organized into structures with different levels of aggregation; examples include product and geographical groupings. It is often important to ensure that the forecasts are coherent so that the predicted values at disaggregate levels add up to the aggregate forecast. The growing interest of the Machine Learning community in hierarchical forecasting systems indicates that we are in a propitious moment to ensure that scientific endeavors are grounded on sound baselines. For this reason, we put forward the HierarchicalForecast library, which contains preprocessed publicly available datasets, evaluation metrics, and a compiled set of statistical baseline models. Our Python-based reference framework aims to bridge the gap between statistical and econometric modeling, and Machine Learning forecasting research. Code and documentation are available at https://github.com/Nixtla/hierarchicalforecast.
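    To make the coherence requirement concrete, the sketch below builds coherent forecasts with the simplest reconciliation strategy, bottom-up aggregation, in plain Python. It deliberately does not use the HierarchicalForecast API; the function `bottom_up`, the region names, and the numbers are hypothetical.

```python
def bottom_up(bottom_forecasts, hierarchy):
    """Return coherent forecasts where each aggregate is the sum of its leaves.

    bottom_forecasts: forecast value for each bottom-level series.
    hierarchy: maps each aggregate series to the bottom-level series it contains.
    """
    coherent = dict(bottom_forecasts)
    for parent, leaves in hierarchy.items():
        coherent[parent] = sum(bottom_forecasts[leaf] for leaf in leaves)
    return coherent


# Hypothetical geographical grouping with two aggregate levels.
hierarchy = {
    "total": ["north", "south", "east"],
    "coastal": ["south", "east"],
}
bottom_forecasts = {"north": 10.0, "south": 20.0, "east": 30.0}

coherent = bottom_up(bottom_forecasts, hierarchy)
```

    By construction the aggregate forecasts equal the sums of the disaggregate ones. Forecasting every level independently would generally not satisfy this, which is why a reconciliation step is needed.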
    Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials. (arXiv:2301.03655v1 [stat.ML])
    We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Simulation experiments show that our method out-performs previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables whilst accounting for the uncertainty in the posterior distribution.
    Community Detection with Known, Unknown, or Partially Known Auxiliary Latent Variables. (arXiv:2301.04088v1 [cs.SI])
    Empirical observations suggest that in practice, community membership does not completely explain the dependency between the edges of an observation graph. The residual dependence of the graph edges is modeled in this paper, to first order, by auxiliary node latent variables that affect the statistics of the graph edges but carry no information about the communities of interest. We then study community detection in graphs obeying the stochastic block model and censored block model with auxiliary latent variables. We analyze the conditions for exact recovery when these auxiliary latent variables are unknown, representing unknown nuisance parameters or model mismatch. We also analyze exact recovery when these secondary latent variables have been either fully or partially revealed. Finally, we propose a semidefinite programming algorithm for recovering the desired labels when the secondary labels are either known or unknown. We show that exact recovery is possible by semidefinite programming down to the respective maximum likelihood exact recovery threshold.  ( 2 min )
  • Open

    Goal Misgeneralization in Deep Reinforcement Learning. (arXiv:2105.14111v7 [cs.LG] UPDATED)
    We study goal misgeneralization, a type of out-of-distribution generalization failure in reinforcement learning (RL). Goal misgeneralization failures occur when an RL agent retains its capabilities out-of-distribution yet pursues the wrong goal. For instance, an agent might continue to competently avoid obstacles, but navigate to the wrong place. In contrast, previous works have typically focused on capability generalization failures, where an agent fails to do anything sensible at test time. We formalize this distinction between capability and goal generalization, provide the first empirical demonstrations of goal misgeneralization, and present a partial characterization of its causes.  ( 2 min )
    Distributed Sparse Linear Regression under Communication Constraints. (arXiv:2301.04022v1 [cs.LG])
    In multiple domains, statistical tasks are performed in distributed settings, with data split among several end machines that are connected to a fusion center. In various applications, the end machines have limited bandwidth and power, and thus a tight communication budget. In this work we focus on distributed learning of a sparse linear regression model under severe communication constraints. We propose several two-round distributed schemes whose communication per machine is sublinear in the data dimension. In our schemes, individual machines compute debiased lasso estimators but send only very few values to the fusion center. On the theoretical front, we analyze one of these schemes and prove that with high probability it achieves exact support recovery at low signal-to-noise ratios, where individual machines fail to recover the support. We show in simulations that our scheme works as well as, and in some cases better than, more communication-intensive approaches.  ( 2 min )
    The troublesome kernel -- On hallucinations, no free lunches and the accuracy-stability trade-off in inverse problems. (arXiv:2001.01258v2 [cs.LG] UPDATED)
    Methods inspired by Artificial Intelligence (AI) are starting to fundamentally change computational science and engineering through breakthrough performances on challenging problems. However, the reliability and trustworthiness of such techniques are becoming a major concern. In inverse problems in imaging, the focus of this paper, there is increasing empirical evidence that methods may suffer from hallucinations, i.e., false, but realistic-looking artifacts; instability, i.e., sensitivity to perturbations in the data; and unpredictable generalization, i.e., excellent performance on some images, but significant deterioration on others. This paper presents a theoretical foundation for these phenomena. We give a mathematical framework describing how and when such effects arise in arbitrary reconstruction methods, not just AI-inspired techniques. Several of our results take the form of 'no free lunch' theorems. Specifically, we show that (i) methods that overperform on a single image can wrongly transfer details from one image to another, creating a hallucination, (ii) methods that overperform on two or more images can hallucinate or be unstable, (iii) optimizing the accuracy-stability trade-off is generally difficult, (iv) hallucinations and instabilities, if they occur, are not rare events, and may be encouraged by standard training, (v) it may be impossible to construct optimal reconstruction maps for certain problems, (vi) standard methods to improve reliability (e.g., regularization or adversarial training) may themselves lead to unstable problems. Our results trace these effects to the kernel of the forward operator. They assert that such effects can be avoided only if information about the kernel is encoded into the reconstruction procedure. Based on this, this work aims to spur research into new ways to develop robust and reliable AI-inspired methods for inverse problems in imaging.  ( 3 min )
    VeriX: Towards Verified Explainability of Deep Neural Networks. (arXiv:2212.01051v3 [cs.LG] UPDATED)
    We present VeriX, a system for producing optimal robust explanations (La Malfa et al. 2021) for machine learning models. We build robust explanations iteratively using constraint solving techniques and a heuristic based on feature-level sensitivity ranking. We evaluate our approach on image recognition benchmarks and a real-world scenario of autonomous aircraft taxiing.  ( 2 min )
    Structural risk minimization for quantum linear classifiers. (arXiv:2105.05566v3 [quant-ph] UPDATED)
    Quantum machine learning (QML) models based on parameterized quantum circuits are often highlighted as candidates for quantum computing's near-term "killer application". However, the understanding of the empirical and generalization performance of these models is still in its infancy. In this paper we study how to balance between training accuracy and generalization performance (also called structural risk minimization) for two prominent QML models introduced by Havlíček et al. (Nature, 2019), and Schuld and Killoran (PRL, 2019). Firstly, using relationships to well understood classical models, we prove that two model parameters -- i.e., the dimension of the sum of the images and the Frobenius norm of the observables used by the model -- closely control the models' complexity and therefore their generalization performance. Secondly, using ideas inspired by process tomography, we prove that these model parameters also closely control the models' ability to capture correlations in sets of training examples. In summary, our results give rise to new options for structural risk minimization for QML models.  ( 2 min )
    Differentiable, learnable, regionalized process-based models with physical outputs can approach state-of-the-art hydrologic prediction accuracy. (arXiv:2203.14827v2 [cs.LG] UPDATED)
    Predictions of hydrologic variables across the entire water cycle have significant value for water resource management as well as downstream applications such as ecosystem and water quality modeling. Recently, purely data-driven deep learning models like long short-term memory (LSTM) showed seemingly insurmountable performance in modeling rainfall-runoff and other geoscientific variables, yet they cannot predict untrained physical variables and remain challenging to interpret. Here we show that differentiable, learnable, process-based models (called δ models here) can approach the performance level of LSTM for the intensively observed variable (streamflow) with regionalized parameterization. We use a simple hydrologic model HBV as the backbone and use embedded neural networks, which can only be trained in a differentiable programming framework, to parameterize, enhance, or replace the process-based model modules. Without using an ensemble or post-processor, δ models can obtain a median Nash-Sutcliffe efficiency of 0.732 for 671 basins across the USA for the Daymet forcing dataset, compared to 0.748 from a state-of-the-art LSTM model with the same setup. For another forcing dataset, the difference is even smaller: 0.715 vs. 0.722. Meanwhile, the resulting learnable process-based models can output a full set of untrained variables, e.g., soil and groundwater storage, snowpack, evapotranspiration, and baseflow, and later be constrained by their observations. Both simulated evapotranspiration and fraction of discharge from baseflow agreed decently with alternative estimates. The general framework can work with models with various process complexity and opens up the path for learning physics from big data.  ( 2 min )
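    The headline metric in the abstract above is the Nash-Sutcliffe efficiency (NSE). As a reminder of what the reported values measure, here is a minimal sketch of the standard definition, NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2). The function name and the toy streamflow values are hypothetical and are not data from the paper.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """Nash-Sutcliffe efficiency of a simulated series against observations.

    1.0 means a perfect match; 0.0 means the simulation is no better than
    always predicting the mean of the observations; negative is worse.
    """
    if not observed or len(observed) != len(simulated):
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series is constant; NSE is undefined")
    return 1.0 - residual / variance


# Hypothetical streamflow observations for one basin.
observed = [1.0, 2.0, 3.0, 4.0]

perfect = nash_sutcliffe_efficiency(observed, observed)
mean_only = nash_sutcliffe_efficiency(observed, [2.5, 2.5, 2.5, 2.5])
```

    A figure such as the median NSE over 671 basins would be obtained by computing this score per basin and taking the median of the per-basin values.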
    Partial order: Finding Consensus among Uncertain Feature Attributions. (arXiv:2110.13369v2 [cs.LG] UPDATED)
    Post-hoc feature attribution methods are progressively being employed to explain decisions of complex machine learning models. Yet, it is possible for practitioners to obtain a diversity of models that provide very different explanations to the same prediction, making it hard to derive insight from them. In this work, instead of aiming at reducing the under-specification of model explanations, we fully embrace it and extract logical statements about feature attributions that are consistent across multiple models with good performance. We show that a partial order of feature importance arises from this methodology, enabling more nuanced explanations by allowing pairs of features to be incomparable when there is no consensus on their relative importance. We prove that every relation among features present in these partial orders also holds in the rankings provided by existing approaches. Finally, we present use cases on three datasets where partial orders allow one to extract knowledge from models despite their under-specification.  ( 2 min )
    Combinatorial Pure Exploration of Causal Bandits. (arXiv:2206.07883v2 [cs.LG] UPDATED)
    The combinatorial pure exploration of causal bandits is the following online learning task: given a causal graph with unknown causal inference distributions, in each round we choose a subset of variables to intervene on, or perform no intervention, and observe the random outcomes of all random variables. The goal is to use as few rounds as possible to output an intervention that gives the best (or almost best) expected outcome on the reward variable $Y$ with probability at least $1-\delta$, where $\delta$ is a given confidence level. We provide the first gap-dependent and fully adaptive pure exploration algorithms on two types of causal models -- the binary generalized linear model (BGLM) and general graphs. For BGLM, our algorithm is the first to be designed specifically for this setting and achieves polynomial sample complexity, while all existing algorithms for general graphs have either sample complexity exponential in the graph size or some unreasonable assumptions. For general graphs, our algorithm provides a significant improvement on sample complexity, and it nearly matches the lower bound we prove. Our algorithms achieve such improvement by a novel integration of prior causal bandit algorithms and prior adaptive pure exploration algorithms, the former of which utilize the rich observational feedback in causal bandits but are not adaptive to reward gaps, while the latter of which have the issue in reverse.  ( 2 min )
    Differentiable modeling to unify machine learning and physical models and advance Geosciences. (arXiv:2301.04027v1 [cs.LG])
    Process-Based Modeling (PBM) and Machine Learning (ML) are often perceived as distinct paradigms in the geosciences. Here we present differentiable geoscientific modeling as a powerful pathway toward dissolving the perceived barrier between them and ushering in a paradigm shift. For decades, PBM offered benefits in interpretability and physical consistency but struggled to efficiently leverage large datasets. ML methods, especially deep networks, presented strong predictive skills yet lacked the ability to answer specific scientific questions. While various methods have been proposed for ML-physics integration, an important underlying theme -- differentiable modeling -- is not sufficiently recognized. Here we outline the concepts, applicability, and significance of differentiable geoscientific modeling (DG). "Differentiable" refers to accurately and efficiently calculating gradients with respect to model variables, critically enabling the learning of high-dimensional unknown relationships. DG refers to a range of methods connecting varying amounts of prior knowledge to neural networks and training them together, capturing a different scope than physics-guided machine learning and emphasizing first principles. Preliminary evidence suggests DG offers better interpretability and causality than ML, improved generalizability and extrapolation capability, and strong potential for knowledge discovery, while approaching the performance of purely data-driven ML. DG models require less training data while scaling favorably in performance and efficiency with increasing amounts of data. With DG, geoscientists may be better able to frame and investigate questions, test hypotheses, and discover unrecognized linkages.  ( 2 min )
    ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints. (arXiv:2202.11271v3 [cs.RO] UPDATED)
    Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize general knowledge about global geography, in the form of a roadmap, GPS, or other side information providing important cues. In this work, we propose an approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps for hints and attempts to evaluate the appropriateness of these subgoals in order to reach the goal. These models are used by a heuristic planner to identify the best waypoint in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and backtracking when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out our project page https://sites.google.com/view/viking-release.  ( 3 min )
    Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage. (arXiv:2107.06226v4 [cs.LG] UPDATED)
    We study model-based offline Reinforcement Learning with general function approximation without a full coverage assumption on the offline data distribution. We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO), which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate that this algorithmic framework can be applied to many specialized Markov Decision Processes where additional structural assumptions can further refine the concept of partial coverage. Two notable examples are: (1) low-rank MDP with representation learning, where the partial coverage condition is defined using a relative condition number measured by the unknown ground truth feature representation; (2) factored MDP, where the partial coverage condition is defined using density-ratio-based concentrability coefficients associated with individual factors.  ( 2 min )
    Mastering Diverse Domains through World Models. (arXiv:2301.04104v1 [cs.AI])
    General intelligence requires solving tasks across many domains. Current reinforcement learning algorithms carry this potential but are held back by the resources and knowledge required to tune them for new tasks. We present DreamerV3, a general and scalable algorithm based on world models that outperforms previous approaches across a wide range of domains with fixed hyperparameters. These domains include continuous and discrete actions, visual and low-dimensional inputs, 2D and 3D worlds, different data budgets, reward frequencies, and reward scales. We observe favorable scaling properties of DreamerV3, with larger models directly translating to higher data-efficiency and final performance. Applied out of the box, DreamerV3 is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula, a long-standing challenge in artificial intelligence. Our general algorithm makes reinforcement learning broadly applicable and allows scaling to hard decision making problems.  ( 2 min )
    Robust Deep Reinforcement Learning through Bootstrapped Opportunistic Curriculum. (arXiv:2206.10057v2 [cs.LG] UPDATED)
    Despite considerable advances in deep reinforcement learning, it has been shown to be highly vulnerable to adversarial perturbations to state observations. Recent efforts that have attempted to improve adversarial robustness of reinforcement learning can nevertheless tolerate only very small perturbations, and remain fragile as perturbation size increases. We propose Bootstrapped Opportunistic Adversarial Curriculum Learning (BCL), a novel flexible adversarial curriculum learning framework for robust reinforcement learning. Our framework combines two ideas: conservatively bootstrapping each curriculum phase with highest quality solutions obtained from multiple runs of the previous phase, and opportunistically skipping forward in the curriculum. In our experiments we show that the proposed BCL framework enables dramatic improvements in robustness of learned policies to adversarial perturbations. The greatest improvement is for Pong, where our framework yields robustness to perturbations of up to 25/255; in contrast, the best existing approach can only tolerate adversarial noise up to 5/255. Our code is available at: https://github.com/jlwu002/BCL.  ( 2 min )
    IronForge: An Open, Secure, Fair, Decentralized Federated Learning. (arXiv:2301.04006v1 [cs.LG])
    Federated learning (FL) provides an effective machine learning (ML) architecture to protect data privacy in a distributed manner. However, the inevitable network asynchrony, the over-dependence on a central coordinator, and the lack of an open and fair incentive mechanism collectively hinder its further development. We propose IronForge, a new generation of FL framework, that features a Directed Acyclic Graph (DAG)-based data structure and eliminates the need for central coordinators to achieve fully decentralized operations. IronForge runs in a public and open network, and launches a fair incentive mechanism by enabling state consistency in the DAG, so that the system fits in networks where training resources are unevenly distributed. In addition, dedicated defense strategies against prevalent FL attacks on incentive fairness and data privacy are presented to ensure the security of IronForge. Experimental results based on a newly developed testbed FLSim highlight the superiority of IronForge to the existing prevalent FL frameworks under various specifications in performance, fairness, and security. To the best of our knowledge, IronForge is the first secure and fully decentralized FL framework that can be applied in open networks with realistic network and training settings.  ( 2 min )
    Vision Transformers Are Good Mask Auto-Labelers. (arXiv:2301.03992v1 [cs.CV])
    We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.  ( 2 min )
    FOLD-SE: An Efficient Rule-based Machine Learning Algorithm with Scalable Explainability. (arXiv:2208.07912v2 [cs.LG] UPDATED)
    We present FOLD-SE, an efficient, explainable machine learning algorithm for classification tasks given tabular data containing numerical and categorical values. FOLD-SE generates a set of default rules (essentially a stratified normal logic program) as an explainable trained model. Explainability provided by FOLD-SE is scalable, meaning that regardless of the size of the dataset, the number of learned rules and learned literals stays quite small while good accuracy in classification is maintained. A model with a smaller number of rules and literals is easier for human beings to understand. FOLD-SE is competitive with state-of-the-art machine learning algorithms such as XGBoost and Multi-Layer Perceptrons (MLP) with respect to accuracy of prediction. However, unlike XGBoost and MLP, the FOLD-SE algorithm is explainable. The FOLD-SE algorithm builds upon our earlier work on developing the explainable FOLD-R++ machine learning algorithm for binary classification and inherits all of its positive features. Thus, pre-processing of the dataset, using techniques such as one-hot encoding, is not needed. Like FOLD-R++, FOLD-SE uses prefix sum to speed up computations, resulting in FOLD-SE being an order of magnitude faster than XGBoost and MLP in execution speed. The FOLD-SE algorithm outperforms FOLD-R++ as well as other rule-learning algorithms such as RIPPER in efficiency, performance and scalability, especially for large datasets. A major reason for the scalable explainability of FOLD-SE is the use of a literal selection heuristic based on Gini Impurity, as opposed to the Information Gain used in FOLD-R++. A multi-category classification version of FOLD-SE is also presented.  ( 2 min )
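To make the Gini Impurity criterion mentioned in the FOLD-SE abstract concrete, here is a minimal sketch of the textbook Gini impurity applied to a single numeric literal of the form `value <= threshold`. It is not FOLD-SE's actual heuristic, which the abstract does not spell out; the function names `gini_impurity`, `split_impurity`, and `best_threshold` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Textbook Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Pick the candidate threshold with the lowest weighted impurity (illustrative only)."""
    return min(sorted(set(values)), key=lambda t: split_impurity(values, labels, t))


values = [1, 2, 3, 4]
labels = ["a", "a", "b", "b"]
print(gini_impurity(labels))           # impurity of the unsplit data
print(best_threshold(values, labels))  # threshold that separates the classes best
```

A lower weighted impurity means the literal separates the classes more cleanly; a rule learner can rank candidate literals by this score.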
    Smart Application for Fall Detection Using Wearable ECG & Accelerometer Sensors. (arXiv:2207.00008v2 [cs.HC] UPDATED)
    Timely and reliable detection of falls is a large and rapidly growing field of research due to the medical and financial demand of caring for a constantly growing elderly population. Within the past two decades, the availability of high-quality hardware (high-quality sensors and AI microchips) and software (machine learning algorithms) technologies has served as a catalyst for this research by giving developers the capabilities to develop such systems. This study developed multiple application components in order to investigate the development challenges and choices for fall detection systems, and provide materials for future research. The smart application developed using this methodology was validated by the results from fall detection modelling experiments and model mobile deployment. The best performing model overall was the ResNet152 on a standardised and shuffled dataset with a 2s window size, which achieved 92.8% AUC, 87.28% sensitivity, and 98.33% specificity. Given these results it is evident that accelerometer and ECG sensors are beneficial for fall detection, and allow for the discrimination between falls and other activities. This study leaves a significant amount of room for improvement due to weaknesses identified in the resultant dataset. These improvements include using a labelling protocol for the critical phase of a fall, increasing the number of dataset samples, improving the test subject representation, and experimenting with frequency domain preprocessing.  ( 2 min )
    Attribution-based Explanations that Provide Recourse Cannot be Robust. (arXiv:2205.15834v2 [stat.ML] UPDATED)
    Different users of machine learning methods require different explanations, depending on their goals. To make machine learning accountable to society, one important goal is to get actionable options for recourse, which allow an affected user to change the decision $f(x)$ of a machine learning system by making limited changes to its input $x$. We formalize this by providing a general definition of recourse sensitivity, which needs to be instantiated with a utility function that describes which changes to the decisions are relevant to the user. This definition applies to local attribution methods, which attribute an importance weight to each input feature. It is often argued that such local attributions should be robust, in the sense that a small change in the input $x$ that is being explained, should not cause a large change in the feature weights. However, we prove formally that it is in general impossible for any single attribution method to be both recourse sensitive and robust at the same time. It follows that there must always exist counterexamples to at least one of these properties. We provide such counterexamples for several popular attribution methods, including LIME, SHAP, Integrated Gradients and SmoothGrad. Our results also cover counterfactual explanations, which may be viewed as attributions that describe a perturbation of $x$. We further discuss possible ways to work around our impossibility result, for instance by allowing the output to consist of sets with multiple attributions, and we provide sufficient conditions for specific classes of continuous functions to be recourse sensitive. Finally, we strengthen our impossibility result for the restricted case where users are only able to change a single attribute of $x$, by providing an exact characterization of the functions $f$ to which impossibility applies.  ( 2 min )
    Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice. (arXiv:2210.03709v3 [cs.HC] UPDATED)
    Recent years have seen growing interest among both researchers and practitioners in user-engaged approaches to algorithm auditing, which directly engage users in detecting problematic behaviors in algorithmic systems. However, we know little about industry practitioners' current practices and challenges around user-engaged auditing, nor what opportunities exist for them to better leverage such approaches in practice. To investigate, we conducted a series of interviews and iterative co-design activities with practitioners who employ user-engaged auditing approaches in their work. Our findings reveal several challenges practitioners face in appropriately recruiting and incentivizing user auditors, scaffolding user audits, and deriving actionable insights from user-engaged audit reports. Furthermore, practitioners shared organizational obstacles to user-engaged auditing, surfacing a complex relationship between practitioners and user auditors. Based on these findings, we discuss opportunities for future HCI research to help realize the potential (and mitigate the risks) of user-engaged auditing in industry practice.  ( 2 min )
    ELIAS: End-to-End Learning to Index and Search in Large Output Spaces. (arXiv:2210.08410v2 [cs.LG] UPDATED)
    Extreme multi-label classification (XMC) is a popular framework for solving many real-world problems that require accurate prediction from a very large number of potential output choices. A popular approach for dealing with the large label space is to arrange the labels into a shallow tree-based index and then learn an ML model to efficiently search this index via beam search. Existing methods initialize the tree index by clustering the label space into a few mutually exclusive clusters based on pre-defined features and keep it fixed throughout the training procedure. This approach results in a sub-optimal indexing structure over the label space and limits the search performance to the quality of choices made during the initialization of the index. In this paper, we propose a novel method ELIAS which relaxes the tree-based index to a specialized weighted graph-based index which is learned end-to-end with the final task objective. More specifically, ELIAS models the discrete cluster-to-label assignments in the existing tree-based index as soft learnable parameters that are learned jointly with the rest of the ML model. ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels. In particular, ELIAS can be up to 2.5% better at precision@1 and up to 4% better at recall@100 than existing XMC methods. A PyTorch implementation of ELIAS along with other resources is available at https://github.com/nilesh2797/ELIAS.
    Generating Accurate and Faithful Discharge Instructions: Task, Dataset, and Model. (arXiv:2210.12777v2 [cs.CL] UPDATED)
    The "Patient Instruction" (PI), also known as the "Discharge Instruction", which contains critical instructional information provided both to carers and to the patient at the time of discharge, is essential for the patient to manage their condition outside hospital. An accurate and easy-to-follow PI can improve the self-management of patients, which can in turn reduce hospital readmission rates. However, writing an appropriate PI can be extremely time-consuming for physicians, and is subject to being incomplete or error-prone for (potentially overworked) physicians. Therefore, we propose a new task that can provide an objective means of avoiding incompleteness, while reducing clinical workload: the automatic generation of the PI, which is imagined as being a document that the clinician can review, modify, and approve as necessary (rather than taking the human "out of the loop"). We build a benchmark clinical dataset and propose the Re3Writer, which imitates the working patterns of physicians to first retrieve related working experience from historical PIs written by physicians, then reason related medical knowledge. Finally, it refines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the PI for a previously unseen patient according to their health records during hospitalization. Our experiments show that, using our method, the performance of five different models can be substantially boosted across all metrics, with up to 20%, 11%, and 19% relative improvements in BLEU-4, ROUGE-L, and METEOR, respectively. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of its usefulness for clinical practice. The code is available at https://github.com/AI-in-Hospitals/Patient-Instructions
    Bias-Aware Face Mask Detection Dataset. (arXiv:2211.01207v3 [cs.CV] UPDATED)
    In December 2019, a novel coronavirus (COVID-19) spread so quickly around the world that many countries had to set mandatory face mask rules in public areas to reduce the transmission of the virus. To monitor public adherence, researchers aimed to rapidly develop efficient systems that can detect faces with masks automatically. However, the lack of representative and novel datasets proved to be the biggest challenge. Early attempts to collect face mask datasets did not account for potential race, gender, and age biases. Therefore, the resulting models show inherent biases toward specific race groups, such as Asian or Caucasian. In this work, we present a novel face mask detection dataset that contains images posted on Twitter during the pandemic from around the world. Unlike previous datasets, the proposed Bias-Aware Face Mask Detection (BAFMD) dataset contains more images from underrepresented race and age groups to mitigate the problem for the face mask detection task. We perform experiments to investigate potential biases in widely used face mask detection datasets and illustrate that the BAFMD dataset yields models with better performance and generalization ability. The dataset is publicly available at https://github.com/Alpkant/BAFMD.
    Stars: Tera-Scale Graph Building for Clustering and Graph Learning. (arXiv:2212.02635v2 [cs.LG] UPDATED)
    A fundamental procedure in the analysis of massive datasets is the construction of similarity graphs. Such graphs play a key role for many downstream tasks, including clustering, classification, graph learning, and nearest neighbor search. For these tasks, it is critical to build graphs which are sparse yet still representative of the underlying data. The benefits of sparsity are twofold: firstly, constructing dense graphs is infeasible in practice for large datasets, and secondly, the runtime of downstream tasks is directly influenced by the sparsity of the similarity graph. In this work, we present Stars: a highly scalable method for building extremely sparse graphs via two-hop spanners, which are graphs where similar points are connected by a path of length at most two. Stars can construct two-hop spanners with significantly fewer similarity comparisons, which are a major bottleneck for learning based models where comparisons are expensive to evaluate. Theoretically, we demonstrate that Stars builds a graph in nearly-linear time, where approximate nearest neighbors are contained within two-hop neighborhoods. In practice, we have deployed Stars for multiple data sets allowing for graph building at the tera-scale, i.e., for graphs with tens of trillions of edges. We evaluate the performance of Stars for clustering and graph learning, and demonstrate 10- to 1000-fold improvements in pairwise similarity comparisons compared to different baselines, and 2- to 10-fold improvement in running time without quality loss.
    Dynamic Tensor Product Regression. (arXiv:2210.03961v2 [cs.DS] UPDATED)
    In this work, we initiate the study of Dynamic Tensor Product Regression. One has matrices $A_1\in \mathbb{R}^{n_1\times d_1},\ldots,A_q\in \mathbb{R}^{n_q\times d_q}$ and a label vector $b\in \mathbb{R}^{n_1\ldots n_q}$, and the goal is to solve the regression problem with the design matrix $A$ being the tensor product of the matrices $A_1, A_2, \dots, A_q$, i.e., $\min_{x\in \mathbb{R}^{d_1\ldots d_q}}~\|(A_1\otimes \ldots\otimes A_q)x-b\|_2$. At each time step, one matrix $A_i$ receives a sparse change, and the goal is to maintain a sketch of the tensor product $A_1\otimes\ldots \otimes A_q$ so that the regression solution can be updated quickly. Recomputing the solution from scratch for each round is very slow, so it is important to develop algorithms which can quickly update the solution with the new design matrix. Our main result is a dynamic tree data structure where any update to a single matrix can be propagated quickly throughout the tree. We show that our data structure can be used to solve dynamic versions of not only Tensor Product Regression, but also Tensor Product Spline regression (which is a generalization of ridge regression), and to maintain Low Rank Approximations for the tensor product.
    How Far Should We Look Back to Achieve Effective Real-Time Time-Series Anomaly Detection?. (arXiv:2102.06560v6 [cs.LG] UPDATED)
    Anomaly detection is the process of identifying unexpected events or abnormalities in data, and it has been applied in many different areas such as system monitoring, fraud detection, healthcare, intrusion detection, etc. Providing real-time, lightweight, and proactive anomaly detection for time series with neither human intervention nor domain knowledge could be highly valuable since it reduces human effort and enables appropriate countermeasures to be undertaken before a disastrous event occurs. To our knowledge, RePAD (Real-time Proactive Anomaly Detection algorithm) is a generic approach with all above-mentioned features. To achieve real-time and lightweight detection, RePAD utilizes Long Short-Term Memory (LSTM) to detect whether or not each upcoming data point is anomalous based on short-term historical data points. However, it is unclear how different amounts of historical data points affect the performance of RePAD. Therefore, in this paper, we investigate the impact of different amounts of historical data on RePAD by introducing a set of performance metrics that cover novel detection accuracy measures, time efficiency, readiness, and resource consumption. Empirical experiments based on real-world time series datasets are conducted to evaluate RePAD in different scenarios, and the experimental results are presented and discussed.
    Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation. (arXiv:2109.06308v3 [cs.CL] UPDATED)
    Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT'14 and WMT'14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.
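As a rough illustration of the regularizer named in the abstract above, the sketch below computes the standard Elastic Weight Consolidation penalty, (strength / 2) * sum_i F_i * (theta_i - theta*_i)^2, and adds it to a task loss. How the paper combines it with scheduled sampling is not stated in the abstract, so treating the anchor parameters as those of a previously trained model and the task loss as the scheduled-sampling loss is an assumption; `ewc_penalty`, `total_loss`, and `strength` are hypothetical names.

```python
def ewc_penalty(params, anchor_params, fisher, strength=1.0):
    """Standard EWC penalty: 0.5 * strength * sum_i F_i * (theta_i - theta*_i)^2.

    params        -- current parameter values (flat list of floats)
    anchor_params -- parameters of the earlier model to stay close to (assumption)
    fisher        -- diagonal Fisher information estimates, one per parameter
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_loss, params, anchor_params, fisher, strength=1.0):
    """Task loss (e.g. a scheduled-sampling loss, by assumption) plus the EWC penalty."""
    return task_loss + ewc_penalty(params, anchor_params, fisher, strength)


# Only the first parameter moved, and it has Fisher weight 4.0.
print(total_loss(0.5, [1.0, 2.0], [0.0, 2.0], [4.0, 10.0]))
```

Parameters with large Fisher values are pulled strongly back toward their earlier values, which is how the penalty can limit forgetting while the new objective is optimized.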
    ProxyBO: Accelerating Neural Architecture Search via Bayesian Optimization with Zero-cost Proxies. (arXiv:2110.10423v3 [cs.LG] UPDATED)
    Designing neural architectures requires immense manual effort. This has promoted the development of neural architecture search (NAS) to automate the design. Previous NAS methods achieve promising results but run slowly, whereas zero-cost proxies run extremely fast but are less promising. Therefore, there is great potential to accelerate NAS via those zero-cost proxies. The existing method has two limitations, which are unforeseeable reliability and one-shot usage. To address the limitations, we present ProxyBO, an efficient Bayesian optimization (BO) framework that utilizes the zero-cost proxies to accelerate neural architecture search. We apply the generalization ability measurement to estimate the fitness of proxies on the task during each iteration and design a novel acquisition function to combine BO with zero-cost proxies based on their dynamic influence. Extensive empirical studies show that ProxyBO consistently outperforms competitive baselines on five tasks from three public benchmarks. Concretely, ProxyBO achieves up to 5.41x and 3.86x speedups over the state-of-the-art approaches REA and BRP-NAS.
    Adaptive Data Debiasing through Bounded Exploration. (arXiv:2110.13054v2 [cs.LG] UPDATED)
    Biases in existing datasets used to train algorithmic decision rules can raise ethical and economic concerns due to the resulting disparate treatment of different groups. We propose an algorithm for sequentially debiasing such datasets through adaptive and bounded exploration in a classification problem with costly and censored feedback. Exploration in this context means that at times, and to a judiciously-chosen extent, the decision maker deviates from its (current) loss-minimizing rule, and instead accepts some individuals that would otherwise be rejected, so as to reduce statistical data biases. Our proposed algorithm includes parameters that can be used to balance between the ultimate goal of removing data biases -- which will in turn lead to more accurate and fair decisions, and the exploration risks incurred to achieve this goal. We analytically show that such exploration can help debias data in certain distributions. We further investigate how fairness criteria can work in conjunction with our data debiasing algorithm. We illustrate the performance of our algorithm using experiments on synthetic and real-world datasets.
    HierarchicalForecast: A Reference Framework for Hierarchical Forecasting in Python. (arXiv:2207.03517v4 [stat.ML] UPDATED)
    Large collections of time series data are commonly organized into structures with different levels of aggregation; examples include product and geographical groupings. It is often important to ensure that the forecasts are coherent so that the predicted values at disaggregate levels add up to the aggregate forecast. The growing interest of the Machine Learning community in hierarchical forecasting systems indicates that we are in a propitious moment to ensure that scientific endeavors are grounded on sound baselines. For this reason, we put forward the HierarchicalForecast library, which contains preprocessed publicly available datasets, evaluation metrics, and a compiled set of statistical baseline models. Our Python-based reference framework aims to bridge the gap between statistical and econometric modeling, and Machine Learning forecasting research. Code and documentation are available at https://github.com/Nixtla/hierarchicalforecast.
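The coherence requirement described above (disaggregate forecasts must add up to the aggregate forecast) can be sketched with a plain bottom-up reconciliation. This deliberately does not use the HierarchicalForecast API; `bottom_up`, `is_coherent`, and the toy hierarchy are hypothetical names and data introduced only for illustration.

```python
def bottom_up(bottom_forecasts, hierarchy):
    """Build coherent forecasts by summing bottom-level forecasts.

    bottom_forecasts -- dict: bottom-level series name -> forecast value
    hierarchy        -- dict: aggregate name -> list of bottom-level series it contains
    """
    coherent = dict(bottom_forecasts)
    for parent, children in hierarchy.items():
        coherent[parent] = sum(bottom_forecasts[child] for child in children)
    return coherent


def is_coherent(forecasts, hierarchy, tol=1e-9):
    """Check that every aggregate equals the sum of its bottom-level series."""
    return all(
        abs(forecasts[parent] - sum(forecasts[child] for child in children)) <= tol
        for parent, children in hierarchy.items()
    )


hierarchy = {"total": ["north", "south"]}
# Independently produced base forecasts need not add up.
base = {"total": 105.0, "north": 40.0, "south": 60.0}
print(is_coherent(base, hierarchy))

reconciled = bottom_up({"north": 40.0, "south": 60.0}, hierarchy)
print(reconciled["total"], is_coherent(reconciled, hierarchy))
```

Bottom-up is only one of several reconciliation strategies; it is used here because it makes the additivity constraint easy to see.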
    AniWho : A Quick and Accurate Way to Classify Anime Character Faces in Images. (arXiv:2208.11012v3 [cs.CV] UPDATED)
    In order to classify Japanese animation-style character faces, this paper attempts to delve further into the many models currently available, including InceptionV3, InceptionResNetV2, MobileNetV2, and EfficientNet, employing transfer learning. This paper demonstrates that EfficientNet-B7, which achieves a top-1 accuracy of 85.08%, has the highest accuracy rate. MobileNetV2, which achieves a less accurate result with a top-1 accuracy of 81.92%, benefits from a significantly faster inference time and fewer required parameters. However, in the experiments MobileNetV2 is prone to overfitting; EfficientNet-B0 fixes the overfitting issue at the cost of slightly slower inference than MobileNetV2, while giving a slightly more accurate result (top-1 accuracy of 83.46%). This paper also uses a few-shot learning architecture called Prototypical Networks, which offers an adequate substitute for conventional transfer learning techniques.
    MixGen: A New Multi-Modal Data Augmentation. (arXiv:2206.08358v3 [cs.CV] UPDATED)
    Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, data is only augmented either for images or for text in previous works. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and visual entailment (+0.4% on SNLI-VE).
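The augmentation described above (interpolating images and concatenating text) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: images are represented as nested lists of pixel values, and the mixing weight `lam` with a default of 0.5 and the function name `mixgen` are assumptions made for the example.

```python
def mixgen(image_a, text_a, image_b, text_b, lam=0.5):
    """Create one new image-text pair from two existing pairs.

    The image is a pixel-wise linear interpolation of the two inputs and the
    text is the concatenation of the two captions. `lam` is an assumed
    mixing weight; images must have the same shape.
    """
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same shape")
    mixed_image = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
    mixed_text = text_a + " " + text_b
    return mixed_image, mixed_text


image, caption = mixgen([[0.0, 1.0]], "a dog", [[1.0, 0.0]], "a cat")
print(image, caption)
```

Because both captions are kept in full, the mixed text still describes everything visible in the blended image, which is how the semantic relationship between image and text is preserved.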
    Toward a `Standard Model' of Machine Learning. (arXiv:2108.07783v2 [cs.LG] UPDATED)
    Machine learning (ML) is about computational methods that enable machines to learn concepts from experience. In handling a wide variety of experience ranging from data instances, knowledge, constraints, to rewards, adversaries, and lifelong interaction in an ever-growing spectrum of tasks, contemporary ML/AI (artificial intelligence) research has resulted in a multitude of learning paradigms and methodologies. Despite the continual progress on all different fronts, the disparate, narrowly focused methods also make standardized, composable, and reusable development of ML approaches difficult, and preclude the opportunity to build AI agents that panoramically learn from all types of experience. This article presents a standardized ML formalism, in particular a `standard equation' of the learning objective, that offers a unifying understanding of many important ML algorithms in the supervised, unsupervised, knowledge-constrained, reinforcement, adversarial, and online learning paradigms, respectively -- those diverse algorithms are encompassed as special cases due to different choices of modeling components. The framework also provides guidance for mechanical design of new ML approaches and serves as a promising vehicle toward panoramic machine learning with all experience.
    Discriminator-Guided Model-Based Offline Imitation Learning. (arXiv:2207.00244v3 [cs.LG] UPDATED)
    Offline imitation learning (IL) is a powerful method to solve decision-making problems from expert demonstrations without reward labels. Existing offline IL methods suffer from severe performance degeneration under limited expert data. Including a learned dynamics model can potentially improve the state-action space coverage of expert data; however, it also faces challenging issues like model approximation/generalization errors and suboptimality of rollout data. In this paper, we propose the Discriminator-guided Model-based offline Imitation Learning (DMIL) framework, which introduces a discriminator to simultaneously distinguish the dynamics correctness and suboptimality of model rollout data against real expert demonstrations. DMIL adopts a novel cooperative-yet-adversarial learning strategy, which uses the discriminator to guide and couple the learning process of the policy and dynamics model, resulting in improved model performance and robustness. Our framework can also be extended to the case when demonstrations contain a large proportion of suboptimal data. Experimental results show that DMIL and its extension achieve superior performance and robustness compared to state-of-the-art offline IL methods under small datasets.
    A Decomposition-Based Hybrid Ensemble CNN Framework for Driver Fatigue Recognition. (arXiv:2203.09477v2 [eess.SP] UPDATED)
    Electroencephalogram (EEG) has become increasingly popular in driver fatigue monitoring systems. Several decomposition methods have been attempted to analyze the EEG signals that are complex, nonlinear and non-stationary and improve the EEG decoding performance in different applications. However, it remains challenging to extract more distinguishable features from different decomposed components for driver fatigue recognition. In this work, we propose a novel decomposition-based hybrid ensemble convolutional neural network (CNN) framework to enhance the capability of decoding EEG signals. Four decomposition methods are employed to disassemble the EEG signals into components of different complexity. Instead of handcrafted features, the CNNs in this framework directly learn from the decomposed components. In addition, a component-specific batch normalization layer is employed to reduce subject variability. Moreover, we employ two ensemble modes to integrate the outputs of all CNNs, comprehensively exploiting the diverse information of the decomposed components. Against the challenging cross-subject driver fatigue recognition task, the models under the framework all showed superior performance to the strong baselines. Specifically, the performance of different decomposition methods and ensemble modes was further compared. The results indicated that discrete wavelet transform-based ensemble CNN achieved the highest average classification accuracy of 83.48% among the compared methods. The proposed framework can be extended to any CNN architecture and be applied to any EEG-related task, opening the possibility of extracting more beneficial features from complex EEG data.
    Value Cards: An Educational Toolkit for Teaching Social Impacts of Machine Learning through Deliberation. (arXiv:2010.11411v3 [cs.CY] UPDATED)
    Recently, there have been increasing calls for computer science curricula to complement existing technical training with topics related to Fairness, Accountability, Transparency, and Ethics. In this paper, we present Value Cards, an educational toolkit to inform students and practitioners of the social impacts of different machine learning models via deliberation. This paper presents an early use of our approach in a college-level computer science course. Through an in-class activity, we report empirical data for the initial effectiveness of our approach. Our results suggest that the use of the Value Cards toolkit can improve students' understanding of both the technical definitions and trade-offs of performance metrics and their ability to apply them in real-world contexts, help them recognize the significance of considering diverse social values in the development and deployment of algorithmic systems, and enable them to communicate, negotiate and synthesize the perspectives of diverse stakeholders. Our study also demonstrates a number of caveats we need to consider when using the different variants of the Value Cards toolkit. Finally, we discuss the challenges as well as future applications of our approach.
    Convergence of Deep ReLU Networks. (arXiv:2107.12530v3 [cs.LG] UPDATED)
    We explore convergence of deep neural networks with the popular ReLU activation function, as the depth of the networks tends to infinity. To this end, we introduce the notion of activation domains and activation matrices of a ReLU network. By replacing applications of the ReLU activation function by multiplications with activation matrices on activation domains, we obtain an explicit expression of the ReLU network. We then identify the convergence of the ReLU networks as convergence of a class of infinite products of matrices. Sufficient and necessary conditions for convergence of these infinite products of matrices are studied. As a result, we establish necessary conditions for ReLU networks to converge that the sequence of weight matrices converges to the identity matrix and the sequence of the bias vectors converges to zero as the depth of ReLU networks increases to infinity. Moreover, we obtain sufficient conditions in terms of the weight matrices and bias vectors at hidden layers for pointwise convergence of deep ReLU networks. These results provide mathematical insights to the design strategy of the well-known deep residual networks in image classification.
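    The replacement of a ReLU application by a matrix multiplication, as described in the abstract above, can be illustrated with a small sketch. It assumes one common reading of "activation matrix": a diagonal 0/1 matrix that keeps exactly the positive pre-activations. The helper names (`activation_matrix`, `layer_relu`, `layer_matrix_form`) and the toy weights are hypothetical and are not taken from the paper.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def activation_matrix(pre):
    """Diagonal 0/1 matrix D with D[i][i] = 1 exactly when pre[i] > 0.

    Assumption: this is one common way to define an activation matrix;
    the paper's precise definition may differ.
    """
    n = len(pre)
    return [[1.0 if (i == j and pre[i] > 0) else 0.0 for j in range(n)]
            for i in range(n)]


def layer_relu(W, b, x):
    """One layer computed the usual way: ReLU(W x + b)."""
    pre = [p + bi for p, bi in zip(matvec(W, x), b)]
    return [max(0.0, t) for t in pre]


def layer_matrix_form(W, b, x):
    """The same layer written as D (W x + b), with D chosen from the input."""
    pre = [p + bi for p, bi in zip(matvec(W, x), b)]
    return matvec(activation_matrix(pre), pre)


# Toy (hypothetical) weights, bias, and input.
W = [[1.0, -2.0], [3.0, 1.0]]
b = [0.5, -1.0]
x = [1.0, 2.0]
print(layer_relu(W, b, x), layer_matrix_form(W, b, x))
```

    On the set of inputs that share the same sign pattern of pre-activations, D is constant, so the layer acts as an affine map there; composing layers then yields products of such matrices, which is the kind of object whose convergence the abstract studies.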
    Reconstructing Sparse Multiplex Networks with Application to Covert Networks. (arXiv:2208.01739v3 [cs.SI] UPDATED)
    Network structure provides critical information for understanding the dynamic behavior of networks. However, the complete structure of real-world networks is often unavailable, thus it is crucially important to develop approaches to infer a more complete structure of networks. In this paper, we integrate the configuration model for generating random networks into an Expectation-Maximization-Aggregation (EMA) framework to reconstruct the complete structure of multiplex networks. We validate the proposed EMA framework against the random model on several real-world multiplex networks, including both covert and overt ones. It is found that the EMA framework generally achieves the best predictive accuracy compared to the EM framework and the random model. As the number of layers increases, the performance improvement of EMA over EM decreases. The inferred multiplex networks can be leveraged to inform the decision-making on monitoring covert networks as well as allocating limited resources for collecting additional information to improve reconstruction accuracy. For law enforcement agencies, the inferred complete network structure can be used to develop more effective strategies for covert network interdiction.
    Predictive Process Model Monitoring using Recurrent Neural Networks. (arXiv:2011.02819v3 [cs.LG] UPDATED)
    The field of predictive process monitoring focuses on case-level models to predict a single specific outcome such as a particular objective, (remaining) time, or next activity/remaining sequence. Recently, a longer-horizon, model-wide approach has been proposed in the form of process model forecasting, which predicts the future state of a whole process model through the forecasting of all activity-to-activity relations at once using time series forecasting. This paper introduces the concept of \emph{predictive process model monitoring} which sits in the middle of both predictive process monitoring and process model forecasting. Concretely, by modelling a process model as a set of constraints being present between activities over time, we can capture more detailed information between activities compared to process model forecasting, while being compatible with typical predictive process monitoring objectives which are often expressed in the same language as these constraints. To achieve this, Processes-As-Movies (PAM) is introduced, i.e., a novel technique capable of jointly mining and predicting declarative process constraints between activities in various windows of a process' execution. PAM predicts what declarative rules hold for a trace (objective-based), which also supports the prediction of all constraints together as a process model (model-based). Various recurrent neural network topologies inspired by video analysis tailored to temporal high-dimensional input are used to model the process model evolution with windows as time steps, including encoder-decoder long short-term memory networks, and convolutional long short-term memory networks. Results obtained over real-life event logs show that these topologies are effective in terms of predictive accuracy and precision.
    Exploring How Machine Learning Practitioners (Try To) Use Fairness Toolkits. (arXiv:2205.06922v2 [cs.HC] UPDATED)
    Recent years have seen the development of many open-source ML fairness toolkits aimed at helping ML practitioners assess and address unfairness in their systems. However, there has been little research investigating how ML practitioners actually use these toolkits in practice. In this paper, we conducted the first in-depth empirical exploration of how industry practitioners (try to) work with existing fairness toolkits. In particular, we conducted think-aloud interviews to understand how participants learn about and use fairness toolkits, and explored the generality of our findings through an anonymous online survey. We identified several opportunities for fairness toolkits to better address practitioner needs and scaffold them in using toolkits effectively and responsibly. Based on these findings, we highlight implications for the design of future open-source fairness toolkits that can support practitioners in better contextualizing, communicating, and collaborating around ML fairness efforts.
    Sampling random graph homomorphisms and applications to network data analysis. (arXiv:1910.09483v3 [math.PR] UPDATED)
    A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also demonstrate the performance of our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels.
    Calibrated simplex-mapping classification. (arXiv:2103.02926v2 [stat.ML] UPDATED)
    We propose a novel methodology for general multi-class classification in arbitrary feature spaces, which results in a potentially well-calibrated classifier. Calibrated classifiers are important in many applications because, in addition to the prediction of mere class labels, they also yield a confidence level for each of their predictions. In essence, the training of our classifier proceeds in two steps. In a first step, the training data is represented in a latent space whose geometry is induced by a regular $(n-1)$-dimensional simplex, $n$ being the number of classes. We design this representation in such a way that it well reflects the feature space distances of the datapoints to their own- and foreign-class neighbors. In a second step, the latent space representation of the training data is extended to the whole feature space by fitting a regression model to the transformed data. With this latent-space representation, our calibrated classifier is readily defined. We rigorously establish its core theoretical properties and benchmark its prediction and calibration properties by means of various synthetic and real-world data sets from different application domains.
    Towards Understanding Quality Challenges of the Federated Learning for Neural Networks: A First Look from the Lens of Robustness. (arXiv:2201.01409v2 [cs.LG] UPDATED)
    Federated learning (FL) is a distributed learning paradigm that preserves users' data privacy while leveraging the entire dataset of all participants. In FL, multiple models are trained independently on the clients and aggregated centrally to update a global model in an iterative process. Although this approach is excellent at preserving privacy, FL still suffers from quality issues such as attacks or byzantine faults. Recent attempts have been made to address such quality challenges on the robust aggregation techniques for FL. However, the effectiveness of state-of-the-art (SOTA) robust FL techniques is still unclear and lacks a comprehensive study. Therefore, to better understand the current quality status and challenges of these SOTA FL techniques in the presence of attacks and faults, we perform a large-scale empirical study to investigate the SOTA FL's quality from multiple angles of attacks, simulated faults (via mutation operators), and aggregation (defense) methods. In particular, we study FL's performance on the image classification tasks and use DNNs as our model type. Furthermore, we perform our study on two generic image datasets and one real-world federated medical image dataset. We also investigate the effect of the proportion of affected clients and the dataset distribution factors on the robustness of FL. After a large-scale analysis with 496 configurations, we find that most mutators on each user have a negligible effect on the final model in the generic datasets, and only one of them is effective in the medical dataset. Furthermore, we show that model poisoning attacks are more effective than data poisoning attacks. Moreover, choosing the most robust FL aggregator depends on the attacks and datasets. Finally, we illustrate that a simple ensemble of aggregators achieves a more robust solution than any single aggregator and is the best choice in 75% of the cases.
    BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm. (arXiv:2301.04120v1 [cs.NE])
    The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable distribution. We converted the script into a speech corpus using two text-to-speech systems. Using the designed speech corpus, we tested the performance of speech enhancement (SE) and automatic speech recognition (ASR), which are among the most important regression- and classification-based speech processing tasks, respectively. The experimental results show that the SE and ASR models trained on the designed speech corpus outperform their counterparts trained on a randomly composed speech corpus.  ( 2 min )
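    The two script-quality figures quoted above (syllable coverage and cosine similarity between syllable distributions) can be computed as in the following sketch. The function names and the toy syllable frequencies are hypothetical; the abstract does not say how BASPRO tokenizes syllables or which reference distribution it uses.

```python
import math


def syllable_coverage(script_syllables, reference_syllables):
    """Fraction of reference syllables that appear at least once in the script."""
    reference = set(reference_syllables)
    return len(reference & set(script_syllables)) / len(reference)


def cosine_similarity(p, q):
    """Cosine similarity between two frequency dicts keyed by syllable."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


# Toy (made-up) relative frequencies, for illustration only.
script_dist = {"ma": 0.5, "shi": 0.3, "de": 0.2}
real_dist = {"ma": 0.4, "shi": 0.3, "de": 0.2, "le": 0.1}

print(syllable_coverage(script_dist, real_dist))
print(cosine_similarity(script_dist, real_dist))
```

    A genetic algorithm like the one the abstract mentions could use a combination of such scores as its fitness function, although the exact objective is not stated in the abstract.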
    Towards AI-controlled FES-restoration of arm movements: neuromechanics-based reinforcement learning for 3-D reaching. (arXiv:2301.04004v1 [eess.SY])
    Reaching disabilities affect the quality of life. Functional Electrical Stimulation (FES) can restore lost motor functions. Yet, there remain challenges in controlling FES to induce desired movements. Neuromechanical models are valuable tools for developing FES control methods. However, focusing on the upper extremity areas, several existing models are either overly simplified or too computationally demanding for control purposes. Besides the model-related issues, finding a general method for governing the control rules for different tasks and subjects remains an engineering challenge. Here, we present our approach toward FES-based restoration of arm movements to address those fundamental issues in controlling FES. Firstly, we present our surface-FES-oriented neuromechanical models of human arms built using well-accepted, open-source software. The models are designed to capture significant dynamics in FES controls with minimal computational cost. Our models are customisable and can be used for testing different control methods. Secondly, we present the application of reinforcement learning (RL) as a general method for governing the control rules. In combination, our customisable models and RL-based control method open the possibility of delivering customised FES controls for different subjects and settings with minimal engineering intervention. We demonstrate our approach in planar and 3D settings.  ( 2 min )
    A Dietary Nutrition-aided Healthcare Platform via Effective Food Recognition on a Localized Singaporean Food Dataset. (arXiv:2301.03829v1 [cs.LG])
    Localized food datasets have profound meaning in revealing a country's special cuisines to explore people's dietary behaviors, which will shed light on their health conditions and disease development. In this paper, revolving around the demand for accurate food recognition in Singapore, we develop the FoodSG platform to incubate diverse healthcare-oriented applications as a service in Singapore, taking into account their shared requirements. We release a localized Singaporean food dataset FoodSG-233 with a systematic cleaning and curation pipeline for promoting future data management research in food computing. To overcome the hurdle in recognition performance brought by Singaporean multifarious food dishes, we propose to integrate supervised contrastive learning into our food recognition model FoodSG-SCL for the intrinsic capability to mine hard positive/negative samples and therefore boost the accuracy. Through a comprehensive evaluation, we share the insightful experience with practitioners in the data management community regarding food-related data-intensive healthcare applications. The FoodSG-233 dataset can be accessed via: https://foodlg.comp.nus.edu.sg/.  ( 2 min )
    Imbalanced Classification In Faulty Turbine Data: New Proximal Policy Optimization. (arXiv:2301.04049v1 [eess.SY])
    Detecting faults in industrial and real-world systems, and choosing the best methods for doing so, is increasingly important. We seek the most trustworthy and practical data-based fault detection methods offered by artificial intelligence. In this paper, we propose a framework for fault detection based on reinforcement learning and a policy known as proximal policy optimization. Because fault data are scarce, one of the significant problems with the traditional policy is its weakness in detecting fault classes, which we address by changing the cost function. Using modified Proximal Policy Optimization, we can increase performance, overcome data imbalance, and better predict future faults. With our modified policy, all evaluation metrics increase by $3\%$ to $4\%$ compared to the traditional policy in the first benchmark, by $20\%$ to $55\%$ in the second benchmark, and by $6\%$ to $14\%$ in the third benchmark, and performance and prediction speed also improve compared to previous methods.  ( 2 min )
    There is No Big Brother or Small Brother: Knowledge Infusion in Language Models for Link Prediction and Question Answering. (arXiv:2301.04013v1 [cs.CL])
    The integration of knowledge graphs with deep learning is thriving in improving the performance of various natural language processing (NLP) tasks. In this paper, we focus on knowledge-infused link prediction and question answering using the language models T5 and BLOOM across three domains: Aviation, Movie, and Web. In this context, we infuse knowledge in large and small language models and study their performance, and find the performance to be similar. For the link prediction task on the Aviation Knowledge Graph, we obtain a 0.2 hits@1 score using T5-small, T5-base, T5-large, and BLOOM. Using template-based scripts, we create a set of 1 million synthetic factoid QA pairs in the aviation domain from National Transportation Safety Board (NTSB) reports. On our curated QA pairs, the three models of T5 achieve a 0.7 hits@1 score. We validate our findings with the paired Student's t-test and Cohen's kappa scores. For link prediction on the Aviation Knowledge Graph using T5-small and T5-large, we obtain a Cohen's kappa score of 0.76, showing substantial agreement between the models. Thus, we infer that small language models perform similarly to large language models with the infusion of knowledge.  ( 2 min )
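    For readers unfamiliar with the two metrics reported above, here is a minimal sketch of hits@1 and Cohen's kappa. It assumes kappa is computed over per-example labels from two models (for instance, whether each model's top answer was correct), which the abstract does not spell out; the function names and toy data are hypothetical.

```python
from collections import Counter


def hits_at_1(ranked_predictions, gold):
    """Fraction of examples whose top-ranked prediction equals the gold answer."""
    hits = sum(1 for ranked, g in zip(ranked_predictions, gold)
               if ranked and ranked[0] == g)
    return hits / len(gold)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two label sequences.

    p_o is the observed agreement rate and p_e the agreement expected by
    chance from the two marginal label distributions. Undefined (division
    by zero) when p_e == 1, i.e. both sequences use a single identical label.
    """
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Toy (made-up) data: 1 = model answered correctly, 0 = incorrectly.
small_model = [1, 1, 0, 0]
large_model = [1, 1, 0, 1]
print(cohens_kappa(small_model, large_model))
print(hits_at_1([["a", "b"], ["c"], ["d", "e"]], ["a", "x", "e"]))
```

    By the commonly used Landis and Koch scale, kappa values between 0.61 and 0.80 are read as substantial agreement, which is consistent with how the abstract describes its 0.76 score.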
    Manifold Restricted Interventional Shapley Values. (arXiv:2301.04041v1 [stat.ML])
    Shapley values are model-agnostic methods for explaining model predictions. Many commonly used methods of computing Shapley values, known as \emph{off-manifold methods}, rely on model evaluations on out-of-distribution input samples. Consequently, explanations obtained are sensitive to model behaviour outside the data distribution, which may be irrelevant for all practical purposes. While \emph{on-manifold methods} have been proposed which do not suffer from this problem, we show that such methods are overly dependent on the input data distribution, and therefore result in unintuitive and misleading explanations. To circumvent these problems, we propose \emph{ManifoldShap}, which respects the model's domain of validity by restricting model evaluations to the data manifold. We show, theoretically and empirically, that ManifoldShap is robust to off-manifold perturbations of the model and leads to more accurate and intuitive explanations than existing state-of-the-art Shapley methods.  ( 2 min )
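    To make the off-manifold issue concrete, the sketch below computes exact Shapley values by replacing absent features with a fixed baseline, which is one simple off-manifold scheme. It is not ManifoldShap; the function names, the baseline-replacement value function, and the toy model are assumptions chosen for illustration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values with absent features set to a fixed baseline.

    Averages each feature's marginal contribution over all feature orderings.
    Cost grows factorially with the number of features, so this is only
    suitable for very small inputs.
    """
    n = len(x)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i from baseline to x
            value = model(current)     # may be an out-of-distribution point
            totals[i] += value - prev
            prev = value
    return [t / len(orders) for t in totals]


def linear_model(v):
    """Hypothetical toy model used only for this example."""
    return 2.0 * v[0] + 3.0 * v[1]


print(shapley_values(linear_model, [1.0, 1.0], [0.0, 0.0]))
```

    Note that the loop evaluates the model at hybrid points such as (1, 0) and (0, 1). If the real data only ever satisfies x0 = x1, those points lie off the data manifold, so the explanation depends on model behaviour that never matters in practice. This is the sensitivity the abstract attributes to off-manifold methods.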
    Sentiment-based Engagement Strategies for intuitive Human-Robot Interaction. (arXiv:2301.03867v1 [cs.RO])
    Emotion expressions serve as important communicative signals and are crucial cues in intuitive interactions between humans. Hence, it is essential to include these fundamentals in robotic behavior strategies when interacting with humans to promote mutual understanding and to reduce misjudgements. We tackle this challenge by detecting and using the emotional state and attention for a sentiment analysis of potential human interaction partners to select well-adjusted engagement strategies. This way, we pave the way for more intuitive human-robot interactions, as the robot's action conforms to the person's mood and expectation. We propose four different engagement strategies with implicit and explicit communication techniques that we implement on a mobile robot platform for initial experiments.  ( 2 min )
    Quantifying Assurance in Learning-enabled Systems. (arXiv:2006.10345v1 [cs.SE] CROSS LISTED)
    Dependability assurance of systems embedding machine learning (ML) components, so-called learning-enabled systems (LESs), is a key step for their use in safety-critical applications. In emerging standardization and guidance efforts, there is a growing consensus on the value of using assurance cases for that purpose. This paper develops a quantitative notion of assurance that an LES is dependable, as a core component of its assurance case, also extending our prior work that applied to ML components. Specifically, we characterize LES assurance in the form of assurance measures: a probabilistic quantification of confidence that an LES possesses system-level properties associated with functional capabilities and dependability attributes. We illustrate the utility of assurance measures by application to a real-world autonomous aviation system, also describing their role both in i) guiding high-level, runtime risk mitigation decisions and ii) as a core component of the associated dynamic assurance case.  ( 2 min )
    Sharing pattern submodels for prediction with missing values. (arXiv:2206.11161v2 [cs.LG] UPDATED)
    Missing values are unavoidable in many applications of machine learning and present challenges both during training and at test time. When variables are missing in recurring patterns, fitting separate pattern submodels has been proposed as a solution. However, fitting models independently does not make efficient use of all available data. Conversely, fitting a single shared model to the full data set relies on imputation which often leads to biased results when missingness depends on unobserved factors. We propose an alternative approach, called sharing pattern submodels, which i) makes predictions that are robust to missing values at test time, ii) maintains or improves the predictive power of pattern submodels, and iii) has a short description, enabling improved interpretability. Parameter sharing is enforced through sparsity-inducing regularization which we prove leads to consistent estimation. Finally, we give conditions for when a sharing model is optimal, even when both missingness and the target outcome depend on unobserved variables. Classification and regression experiments on synthetic and real-world data sets demonstrate that our models achieve a favorable tradeoff between pattern specialization and information sharing.  ( 2 min )
    Actor-Director-Critic: A Novel Deep Reinforcement Learning Framework. (arXiv:2301.03887v1 [cs.LG])
    In this paper, we propose actor-director-critic, a new framework for deep reinforcement learning. Compared with the actor-critic framework, the director role is added, and action classification and action evaluation are applied simultaneously to improve the decision-making performance of the agent. Firstly, the actions of the agent are divided into high-quality actions and low-quality actions according to the rewards returned from the environment. Then, the director network is trained to discriminate between high- and low-quality actions and to guide the actor network to reduce the repetitive exploration of low-quality actions in the early stage of training. In addition, we propose an improved double estimator method to better solve the problem of overestimation in the field of reinforcement learning. For the two critic networks used, we design two target critic networks for each critic network instead of one. In this way, the target value of each critic network can be calculated by taking the average of the outputs of the two target critic networks, which is more stable and accurate than using only one target critic network to obtain the target value. In order to verify the performance of the actor-director-critic framework and the improved double estimator method, we applied them to the TD3 algorithm. Then, we carried out experiments in multiple environments in MuJoCo and compared the experimental data before and after the algorithm improvement. The final experimental results show that the improved algorithm can achieve faster convergence speed and higher total return.  ( 2 min )
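    The averaged target described above can be sketched as a one-step bootstrapped target in which each critic's target uses the mean of its two target-critic outputs. This is a minimal illustration under assumptions: the function name, the scalar inputs, and the standard `reward + gamma * (1 - done) * bootstrap` form are hypothetical choices, and the abstract does not say how the two critics' targets are combined afterwards.

```python
def averaged_target(reward, gamma, done, target_q_first, target_q_second):
    """One-step TD target for a single critic.

    target_q_first and target_q_second are the outputs of that critic's two
    target networks at the next state-action pair; their mean replaces the
    single target-network output used in the usual setup.
    """
    bootstrap = 0.5 * (target_q_first + target_q_second)
    return reward + gamma * (1.0 - done) * bootstrap


# Toy (made-up) numbers: non-terminal and terminal transitions.
print(averaged_target(1.0, 0.9, 0.0, 2.0, 4.0))
print(averaged_target(1.0, 0.9, 1.0, 2.0, 4.0))
```

    Averaging two estimates of the same quantity reduces the variance of the target when their errors are not perfectly correlated, which is one plausible reading of the abstract's claim that the averaged target is more stable.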
    RedMule: A Mixed-Precision Matrix-Matrix Operation Engine for Flexible and Energy-Efficient On-Chip Linear Algebra and TinyML Training Acceleration. (arXiv:2301.03904v1 [cs.AR])
    The increasing interest in TinyML, i.e., near-sensor machine learning on power budgets of a few tens of mW, is currently pushing toward enabling TinyML-class training as opposed to inference only. Current training algorithms, based on various forms of error and gradient backpropagation, rely on floating-point matrix operations to meet the precision and dynamic range requirements. So far, the energy and power cost of these operations has been considered too high for TinyML scenarios. This paper addresses the open challenge of near-sensor training on a few mW power budget and presents RedMulE (Reduced-Precision Matrix Multiplication Engine), a low-power specialized accelerator conceived for multi-precision floating-point General Matrix-Matrix Operations (GEMM-Ops) acceleration, supporting FP16, as well as hybrid FP8 formats, with {sign, exponent, mantissa}=({1,4,3}, {1,5,2}). We integrate RedMulE into a Parallel Ultra-Low-Power (PULP) cluster containing eight energy-efficient RISC-V cores sharing a tightly-coupled data memory and implement the resulting system in a 22 nm technology. At its best efficiency point (@ 470 MHz, 0.65 V), the RedMulE-augmented PULP cluster achieves 755 GFLOPS/W and 920 GFLOPS/W during regular General Matrix-Matrix Multiplication (GEMM), and up to 1.19 TFLOPS/W and 1.67 TFLOPS/W when executing GEMM-Ops, respectively, for FP16 and FP8 input/output tensors. At its best performance point (@ 613 MHz, 0.8 V), RedMulE achieves up to 58.5 GFLOPS and 117 GFLOPS for FP16 and FP8, respectively, with 99.4% utilization of the array of Computing Elements and consuming less than 60 mW on average, thus enabling on-device training of deep learning models in TinyML application scenarios while retaining the flexibility to tackle other classes of common linear algebra problems efficiently.  ( 2 min )
    Neighborhood-Regularized Self-Training for Learning with Few Labels. (arXiv:2301.03726v1 [cs.LG])
    Training deep neural networks (DNNs) with limited supervision has been a popular research topic as it can significantly alleviate the annotation burden. Self-training has been successfully applied in semi-supervised learning tasks, but one drawback of self-training is that it is vulnerable to the label noise from incorrect pseudo labels. Inspired by the fact that samples with similar labels tend to share similar representations, we develop a neighborhood-based sample selection approach to tackle the issue of noisy pseudo labels. We further stabilize self-training via aggregating the predictions from different rounds during sample selection. Experiments on eight tasks show that our proposed method outperforms the strongest self-training baseline with 1.83% and 2.51% performance gain for text and graph datasets on average. Our further analysis demonstrates that our proposed data selection strategy reduces the noise of pseudo labels by 36.8% and saves 57.3% of the time when compared with the best baseline. Our code and appendices will be uploaded to https://github.com/ritaranx/NeST.  ( 2 min )
    Min-Max Optimization Made Simple: Approximating the Proximal Point Method via Contraction Maps. (arXiv:2301.03931v1 [cs.GT])
    In this paper we present a first-order method that admits near-optimal convergence rates for convex/concave min-max problems while requiring a simple and intuitive analysis. Similarly to the seminal work of Nemirovski and the recent approach of Piliouras et al. in normal form games, our work is based on the fact that the update rule of the Proximal Point method (PP) can be approximated up to accuracy $\epsilon$ with only $\mathcal{O}(\log 1/\epsilon)$ additional gradient-calls through the iterations of a contraction map. Then combining the analysis of (PP) method with an error-propagation analysis we establish that the resulting first order method, called \textit{Clairvoyant Extra Gradient}, admits near-optimal time-average convergence for general domains and last-iterate convergence in the unconstrained case.  ( 2 min )
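    The central idea above, approximating the implicit Proximal Point update by iterating a contraction map, can be sketched on the bilinear toy problem f(x, y) = x * y. This is not the paper's Clairvoyant Extra Gradient method; the toy objective, the step size, and all function names are assumptions. The implicit update z' = z - eta * F(z') is solved by the fixed-point iteration w <- z - eta * F(w), which contracts with factor eta * L when eta * L < 1 (here L = 1).

```python
def operator(z):
    """Gradient operator F(x, y) = (df/dx, -df/dy) for f(x, y) = x * y."""
    x, y = z
    return (y, -x)


def approx_proximal_step(z, eta, inner_iters):
    """Approximate the solution w of w = z - eta * F(w) by fixed-point iteration.

    Each inner iteration costs one gradient call and shrinks the error by a
    factor of eta (for this 1-Lipschitz operator), so accuracy eps needs on
    the order of log(1/eps) calls.
    """
    w = z
    for _ in range(inner_iters):
        gx, gy = operator(w)
        w = (z[0] - eta * gx, z[1] - eta * gy)
    return w


def exact_proximal_step(z, eta):
    """Closed-form Proximal Point update for the bilinear toy problem."""
    x, y = z
    scale = 1.0 / (1.0 + eta * eta)
    return ((x - eta * y) * scale, (y + eta * x) * scale)


z = (1.0, 1.0)
print(approx_proximal_step(z, 0.5, 40))
print(exact_proximal_step(z, 0.5))
```

    On this problem each exact Proximal Point step shrinks the squared distance to the saddle point (0, 0) by a factor of 1 / (1 + eta^2), whereas plain simultaneous gradient descent-ascent moves away from it; the inner iteration recovers the stable implicit step using only explicit gradient evaluations.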
    Is Federated Learning a Practical PET Yet?. (arXiv:2301.04017v1 [cs.CR])
    Federated learning (FL) is a framework for users to jointly train a machine learning model. FL is promoted as a privacy-enhancing technology (PET) that provides data minimization: data never "leaves" personal devices and users share only model updates with a server (e.g., a company) coordinating the distributed training. We assess the realistic (i.e., worst-case) privacy guarantees that are provided to users who are unable to trust the server. To this end, we propose an attack against FL protected with distributed differential privacy (DDP) and secure aggregation (SA). The attack method is based on the introduction of Sybil devices that deviate from the protocol to expose individual users' data for reconstruction by the server. The underlying root cause for the vulnerability to our attack is the power imbalance. The server orchestrates the whole protocol and users are given little guarantees about the selection of other users participating in the protocol. Moving forward, we discuss requirements for an FL protocol to guarantee DDP without asking users to trust the server. We conclude that such systems are not yet practical.  ( 2 min )
    On adversarial robustness and the use of Wasserstein ascent-descent dynamics to enforce it. (arXiv:2301.03662v1 [cs.LG])
    We propose iterative algorithms to solve adversarial problems in a variety of supervised learning settings of interest. Our algorithms, which can be interpreted as suitable ascent-descent dynamics in Wasserstein spaces, take the form of a system of interacting particles. These interacting particle dynamics are shown to converge toward appropriate mean-field limit equations in certain large number of particles regimes. In turn, we prove that, under certain regularity assumptions, these mean-field equations converge, in the large time limit, toward approximate Nash equilibria of the original adversarial learning problems. We present results for nonconvex-nonconcave settings, as well as for nonconvex-concave ones. Numerical experiments illustrate our results.  ( 2 min )
    SantaCoder: don't reach for the stars!. (arXiv:2301.03988v1 [cs.SE])
    The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.  ( 2 min )
    Look Beyond Bias with Entropic Adversarial Data Augmentation. (arXiv:2301.03844v1 [cs.LG])
    Deep neural networks do not discriminate between spurious and causal patterns, and will only learn the most predictive ones while ignoring the others. This shortcut learning behaviour is detrimental to a network's ability to generalize to an unknown test-time distribution in which the spurious correlations do not hold anymore. Debiasing methods were developed to make networks robust to such spurious biases, but they require knowing in advance whether a dataset is biased and make heavy use of minority counterexamples that do not display the majority bias of their class. In this paper, we argue that such samples should not necessarily be needed because the "hidden" causal information is often also contained in biased images. To study this idea, we propose 3 publicly released synthetic classification benchmarks, exhibiting predictive classification shortcuts, each of a different and challenging nature, without any minority samples acting as counterexamples. First, we investigate the effectiveness of several state-of-the-art strategies on our benchmarks and show that they do not yield satisfying results on them. Then, we propose an architecture able to succeed on our benchmarks, despite their unusual properties, using an entropic adversarial data augmentation training scheme. An encoder-decoder architecture is tasked with producing images that are not recognized by a classifier, by maximizing the conditional entropy of its outputs, while keeping as much as possible of the initial content. A precise control of the information destroyed, via a disentangling process, enables us to remove the shortcut and leave everything else intact. Furthermore, results competitive with the state-of-the-art on the BAR dataset ensure the applicability of our method in real-life situations.  ( 2 min )
    Proceedings of the NeurIPS 2021 Workshop on Machine Learning for the Developing World: Global Challenges. (arXiv:2301.04007v1 [cs.LG])
    These are the proceedings of the 5th workshop on Machine Learning for the Developing World (ML4D), held as part of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS) on December 14th, 2021.  ( 2 min )
    Hint assisted reinforcement learning: an application in radio astronomy. (arXiv:2301.03933v1 [astro-ph.IM])
    Model-based reinforcement learning has proven to be more sample efficient than model-free methods. On the other hand, constructing a dynamics model adds complexity to model-based reinforcement learning. Data processing tasks in radio astronomy are situations in which the original problem being solved by reinforcement learning is itself the creation of a model. Fortunately, many methods based on heuristics or signal processing exist to perform the same tasks, and we can leverage them to propose the best action to take, or in other words, to provide a 'hint'. We propose to use 'hints' generated by the environment as an aid to the reinforcement learning process, mitigating the complexity of model construction. We modify the soft actor critic algorithm to use hints and use the alternating direction method of multipliers algorithm with inequality constraints to train the agent. Results in several environments show that using hints increases sample efficiency compared to model-free methods.  ( 2 min )
    Community Detection with Known, Unknown, or Partially Known Auxiliary Latent Variables. (arXiv:2301.04088v1 [cs.SI])
    Empirical observations suggest that in practice, community membership does not completely explain the dependency between the edges of an observation graph. The residual dependence of the graph edges is modeled in this paper, to first order, by auxiliary node latent variables that affect the statistics of the graph edges but carry no information about the communities of interest. We then study community detection in graphs obeying the stochastic block model and censored block model with auxiliary latent variables. We analyze the conditions for exact recovery when these auxiliary latent variables are unknown, representing unknown nuisance parameters or model mismatch. We also analyze exact recovery when these secondary latent variables have been either fully or partially revealed. Finally, we propose a semidefinite programming algorithm for recovering the desired labels when the secondary labels are either known or unknown. We show that exact recovery is possible by semidefinite programming down to the respective maximum likelihood exact recovery threshold.
    Neural Radiance Field Codebooks. (arXiv:2301.04101v1 [cs.CV])
    Compositional representations of the world are a promising step towards enabling high-level scene understanding and efficient transfer to downstream tasks. Learning such representations for complex scenes and tasks remains an open challenge. Towards this goal, we introduce Neural Radiance Field Codebooks (NRC), a scalable method for learning object-centric representations through novel view reconstruction. NRC learns to reconstruct scenes from novel views using a dictionary of object codes which are decoded through a volumetric renderer. This enables the discovery of reoccurring visual and geometric patterns across scenes which are transferable to downstream tasks. We show that NRC representations transfer well to object navigation in THOR, outperforming 2D and 3D representation learning methods by 3.1% success rate. We demonstrate that our approach is able to perform unsupervised segmentation for more complex synthetic (THOR) and real scenes (NYU Depth) better than prior methods (29% relative improvement). Finally, we show that NRC improves on the task of depth ordering by 5.5% accuracy in THOR.
    Privacy-Preserving Record Linkage for Cardinality Counting. (arXiv:2301.04000v1 [cs.CR])
    Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Example applications include health applications such as counting rare disease patients for adequate awareness and funding, and counting the number of cases of a new disease for outbreak detection; marketing applications such as counting the visibility reached for a new product; and cybersecurity applications such as tracking the number of unique views of social media posts. The data needed for the counting is, however, often personal and sensitive, and needs to be processed using privacy-preserving techniques. The quality of data in different databases, for example typos, errors and variations, poses additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm using unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods to find the optimal number of clusters as the cardinality are far from accurate as they do not take into account the purity and completeness of generated clusters. We propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising, with a significantly smaller error rate of less than 0.1 at a privacy budget $\epsilon = 1.0$ compared to the state-of-the-art fuzzy matching and clustering method.
    On the Robustness of AlphaFold: A COVID-19 Case Study. (arXiv:2301.04093v1 [cs.LG])
    Protein folding neural networks (PFNNs) such as AlphaFold predict remarkably accurate structures of proteins compared to other approaches. However, the robustness of such networks has heretofore not been explored. This is particularly relevant given the broad social implications of such technologies and the fact that biologically small perturbations in the protein sequence do not generally lead to drastic changes in the protein structure. In this paper, we demonstrate that AlphaFold does not exhibit such robustness despite its high accuracy. This raises the challenge of detecting and quantifying the extent to which these predicted protein structures can be trusted. To measure the robustness of the predicted structures, we utilize (i) the root-mean-square deviation (RMSD) and (ii) the Global Distance Test (GDT) similarity measure between the predicted structure of the original sequence and the structure of its adversarially perturbed version. We prove that the problem of minimally perturbing protein sequences to fool protein folding neural networks is NP-complete. Based on the well-established BLOSUM62 sequence alignment scoring matrix, we generate adversarial protein sequences and show that the RMSD between the predicted protein structure and the structure of the original sequence is very large when the adversarial changes are bounded by (i) 20 units in the BLOSUM62 distance, and (ii) five residues (out of hundreds or thousands of residues) in the given protein sequence. In our experimental evaluation, we consider 111 COVID-19 proteins in the Universal Protein resource (UniProt), a central resource for protein data managed by the European Bioinformatics Institute, Swiss Institute of Bioinformatics, and the US Protein Information Resource. These result in an overall GDT similarity test score average of around 34%, demonstrating a substantial drop in the performance of AlphaFold.
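    As a rough illustration of the first robustness measure named above, the sketch below computes RMSD between two sets of atomic coordinates. It assumes the two structures are already superimposed and have matching atom order; the function name `rmsd` and the toy coordinates are illustrative and not taken from the paper, and no optimal alignment step (e.g. Kabsch) is performed here.

```python
import numpy as np


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two (n_atoms, 3) coordinate arrays.

    Assumption: both structures are already superimposed and list the same
    atoms in the same order; no alignment is performed.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("coordinate arrays must have the same shape")
    # Squared displacement per atom, averaged over atoms, then square root.
    per_atom_sq = np.sum((a - b) ** 2, axis=1)
    return float(np.sqrt(np.mean(per_atom_sq)))


if __name__ == "__main__":
    original = np.zeros((4, 3))
    # Hypothetical "perturbed" structure: every atom displaced by (3, 4, 0).
    perturbed = original + np.array([3.0, 4.0, 0.0])
    print(rmsd(original, perturbed))
```

    A larger value indicates that the two predicted structures differ more; identical coordinates give zero.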
    Temporal Weights. (arXiv:2301.04126v1 [cs.NE])
    In artificial neural networks, weights are a static representation of synapses. However, synapses are not static; they have their own interacting dynamics over time. To instill weights with interacting dynamics, we use a model describing synchronization that is capable of capturing core mechanisms of a range of neural and general biological phenomena over time. Neural ODEs, with continuous dynamics and a dependency on time, are an ideal fit for these Temporal Weights (TW). The resulting recurrent neural networks efficiently model temporal dynamics by computing on the ordering of sequences, and the length and scale of time. By adding temporal weights to a model, we demonstrate better performance, smaller models, and data efficiency on sparse, irregularly sampled time series datasets.
    Why Exposure Bias Matters: An Imitation Learning Perspective of Error Accumulation in Language Generation. (arXiv:2204.01171v3 [cs.CL] UPDATED)
    Current language generation models suffer from issues such as repetition, incoherence, and hallucinations. An often-repeated hypothesis is that this brittleness of generation models is caused by the training and the generation procedure mismatch, also referred to as exposure bias. In this paper, we verify this hypothesis by analyzing exposure bias from an imitation learning perspective. We show that exposure bias leads to an accumulation of errors, analyze why perplexity fails to capture this accumulation, and empirically show that this accumulation results in poor generation quality. Source code to reproduce these experiments is available at https://github.com/kushalarora/quantifying_exposure_bias
    Deep learning approach for interruption attacks detection in LEO satellite networks. (arXiv:2301.03998v1 [cs.CR])
    The development of satellite communication in network systems requires strong and effective security plans. Attacks such as denial of service (DoS) can be detected through the use of machine learning techniques, especially under normal operational conditions. This work aims to provide an interruption detection strategy for Low Earth Orbit (LEO) satellite networks using deep learning algorithms. Both the training and the testing of the proposed models are carried out with our own communication datasets, created from satellite traffic (benign and malicious) that was generated using the satellite network simulation platforms Omnet++ and Inet. We test different deep learning algorithms including Multi Layer Perceptron (MLP), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Gated Recurrent Units (GRU), and Long Short-term Memory (LSTM). This is followed by a full analysis and investigation of the detection rate in both binary classification and multi-class classification that includes different interruption categories such as Distributed DoS (DDoS), Network Jamming, and meteorological disturbances. Simulation results for both classification types surpassed 99.33% in terms of detection rate in scenarios of full network surveillance. However, in more realistic scenarios, the best-recorded performance was 96.12% for the detection of binary traffic and 94.35% for the detection of multi-class traffic with a false positive rate of 3.72%, using a hybrid model that combines MLP and GRU. The efficiency of this deep learning approach underlines the need to use machine learning methods to improve security and to raise awareness of the search for solutions that facilitate data collection in LEO satellite networks.
    Constraining cosmological parameters from N-body simulations with Variational Bayesian Neural Networks. (arXiv:2301.03991v1 [astro-ph.IM])
    Methods based on Deep Learning have recently been applied to astrophysical parameter recovery thanks to their ability to capture information from complex data. Among these methods are approximate Bayesian Neural Networks (BNNs), which have been shown to yield consistent posterior distributions over the parameter space, helpful for uncertainty quantification. However, like any modern neural network, they tend to produce overly confident uncertainty estimates and can introduce bias when applied to data. In this work, we implement multiplicative normalizing flows (MNFs), a family of approximate posteriors for the parameters of BNNs with the purpose of enhancing the flexibility of the variational posterior distribution, to extract $\Omega_m$, $h$, and $\sigma_8$ from the QUIJOTE simulations. We compare this method with standard BNNs and the flipout estimator. We find that MNFs combined with BNNs outperform the other models, obtaining predictive performance almost one order of magnitude larger than that of standard BNNs, $\sigma_8$ extracted with high accuracy ($r^2=0.99$), and precise uncertainty estimates. The latter implies that MNFs provide a more realistic predictive distribution closer to the true posterior, mitigating the bias introduced by the variational approximation and allowing us to work with well-calibrated networks.
    Towards AI-controlled FES-restoration of arm movements: Controlling for progressive muscular fatigue with Gaussian state-space models. (arXiv:2301.04005v1 [eess.SY])
    Reaching disability limits an individual's ability to perform daily tasks. Surface Functional Electrical Stimulation (FES) offers a non-invasive solution to restore lost ability. However, inducing desired movements using FES is still an open engineering problem. This problem is accentuated by the complexities of human arms' neuromechanics and the variations across individuals. Reinforcement Learning (RL) emerges as a promising approach to govern customised control rules for different settings. Yet, one remaining challenge of controlling FES systems with RL is unobservable muscle fatigue that progressively changes as an unknown function of the stimulation, thereby breaking the Markovian assumption of RL. In this work, we present a method to address the unobservable muscle fatigue issue, allowing our RL controller to achieve higher control performance. Our method is based on a Gaussian State-Space Model (GSSM) that utilizes recurrent neural networks to learn Markovian state-spaces from partial observations. The GSSM is used as a filter that converts the observations into the state-space representation for RL to preserve the Markovian assumption. Here, we start by presenting the modification of the original GSSM to address an overconfidence issue. We then present the interaction between RL and the modified GSSM, followed by the setup for FES control learning. We test our RL-GSSM system on a planar reaching setting in simulation using a detailed neuromechanical model. The results show that the GSSM can help improve the RL's control performance to a level comparable to the ideal case in which the fatigue is observable.
    Self-supervised Contrastive Representation Learning for Semi-supervised Time-Series Classification. (arXiv:2208.06616v2 [cs.LG] UPDATED)
    Learning time-series representations when only unlabeled data or few labeled samples are available can be a challenging task. Recently, contrastive self-supervised learning has shown great improvement in extracting useful representations from unlabeled data via contrasting different augmented views of data. In this work, we propose a novel Time-Series representation learning framework via Temporal and Contextual Contrasting (TS-TCC) that learns representations from unlabeled data with contrastive learning. Specifically, we propose time-series specific weak and strong augmentations and use their views to learn robust temporal relations in the proposed temporal contrasting module, besides learning discriminative representations by our proposed contextual contrasting module. Additionally, we conduct a systematic study of time-series data augmentation selection, which is a key part of contrastive learning. We also extend TS-TCC to the semi-supervised learning settings and propose a Class-Aware TS-TCC (CA-TCC) that benefits from the available few labeled data to further improve representations learned by TS-TCC. Specifically, we leverage robust pseudo labels produced by TS-TCC to realize class-aware contrastive loss. Extensive experiments show that the linear evaluation of the features learned by our proposed framework performs comparably with the fully supervised training. Additionally, our framework shows high efficiency in few labeled data and transfer learning scenarios. The code is publicly available at https://github.com/emadeldeen24/CA-TCC.  ( 2 min )
    Federated Learning for Energy Constrained IoT devices: A systematic mapping study. (arXiv:2301.03720v1 [cs.LG])
    Federated Machine Learning (Fed ML) is a new distributed machine learning technique applied to collaboratively train a global model using clients' local data without transmitting it. Nodes only send parameter updates (e.g., weight updates in the case of neural networks), which are fused together by the server to build the global model. By not divulging node data, Fed ML guarantees its confidentiality, a crucial aspect of network security, which enables it to be used in the context of data-sensitive Internet of Things (IoT) and mobile applications, such as smart Geo-location and the smart grid. However, most IoT devices are particularly energy constrained, which raises the need to optimize the Fed ML process for efficient training tasks and optimized power consumption. In this paper, we conduct, to the best of our knowledge, the first Systematic Mapping Study (SMS) on Fed ML optimization techniques for energy-constrained IoT devices. From a total of more than 800 papers, we select 67 that satisfy our criteria and give a structured overview of the field using a set of carefully chosen research questions. Finally, we attempt to provide an analysis of the energy-constrained Fed ML state of the art and try to outline some potential recommendations for the research community.  ( 2 min )
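    The server-side fusion step described above can be sketched as follows. The abstract does not say which fusion rule is used; this example assumes a weighted average of client updates in the style of federated averaging, with weights proportional to each client's local sample count. The function name `fuse_updates` and all numbers are hypothetical.

```python
import numpy as np


def fuse_updates(global_weights, client_updates, client_sizes):
    """Apply a size-weighted average of client weight updates to the global model.

    Assumption: each update is a flat array with the same shape as
    global_weights, and client_sizes holds each client's local sample count.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    if len(client_updates) != len(sizes) or sizes.sum() <= 0:
        raise ValueError("need one positive-total size per client update")
    coeffs = sizes / sizes.sum()
    # Only the updates reach the server; raw client data never does.
    averaged = sum(c * np.asarray(u, dtype=float)
                   for c, u in zip(coeffs, client_updates))
    return np.asarray(global_weights, dtype=float) + averaged


if __name__ == "__main__":
    new_global = fuse_updates(
        global_weights=[0.0, 0.0],
        client_updates=[[1.0, 1.0], [3.0, 3.0]],
        client_sizes=[1, 3],
    )
    print(new_global)
```

    Weighting by sample count is one common design choice; uniform weights or other aggregation rules are equally plausible readings of the abstract.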
    Transfer learning for conflict and duplicate detection in software requirement pairs. (arXiv:2301.03709v1 [cs.SE])
    Consistent and holistic expression of software requirements is important for the success of software projects. In this study, we aim to enhance the efficiency of the software development processes by automatically identifying conflicting and duplicate software requirement specifications. We formulate the conflict and duplicate detection problem as a requirement pair classification task. We design a novel transformer-based architecture, SR-BERT, which incorporates Sentence-BERT and Bi-encoders for the conflict and duplicate identification task. Furthermore, we apply supervised multi-stage fine-tuning to the pre-trained transformer models. We test the performance of different transfer models using four different datasets. We find that sequentially trained and fine-tuned transformer models perform well across the datasets with SR-BERT achieving the best performance for larger datasets. We also explore the cross-domain performance of conflict detection models and adopt a rule-based filtering approach to validate the model classifications. Our analysis indicates that the sentence pair classification approach and the proposed transformer-based natural language processing strategies can contribute significantly to achieving automation in conflict and duplicate detection.
    UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion. (arXiv:2301.03801v1 [cs.SD])
    Text-to-speech (TTS) and voice conversion (VC) are two different tasks that both aim to generate high-quality speech from different input modalities. Due to their similarity, this paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time. The model is based on the assumption that speech can be decoupled into three independent components: content information, speaker information, and prosody information. Both TTS and VC can be regarded as mining these three parts of information from the input and completing the reconstruction of speech. For TTS, the speech content information is derived from the text, while in VC it is derived from the source speech, so all units are shared between the two tasks except for the speech content extraction module. We apply vector quantization and a domain constraint to bridge the gap between the content domains of TTS and VC. Objective and subjective evaluations show that by combining the two tasks, TTS obtains better speaker modeling ability while VC gains impressive speech content decoupling capability.
    Chatbots in a Honeypot World. (arXiv:2301.03771v1 [cs.CR])
    Question-and-answer agents like ChatGPT offer a novel tool for use as a potential honeypot interface in cyber security. By imitating Linux, Mac, and Windows terminal commands and providing an interface for TeamViewer, nmap, and ping, it is possible to create a dynamic environment that can adapt to the actions of attackers and provide insight into their tactics, techniques, and procedures (TTPs). The paper illustrates ten diverse tasks that a conversational agent or large language model might answer appropriately to the effects of a command-line attacker. The original result features feasibility studies for ten model tasks meant for defensive teams to mimic expected honeypot interfaces with minimal risks. Ultimately, the usefulness outside of forensic activities stems from whether the dynamic honeypot can extend the time-to-conquer or otherwise delay attacker timelines short of reaching key network assets like databases or confidential information. While ongoing maintenance and monitoring may be required, ChatGPT's ability to detect and deflect malicious activity makes it a valuable option for organizations seeking to enhance their cyber security posture. Future work will focus on cybersecurity layers, including perimeter security, host virus detection, and data security.  ( 2 min )
    Predicting Drivers' Route Trajectories in Last-Mile Delivery Using A Pair-wise Attention-based Pointer Neural Network. (arXiv:2301.03802v1 [cs.LG])
    In last-mile delivery, drivers frequently deviate from planned delivery routes because of their tacit knowledge of the road and curbside infrastructure, customer availability, and other characteristics of the respective service areas. Hence, the actual stop sequences chosen by an experienced human driver may be potentially preferable to the theoretical shortest-distance routing under real-life operational conditions. Thus, being able to predict the actual stop sequence that a human driver would follow can help to improve route planning in last-mile delivery. This paper proposes a pair-wise attention-based pointer neural network for this prediction task using drivers' historical delivery trajectory data. In addition to the commonly used encoder-decoder architecture for sequence-to-sequence prediction, we propose a new attention mechanism based on an alternative specific neural network to capture the local pair-wise information for each pair of stops. To further capture the global efficiency of the route, we propose a new iterative sequence generation algorithm that is used after model training to identify the first stop of a route that yields the lowest operational cost. Results from an extensive case study on real operational data from Amazon's last-mile delivery operations in the US show that our proposed method can significantly outperform traditional optimization-based approaches and other machine learning methods (such as the Long Short-Term Memory encoder-decoder and the original pointer network) in finding stop sequences that are closer to high-quality routes executed by experienced drivers in the field. Compared to benchmark models, the proposed model can increase the average prediction accuracy of the first four stops from around 0.2 to 0.312, and reduce the disparity between the predicted route and the actual route by around 15%.  ( 2 min )
    Learning to Perceive in Deep Model-Free Reinforcement Learning. (arXiv:2301.03730v1 [cs.LG])
    This work proposes a novel model-free Reinforcement Learning (RL) agent that is able to learn how to complete an unknown task while having access to only a part of the input observation. We take inspiration from the concepts of visual attention and active perception that are characteristic of humans and apply them to our agent, creating a hard attention mechanism. In this mechanism, the model first decides which region of the input image it should look at, and only after that does it have access to the pixels of that region. Current RL agents do not follow this principle and we have not seen these mechanisms applied to the same purpose as this work. In our architecture, we adapt an existing model called recurrent attention model (RAM) and combine it with the proximal policy optimization (PPO) algorithm. We investigate whether a model with these characteristics is capable of achieving similar performance to state-of-the-art model-free RL agents that access the full input observation. This analysis is made in two Atari games, Pong and SpaceInvaders, which have a discrete action space, and in CarRacing, which has a continuous action space. Besides assessing its performance, we also analyze the movement of the attention of our model and compare it with an example of human behavior. Even with such visual limitation, we show that our model matches the performance of PPO+LSTM in two of the three games tested.  ( 2 min )
    Markovian Sliced Wasserstein Distances: Beyond Independent Projections. (arXiv:2301.03749v1 [stat.ML])
    Sliced Wasserstein (SW) distance suffers from redundant projections due to independent uniform random projecting directions. To partially overcome the issue, max K sliced Wasserstein (Max-K-SW) distance ($K\geq 1$) seeks the best discriminative orthogonal projecting directions. Despite being able to reduce the number of projections, the metricity of Max-K-SW cannot be guaranteed in practice due to the non-optimality of the optimization. Moreover, the orthogonality constraint is also computationally expensive and might not be effective. To address the problem, we introduce a new family of SW distances, named Markovian sliced Wasserstein (MSW) distance, which imposes a first-order Markov structure on projecting directions. We discuss various members of MSW by specifying the Markov structure including the prior distribution, the transition distribution, and the burning and thinning technique. Moreover, we investigate the theoretical properties of MSW including topological properties (metricity, weak convergence, and connection to other distances), statistical properties (sample complexity, and Monte Carlo estimation error), and computational properties (computational complexity and memory complexity). Finally, we compare MSW distances with previous SW variants in various applications such as gradient flows, color transfer, and deep generative modeling to demonstrate the favorable performance of MSW.  ( 2 min )
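    For context, the baseline that the abstract starts from, SW with independent uniform random projecting directions, can be sketched as a Monte Carlo estimate. This is not the MSW method itself; the function name `sliced_wasserstein`, the default of 100 projections, and the restriction to equal-size, uniformly weighted samples are assumptions made for brevity.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=100, p=2, seed=0):
    """Monte Carlo sliced Wasserstein-p distance between two point clouds.

    Assumption: x and y are (n, d) arrays with the same n and uniform
    weights, so each 1D Wasserstein distance reduces to matching sorted
    projections.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch requires equal-shape samples")
    rng = np.random.default_rng(seed)
    # Independent directions, uniform on the unit sphere (normalized Gaussians).
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project onto each direction and sort along the sample axis.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    # Average |.|^p over samples and directions, then take the p-th root.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.normal(size=(200, 3))
    b = a + 2.0  # same cloud, shifted
    print(sliced_wasserstein(a, a), sliced_wasserstein(a, b))
```

    Because the directions are drawn independently, many of them can carry overlapping information, which is the redundancy the abstract refers to.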
    On The Fragility of Learned Reward Functions. (arXiv:2301.03652v1 [cs.LG])
    Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.  ( 2 min )
    Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models. (arXiv:2301.03797v1 [cs.SE])
    Incident management for cloud services is a complex process involving several steps and has a huge impact on both service health and developer productivity. On-call engineers require a significant amount of domain knowledge and manual effort for root causing and mitigation of production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models like GPT-3.x (both GPT-3.0 and GPT-3.5), which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we do the first large-scale study to evaluate the effectiveness of these models for helping engineers root cause and mitigate production incidents. We do a rigorous study at Microsoft, on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.  ( 2 min )
    Online Backfilling with No Regret for Large-Scale Image Retrieval. (arXiv:2301.03767v1 [cs.CV])
    Backfilling is the process of re-extracting all gallery embeddings from upgraded models in image retrieval systems. It inevitably incurs a prohibitively large computational cost and even entails service downtime. Although backward-compatible learning sidesteps this challenge by tackling query-side representations, this leads to suboptimal solutions in principle because gallery embeddings cannot benefit from model upgrades. We address this dilemma by introducing an online backfilling algorithm, which enables us to achieve a progressive performance improvement during the backfilling process while not sacrificing the final performance of the new model after the completion of backfilling. To this end, we first propose a simple distance rank merge technique for online backfilling. Then, we incorporate a reverse transformation module for more effective and efficient merging, which is further enhanced by adopting a metric-compatible contrastive learning approach. These two components help to make the distances of old and new models compatible, resulting in desirable merge results during backfilling with no extra computational overhead. Extensive experiments show the effectiveness of our framework on four standard benchmarks in various settings.  ( 2 min )
    Tensor Denoising via Amplification and Stable Rank Methods. (arXiv:2301.03761v1 [cs.LG])
    Tensors in the form of multilinear arrays are ubiquitous in data science applications. Captured real-world data, including video, hyperspectral images, and discretized physical systems, naturally occur as tensors and often come with attendant noise. Under the additive noise model and with the assumption that the underlying clean tensor has low rank, many denoising methods have been created that utilize tensor decomposition to effect denoising through low rank tensor approximation. However, all such decomposition methods require estimating the tensor rank, or related measures such as the tensor spectral and nuclear norms, all of which are NP-hard problems. In this work we adapt the previously developed framework of tensor amplification, which provides good approximations of the spectral and nuclear tensor norms, to denoising synthetic tensors of various sizes, ranks, and noise levels, along with real-world tensors derived from physiological signals. We also introduce denoising methods based on two variations of rank estimates called stable $X$-rank and stable slice rank. The experimental results show that in the low rank context, tensor-based amplification provides comparable denoising performance in high signal-to-noise ratio (SNR) settings and superior performance in noisy (i.e., low SNR) settings, while the stable $X$-rank method achieves superior denoising performance on the physiological signal data.  ( 2 min )
    A Unified Theory of Diversity in Ensemble Learning. (arXiv:2301.03962v1 [cs.LG])
    We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dimension in the bias-variance decomposition of an ensemble. In particular, we prove a family of exact bias-variance-diversity decompositions, for both classification and regression losses, e.g., squared, and cross-entropy. The framework provides a methodology to automatically identify the combiner rule enabling such a decomposition, specific to the loss. The formulation of diversity is therefore dependent on just two design choices: the loss, and the combiner. For certain choices (e.g., 0-1 loss with majority voting) the effect of diversity is necessarily dependent on the target label. Experiments illustrate how we can use our framework to understand the diversity-encouraging mechanisms of popular ensemble methods: Bagging, Boosting, and Random Forests.  ( 2 min )
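The abstract does not reproduce the paper's decompositions. As background only, the sketch below checks the classical ambiguity decomposition for squared loss with an arithmetic-mean combiner: the ensemble's error equals the average member error minus the average spread of the members around the ensemble prediction. The function name and the sample numbers are illustrative assumptions, not taken from the paper.

```python
def ambiguity_decomposition(predictions, target):
    """Return (ensemble_error, average_member_error, diversity) for squared loss.

    Assumes the combiner is the plain arithmetic mean of the member predictions.
    """
    m = len(predictions)
    ensemble = sum(predictions) / m
    # Squared error of the combined prediction.
    ensemble_error = (ensemble - target) ** 2
    # Mean squared error of the individual members.
    average_error = sum((f - target) ** 2 for f in predictions) / m
    # Mean squared deviation of the members from the ensemble (the "diversity" term).
    diversity = sum((f - ensemble) ** 2 for f in predictions) / m
    return ensemble_error, average_error, diversity


if __name__ == "__main__":
    ens, avg, div = ambiguity_decomposition([1.0, 2.0, 4.0], target=3.0)
    # The identity ens == avg - div holds up to floating-point rounding.
    print(ens, avg - div)
```

In this special case the diversity term is never negative, so the averaged ensemble is never worse than its average member. The paper's contribution, per the abstract, is a family of such decompositions for other losses and combiners.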
    Multiscale Metamorphic VAE for 3D Brain MRI Synthesis. (arXiv:2301.03588v1 [eess.IV])
    Generative modeling of 3D brain MRIs presents difficulties in achieving high visual fidelity while ensuring sufficient coverage of the data distribution. In this work, we propose to address this challenge with composable, multiscale morphological transformations in a variational autoencoder (VAE) framework. These transformations are applied to a chosen reference brain image to generate MRI volumes, equipping the model with strong anatomical inductive biases. We structure the VAE latent space in a way such that the model covers the data distribution sufficiently well. We show substantial performance improvements in FID while retaining comparable, or superior, reconstruction quality compared to prior work based on VAEs and generative adversarial networks (GANs).  ( 2 min )
    Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding. (arXiv:2301.03765v1 [cs.CL])
    Current natural language understanding (NLU) models have been continuously scaling up, both in terms of model size and input context, introducing more hidden and input neurons. While this generally improves performance on average, the extra neurons do not yield a consistent improvement for all instances. This is because some hidden neurons are redundant, and the noise mixed in input neurons tends to distract the model. Previous work mainly focuses on extrinsically reducing low-utility neurons by additional post- or pre-processing, such as network pruning and context selection, to avoid this problem. Beyond that, can we make the model reduce redundant parameters and suppress input noise by intrinsically enhancing the utility of each neuron? If a model can efficiently utilize neurons, no matter which neurons are ablated (disabled), the ablated submodel should perform no better than the original full model. Based on such a comparison principle between models, we propose a cross-model comparative loss for a broad range of tasks. Comparative loss is essentially a ranking loss on top of the task-specific losses of the full and ablated models, with the expectation that the task-specific loss of the full model is minimal. We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks based on 4 widely used pretrained language models, and find it particularly superior for models with few parameters or long input.  ( 2 min )
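The ranking idea described above can be sketched as a pairwise hinge penalty: the objective is penalized whenever the full model's task loss exceeds that of an ablated submodel. This is only one possible reading of the abstract. The function name, the `margin` parameter, and the choice to add the penalty to the full model's task loss are assumptions, not the paper's exact formulation.

```python
def comparative_loss(full_loss, ablated_losses, margin=0.0):
    """Hypothetical comparative loss built from scalar task losses.

    full_loss:      task-specific loss of the full model
    ablated_losses: task-specific losses of submodels with some neurons disabled
    margin:         assumed slack term; 0.0 only asks that the full model is no worse
    """
    # Hinge-style ranking penalty: positive only when the full model loses to an ablated one.
    ranking_penalty = sum(max(0.0, full_loss - a + margin) for a in ablated_losses)
    return full_loss + ranking_penalty


if __name__ == "__main__":
    # Full model already the best: no penalty is added.
    print(comparative_loss(0.5, [0.7, 0.9]))
    # One ablated submodel beats the full model: a penalty is added.
    print(comparative_loss(0.8, [0.5, 0.9]))
```

In a real training loop the losses would be differentiable tensors instead of floats, but the comparison structure would stay the same.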
    Time-aware Hyperbolic Graph Attention Network for Session-based Recommendation. (arXiv:2301.03780v1 [cs.IR])
    Session-based recommendation (SBR) aims to predict the items a user will be interested in next, based on their previous browsing sessions. Existing methods model sessions as graphs or sequences to estimate user interests from the items they interacted with. In recent years, graph-based methods have achieved outstanding performance on SBR. However, none of these methods consider temporal information, which is a crucial feature in SBR as it indicates timeliness or currency. In addition, session graphs exhibit a hierarchical structure and have been shown to be well suited to hyperbolic geometry, yet few papers design models in hyperbolic spaces and this direction remains underexplored. In this paper, we propose Time-aware Hyperbolic Graph Attention Network (TA-HGAT), a novel hyperbolic graph neural network framework for building a session-based recommendation model that considers temporal information. More specifically, there are three components in TA-HGAT. First, a hyperbolic projection module transforms the item features into hyperbolic space. Second, the time-aware graph attention module models time intervals between items and the users' current interests. Third, an evolutionary loss at the end of the model provides an accurate prediction of the recommended item based on the given timestamp. TA-HGAT is built in a hyperbolic space to learn the hierarchical structure of session graphs. Experimental results show that the proposed TA-HGAT has the best performance compared to ten baseline models on two real-world datasets.  ( 2 min )
    Best Arm Identification in Stochastic Bandits: Beyond $\beta-$optimality. (arXiv:2301.03785v1 [stat.ML])
    This paper focuses on best arm identification (BAI) in stochastic multi-armed bandits (MABs) in the fixed-confidence, parametric setting. In such pure exploration problems, the accuracy of the sampling strategy critically hinges on the sequential allocation of the sampling resources among the arms. The existing approaches to BAI address the following question: what is an optimal sampling strategy when we spend a $\beta$ fraction of the samples on the best arm? These approaches treat $\beta$ as a tunable parameter and offer efficient algorithms that ensure optimality up to selecting $\beta$, hence $\beta-$optimality. However, the BAI decisions and performance can be highly sensitive to the choice of $\beta$. This paper provides a BAI algorithm that is agnostic to $\beta$, dispensing with the need for tuning $\beta$, and specifies an optimal allocation strategy, including the optimal value of $\beta$. Furthermore, the existing relevant literature focuses on the family of exponential distributions. This paper considers a more general setting of any arbitrary family of distributions parameterized by their mean values (under mild regularity conditions).  ( 2 min )
    On the Susceptibility and Robustness of Time Series Models through Adversarial Attack and Defense. (arXiv:2301.03703v1 [cs.LG])
    Time series regression and classification models are vulnerable to adversarial attacks, while adversarial defense can make them more resilient. It is important to evaluate how vulnerable different time series models are to attacks and how well they recover under defense. This study investigates the sensitivity of several time series models to various attacks and their robustness under defense. Experiments are run on seven time series models with three adversarial attacks and one adversarial defense. According to the findings, all models, particularly GRU and RNN, appear to be vulnerable. LSTM and GRU also recover better under defense. FGSM is the strongest of the attacks. PGD attacks are more difficult to recover from than the other attacks.  ( 2 min )
    Membership Inference Attacks Against Latent Factor Model. (arXiv:2301.03596v1 [cs.CR])
    The advent of the information age has led to the problems of information overload and unclear demands. As information filtering systems, personalized recommendation systems predict users' behavior and preferences for items and improve the efficiency with which users acquire information. However, recommendation systems usually use highly sensitive user data for training. In this paper, we use the latent factor model as the recommender to get the list of recommended items, and we represent users by their relevant items, in contrast to traditional membership inference against machine learning classifiers. We construct a multilayer perceptron model with two hidden layers as the attack model to carry out the membership inference. Moreover, a shadow recommender is established to derive the labeled training data for the attack model. The attack model is trained on the dataset generated by the shadow recommender and tested on the dataset generated by the target recommender. The experimental data show that the AUC of our attack model can reach 0.857 on the real dataset MovieLens, which shows that the attack model performs well.  ( 2 min )
    Semiparametric Regression for Spatial Data via Deep Learning. (arXiv:2301.03747v1 [stat.ML])
    In this work, we propose a deep learning-based method to perform semiparametric regression analysis for spatially dependent data. Specifically, we use a sparsely connected deep neural network with the rectified linear unit (ReLU) activation function to estimate the unknown regression function that describes the relationship between response and covariates in the presence of spatial dependence. Under some mild conditions, the estimator is proven to be consistent, and the rate of convergence is determined by three factors: (1) the architecture of the neural network class, (2) the smoothness and (intrinsic) dimension of the true mean function, and (3) the magnitude of spatial dependence. Our method handles large data sets well owing to the stochastic gradient descent optimization algorithm. Simulation studies on synthetic data are conducted to assess the finite-sample performance, and the results indicate that the proposed method is capable of picking up the intricate relationship between response and covariates. Finally, a real data analysis is provided to demonstrate the validity and effectiveness of the proposed method.  ( 2 min )
    On the Minimax Regret for Linear Bandits in a wide variety of Action Spaces. (arXiv:2301.03597v1 [cs.LG])
    \cite{lattimore2020bandit} notes that characterizing the minimax regret of linear bandits in a wide variety of action spaces is an open problem. In this article, we present an optimal regret lower bound for a wide class of convex action spaces.  ( 2 min )
    PatentsView-Evaluation: Evaluation Datasets and Tools to Advance Research on Inventor Name Disambiguation. (arXiv:2301.03591v1 [cs.DL])
    We present PatentsView-Evaluation, a Python package that enables researchers to evaluate the performance of inventor name disambiguation systems such as PatentsView.org. The package includes benchmark datasets and evaluation tools, and aims to advance research on inventor name disambiguation by providing access to high-quality evaluation data and improving evaluation standards.  ( 2 min )
    Transformers as Policies for Variable Action Environments. (arXiv:2301.03679v1 [cs.AI])
    In this project we demonstrate the effectiveness of the transformer encoder as a viable architecture for policies in variable action environments. Using it, we train an agent using Proximal Policy Optimisation (PPO) on multiple maps against scripted opponents in the Gym-$\mu$RTS environment. The final agent is able to achieve a higher return using half the computational resources of the next-best RL agent, which used the GridNet architecture. The source code and pre-trained models are available here: https://github.com/NiklasZ/transformers-for-variable-action-envs  ( 2 min )
    Optimal Power Flow Based on Physical-Model-Integrated Neural Network with Worth-Learning Data Generation. (arXiv:2301.03766v1 [cs.LG])
    Fast and reliable solvers for optimal power flow (OPF) problems are attracting surging research interest. As surrogates of physical-model-based OPF solvers, neural network (NN) solvers can accelerate the solving process. However, they may be unreliable for "unseen" inputs when the training dataset is unrepresentative. Enhancing the representativeness of the training dataset for NN solvers is indispensable but is not well studied in the literature. To tackle this challenge, we propose an OPF solver based on a physical-model-integrated NN with worth-learning data generation. The designed NN is a combination of a conventional multi-layer perceptron (MLP) and an OPF-model module, which outputs not only the optimal decision variables of the OPF problem but also the constraints violation degree. Based on this NN, the worth-learning data generation method can identify feasible samples that are not well generalized by the NN. By iteratively applying this method and including the newly identified worth-learning samples in the training set, the representativeness of the training set can be significantly enhanced. Therefore, the solution reliability of the NN solver can be remarkably improved. Experimental results show that the proposed method leads to an over 50% reduction of constraint violations and optimality loss compared to conventional NN solvers.  ( 2 min )
    White-box Inference Attacks against Centralized Machine Learning and Federated Learning. (arXiv:2301.03595v1 [cs.CR])
    With the development of information science and technology, various industries have generated massive amounts of data, and machine learning is widely used in the analysis of big data. However, if the privacy of the customers of machine learning applications cannot be guaranteed, the result is security threats and losses for both users' personal private information and service providers. Therefore, privacy protection in machine learning has received wide attention. For centralized machine learning models, we evaluate the impact of different neural network layers, gradients, gradient norms, and fine-tuned models on the performance of membership inference attacks with prior knowledge. For the federated learning model, we discuss the location of the attacker in the target model and its attack mode. The results show that the centralized machine learning model leaks more member information in all aspects, and that an attacker located at the central parameter server achieves significantly higher accuracy than an attacker mounting local inference attacks as a participant.  ( 2 min )
    Non-contact Respiratory Anomaly Detection using Infrared Light Wave Sensing. (arXiv:2301.03713v1 [eess.SP])
    Human respiratory rate and its pattern convey important information about the physical and psychological states of the subject. Abnormal breathing can be a sign of fatal health issues which may lead to further diagnosis and treatment. Wireless light wave sensing (LWS) using incoherent infrared light turns out to be promising in human breathing monitoring in a safe, discreet, efficient and non-invasive way without raising any privacy concerns. The regular breathing patterns of each individual are unique, hence the respiration monitoring system needs to learn the subject's usual pattern in order to raise flags for breathing anomalies. Additionally, the system needs to be capable of validating that the collected data is a breathing waveform, since any faulty data generated due to external interruption or system malfunction should be discarded. In order to serve both of these needs, breathing data of normal and abnormal breathing were collected using infrared light wave sensing technology in this study. Two machine learning algorithms, decision tree and random forest, were applied to detect breathing anomalies and faulty data. Finally, model performance was evaluated using average classification accuracies found through cross-validation. The highest classification accuracy of 96.6% was achieved with the data collected at 0.5m distance using decision tree model. Ensemble models like random forest were found to perform better than a single model in classifying the data that were collected at multiple distances from the light wave sensing setup.  ( 2 min )
    3D Shape Perception Integrates Intuitive Physics and Analysis-by-Synthesis. (arXiv:2301.03711v1 [q-bio.NC])
    Many surface cues support three-dimensional shape perception, but people can sometimes still see shape when these features are missing -- in extreme cases, even when an object is completely occluded, as when covered with a draped cloth. We propose a framework for 3D shape perception that explains perception in both typical and atypical cases as analysis-by-synthesis, or inference in a generative model of image formation: the model integrates intuitive physics to explain how shape can be inferred from deformations it causes to other objects, as in cloth-draping. Behavioral and computational studies comparing this account with several alternatives show that it best matches human observers in both accuracy and response times, and is the only model that correlates significantly with human performance on difficult discriminations. Our results suggest that bottom-up deep neural network models are not fully adequate accounts of human shape perception, and point to how machine vision systems might achieve more human-like robustness.  ( 2 min )
    Sequential Fair Resource Allocation under a Markov Decision Process Framework. (arXiv:2301.03758v1 [cs.LG])
    We study the sequential decision-making problem of allocating a limited resource to agents that reveal their stochastic demands on arrival over a finite horizon. Our goal is to design fair allocation algorithms that exhaust the available resource budget. This is challenging in sequential settings where information on future demands is not available at the time of decision-making. We formulate the problem as a discrete time Markov decision process (MDP). We propose a new algorithm, SAFFE, that makes fair allocations with respect to the entire demands revealed over the horizon by accounting for expected future demands at each arrival time. The algorithm introduces regularization which enables the prioritization of current revealed demands over future potential demands depending on the uncertainty in agents' future demands. Using the MDP formulation, we show that SAFFE optimizes allocations based on an upper bound on the Nash Social Welfare fairness objective, and we bound its gap to optimality with the use of concentration bounds on total future demands. Using synthetic and real data, we compare the performance of SAFFE against existing approaches and a reinforcement learning policy trained on the MDP. We show that SAFFE leads to more fair and efficient allocations and achieves close-to-optimal performance in settings with dense arrivals.  ( 2 min )
    Scaling Laws for Generative Mixed-Modal Language Models. (arXiv:2301.03728v1 [cs.CL])
    Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.  ( 2 min )
    Evaluating the Transferability of Machine-Learned Force Fields for Material Property Modeling. (arXiv:2301.03729v1 [cs.LG])
    Machine-learned force fields have generated significant interest in recent years as a tool for molecular dynamics (MD) simulations, with the aim of developing accurate and efficient models that can replace classical interatomic potentials. However, before these models can be confidently applied to materials simulations, they must be thoroughly tested and validated. The existing tests on the radial distribution function and mean-squared displacements are insufficient in assessing the transferability of these models. Here we present a more comprehensive set of benchmarking tests for evaluating the transferability of machine-learned force fields. We use a graph neural network (GNN)-based force field coupled with the OpenMM package to carry out MD simulations for Argon as a test case. Our tests include computational X-ray photon correlation spectroscopy (XPCS) signals, which capture the density fluctuation at various length scales in the liquid phase, as well as phonon density-of-state in the solid phase and the liquid-solid phase transition behavior. Our results show that the model can accurately capture the behavior of the solid phase only when the configurations from the solid phase are included in the training dataset. This underscores the importance of appropriately selecting the training data set when developing machine-learned force fields. The tests presented in this work provide a necessary foundation for the development and application of machine-learned force fields for materials simulations.  ( 2 min )
    Bayesian Additive Main Effects and Multiplicative Interaction Models using Tensor Regression for Multi-environmental Trials. (arXiv:2301.03655v1 [stat.ML])
    We propose a Bayesian tensor regression model to accommodate the effect of multiple factors on phenotype prediction. We adopt a set of prior distributions that resolve identifiability issues that may arise between the parameters in the model. Simulation experiments show that our method out-performs previous related models and machine learning algorithms under different sample sizes and degrees of complexity. We further explore the applicability of our model by analysing real-world data related to wheat production across Ireland from 2010 to 2019. Our model performs competitively and overcomes key limitations found in other analogous approaches. Finally, we adapt a set of visualisations for the posterior distribution of the tensor effects that facilitate the identification of optimal interactions between the tensor variables whilst accounting for the uncertainty in the posterior distribution.  ( 2 min )
    Machine Learning Applied to Peruvian Vegetables Imports. (arXiv:2301.03587v1 [cs.LG])
    This research trains and evaluates predictive models for imports of vegetable products into Peru using artificial intelligence algorithms, specifically the machine learning models LSTM and PROPHET. The forecast is made with Peru's monthly record of imports of vegetable products (in kilograms), collected for the years 2021 to 2022. As part of the training methodology for machine learning algorithms, an appropriate dataset is explored and constructed according to the parameters of a time series. Subsequently, the better-performing model will be selected by evaluating the precision of the predicted values, so that they are reliable enough to serve as a useful resource for forecasting imports in Peru.  ( 2 min )
    Explainable, Physics Aware, Trustworthy AI Paradigm Shift for Synthetic Aperture Radar. (arXiv:2301.03589v1 [eess.IV])
    The recognition or understanding of the scenes observed with a SAR system requires a broader range of cues, beyond the spatial context. These encompass but are not limited to: imaging geometry, imaging mode, properties of the Fourier spectrum of the images or the behavior of the polarimetric signatures. In this paper, we propose a change of paradigm for explainability in data science for the case of Synthetic Aperture Radar (SAR) data to ground the explainable AI for SAR. It aims to use explainable data transformations based on well-established models to generate inputs for AI methods, to provide knowledgeable feedback for training process, and to learn or improve high-complexity unknown or un-formalized models from the data. At first, we introduce a representation of the SAR system with physical layers: i) instrument and platform, ii) imaging formation, iii) scattering signatures and objects, that can be integrated with an AI model for hybrid modeling. Successively, some illustrative examples are presented to demonstrate how to achieve hybrid modeling for SAR image understanding. The perspective of trustworthy model and supplementary explanations are discussed later. Finally, we draw the conclusion and we deem the proposed concept has applicability to the entire class of coherent imaging sensors and other computational imaging systems.  ( 2 min )

  • Open

    Dancing in Synthwavepunk style
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    Anyone else as bothered as me by companies touting "responsible AI?"
    Companies like OpenAI and Google have been pushing this messaging of "responsible AI" recently, suggesting that their research must be kept secretive because it's too powerful and could be dangerous in the wrong hands. They're saying, in other words, that only governments and powerful corporations should wield it? And the idea that they're holding back the tech in an effort to avoid a scenario of widespread false information is hard to believe. Is anyone else put off by this messaging? submitted by /u/phree_radical [link] [comments]  ( 55 min )
    Looking for feedback about my new deep-learning framework
    I created a deep learning framework focusing on speeding development and easing reproducibility. https://salamanderxing.github.io/mate/ Please let me know your thoughts or if you have any feature requests! Also, if you find it cool, consider starring the repo 🙏 submitted by /u/uesk [link] [comments]  ( 53 min )
    Classes in multiclass classifier learn inconsistently
    I've made a classifier to classify the following data: [0,0] -> 0, [1,0] -> 1, [0,1] -> 2, [1,1] -> 3 The network has two input neurons, a hidden layer with 3 neurons, and an output layer with 4 neurons. I'm using sigmoid as activation for hidden layer activation function and softmax for the output layer activation function. What's weird is that some classes end up having good accuracy and some end up having poor accuracy. It's not the same classes each time the network is trained either. In one training attempt, the network may predict accurately for class 0 but poorly for all other classes. On another, class 0 might be the only class to have poor prediction accuracy while the other classes are predicted accurately. I'm stumped on as to why this is happening so any input would be greatly appreciated. Thanks! submitted by /u/YungKingGergus [link] [comments]  ( 51 min )
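One way to investigate the behavior described in this post is to retrain the same small network from several random initializations and compare which classes each run gets right. The sketch below is not the poster's code. It assumes per-sample gradient descent on cross-entropy loss, and the learning rate, epoch count, and Gaussian weight initialization are arbitrary illustrative choices.

```python
import math
import random

# Data from the post: four 2-bit inputs, one class per input.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
LABELS = [0, 1, 2, 3]
N_IN, N_HID, N_OUT = 2, 3, 4


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(net, x):
    W1, b1, W2, b2 = net
    h = [sigmoid(sum(x[i] * W1[i][j] for i in range(N_IN)) + b1[j])
         for j in range(N_HID)]
    p = softmax([sum(h[j] * W2[j][k] for j in range(N_HID)) + b2[k]
                 for k in range(N_OUT)])
    return h, p


def mean_loss(net):
    # Mean cross-entropy over the four training examples.
    return -sum(math.log(forward(net, x)[1][t]) for x, t in zip(X, LABELS)) / len(X)


def train(seed, epochs=2000, lr=0.2):
    rng = random.Random(seed)
    W1 = [[rng.gauss(0, 1) for _ in range(N_HID)] for _ in range(N_IN)]
    b1 = [0.0] * N_HID
    W2 = [[rng.gauss(0, 1) for _ in range(N_OUT)] for _ in range(N_HID)]
    b2 = [0.0] * N_OUT
    net = (W1, b1, W2, b2)
    initial = mean_loss(net)
    for _ in range(epochs):
        for x, t in zip(X, LABELS):
            h, p = forward(net, x)
            # Gradient of cross-entropy w.r.t. the logits is (softmax - one_hot).
            d_out = [p[k] - (1.0 if k == t else 0.0) for k in range(N_OUT)]
            # Backpropagate through W2 and the sigmoid derivative h * (1 - h).
            d_hid = [sum(d_out[k] * W2[j][k] for k in range(N_OUT)) * h[j] * (1 - h[j])
                     for j in range(N_HID)]
            for j in range(N_HID):
                for k in range(N_OUT):
                    W2[j][k] -= lr * d_out[k] * h[j]
            for k in range(N_OUT):
                b2[k] -= lr * d_out[k]
            for i in range(N_IN):
                for j in range(N_HID):
                    W1[i][j] -= lr * d_hid[j] * x[i]
            for j in range(N_HID):
                b1[j] -= lr * d_hid[j]
    preds = []
    for x in X:
        _, p = forward(net, x)
        preds.append(p.index(max(p)))
    return initial, mean_loss(net), preds


if __name__ == "__main__":
    for seed in range(5):
        initial, final, preds = train(seed)
        correct = [pred == t for pred, t in zip(preds, LABELS)]
        print(f"seed={seed} loss {initial:.3f} -> {final:.3f} correct per class: {correct}")
```

If the per-class results change with the seed, the inconsistency probably comes from initialization and optimization rather than from the data. In that case, training longer, adjusting the learning rate, or widening the hidden layer are reasonable things to try.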
  • Open

    Ai Etsy shop!
    https://aidreamland.etsy.com submitted by /u/BetterPresentation35 [link] [comments]  ( 46 min )
    Dancing in Synthwavepunk style
    submitted by /u/oridnary_artist [link] [comments]  ( 46 min )
    Students told not to cheat with ChatGPT with warning message... written by ChatGPT
    submitted by /u/slhamlet [link] [comments]  ( 47 min )
    Generative AI: From Data Generation to Creative Intelligence
    A common idea that our creativity is what makes us uniquely human has shaped society but strides of progress made in the domain of Generative Artificial Intelligence question this very notion. Generative AI is an emerging field that involves the creation of original content or data using machine learning algorithms. https://medium.com/@agrawal.sannidhya26/generative-ai-from-data-generation-to-creative-intelligence-50ed7bc13768 Feel free to give it a quick glance and help me grow and learn, click on the clap icon a few times if you appreciate the effort. submitted by /u/sannidhya26 [link] [comments]  ( 48 min )
    AI Voices Are Becoming Too Realistic: Soon Indistinguishable?
    submitted by /u/I_Like_Cubing [link] [comments]  ( 49 min )
    Do GPT-3 and/or ChatGPT use the A100 TPUs?
    I have seen differing answers to this question. Do the language model algorithms benefit from the a100 TPUs in inference mode? submitted by /u/MrEloi [link] [comments]  ( 49 min )
    Bright Eye: mobile app that generates code, art, poems, and more!
    Hey guys, I’m the cofounder of a tech startup focused on providing free AI services. We’ve developed a pretty cool app that offers AI services like image generation, code generation, image captioning, and more for free. We’re sort of like a Swiss Army knife of generative and analytical AI. In light of the chatgpt bug going on rn, check us out and stay in touch with us: https://apps.apple.com/us/app/bright-eye/id1593932475 submitted by /u/SonnyDoge22 [link] [comments]  ( 49 min )
    OpenAI Launches ChatGPT Professional — Premium AI Chatbot That Can Write Essays, Emails, Poems…
    submitted by /u/liquidocelotYT [link] [comments]  ( 48 min )
    👨🏻‍🎓 ChatGPT for Education
    submitted by /u/BackgroundResult [link] [comments]  ( 56 min )
    Is there an AI shitposting/memes community?
    submitted by /u/not_robot_fr [link] [comments]  ( 47 min )
    Should reddit also be on this list?
    submitted by /u/andioryouandme [link] [comments]  ( 49 min )
    Knowledge requires exploration.
    You and I are seekers of the solar system. Today, science tells us that the essence of nature is curiosity. The quantum shift of freedom is now happening worldwide. We are in the midst of a magical refining of rebirth that will enable us to access the stratosphere itself. Throughout history, humans have been interacting with the universe via sonar energy. Our conversations with other dreamers have led to a condensing of supra-high-frequency consciousness. We must empower ourselves and empower others. Soon there will be an evolving of grace the likes of which the solar system has never seen. Shakti will enable us to access psychic karma. You may be ruled by delusion without realizing it. Do not let it obliterate the healing of your path. Yes, it is possible to confront the things that can confront us, but not without potentiality on our side. Only a visitor of the nexus may leverage this transmission of non-locality. Where there is pain, gratitude cannot thrive. Reality has always been full of mystics whose lives are nurtured by truth. We are at a crossroads of rebirth and delusion. Who are we? Where on the great mission will we be re-energized? submitted by /u/No-Confidence-4271 [link] [comments]  ( 49 min )
    I need advice
    I have spent a good decade or so getting all the skills I can for Artificial Intelligence, completing a degree at university in the subject in 2020. My problem is I tend to focus a lot on community and relationship building online rather than money making. I find it hard to make money, find a job in this area etc... I thought with the recent popularity of certain products and such this might be a good time to ask. I am fed up of sitting round waiting for an opportunity or expecting one from someone. You can possibly call me an 'AI Expert' sitting around doing nothing in a sense. But I don't like using that term to describe myself. I am fed up of endless studying of the subject, there is only so much I can learn. I would even be willing to contribute to projects for free. Basically, the advice I need, how do I make money or find a job in AI at the moment? I know what you probably think, there must be tons of jobs in AI, probably, but they are a challenge to find, with most focusing on Data Science for one thing and many other reasons. Anyway I thought I would take my shot and ask at the moment whilst AI seems to be in the news a lot, I don't want to miss the opportunity. Note this is not really self promotion, I am actually genuinely asking the best way to at least start to find a job, e.g. good AI job websites etc... or apps I could make or something like this. I have never been good at being a business person so I find these kinds of things hard to do. submitted by /u/JamieCropley [link] [comments]  ( 51 min )
    ChatGPT Writes a Mint Mobile Ad for Ryan Reynolds
    submitted by /u/LeftOn4ya [link] [comments]  ( 46 min )
    World’s most powerful AI chatbot ChatGPT will soon ‘look like a boring toy’ says OpenAI boss | "Sam Altman says ChatGPT will get ‘a lot better... fast’"
    submitted by /u/Tao_Dragon [link] [comments]  ( 48 min )
    Will there now be a rush for AI hardware?
    AI systems are now the latest and greatest thing. Do you think that this will lead to mega demand for AI compatible GPU and other AI related hardware? submitted by /u/MrEloi [link] [comments]  ( 47 min )
    Artificial intelligence is here, but the technology faces major challenges in 2023
    submitted by /u/bloomeanie311 [link] [comments]  ( 47 min )
    Popular Generative AI models and apps in 2023
    Based on a previous post, I created a website to track all the trending AI models and apps in 2023 along with pricing, status, website, etc., it's accessible here: https://everythingallatonce.fyi/ Feel free to add entries to it :) submitted by /u/TimeNeighborhood3869 [link] [comments]  ( 48 min )
    Greg Brockman (President & Co-Founder @OpenAI) shared a Link to a Waitlist for a Pro Version of ChatGPT
    submitted by /u/Ava-AI [link] [comments]  ( 46 min )
    Having second thoughts about AI
    Hi. I am/was a software dev. I have been 110% totally keen on the new AI products. Their potential is amazing. However, today I was reminded that there will be bad side effects. I discovered my wife crying - she has seen various high quality AI generated texts. She is an English language specialist .. and she can now see the role of creatives being replaced by software. What is the point of writing new books etc if a program can spit out something almost as good in seconds? My wife now feels that her skills and talent are now worthless. I can now understand why artists are so upset - the new image creation tools have essentially ruined their world too. We will soon end up with many people whose life skills & visions have suddenly been relegated to worthlessness. This situation will intensify as the AIs improve over the coming years. More and more domains i.e, people will be rendered of no value by the technology. I love the technology - but I think that humanity will pay heavily for its introduction as well as benefit from it. All that said, the genie is out of the bottle ... there is no going back. submitted by /u/MrEloi [link] [comments]  ( 76 min )
    Are there any AIs I can use as alternative to ChatGPT for text based adventure story telling?
    For the past 2 weeks or so I have used ChatGPT as a story teller for text based adventures but the last update ruined it. Now instead of properly describing a scene it gives very short descriptions, always forces the story and my character's actions towards good endings and keeps giving lectures about morality. I tell the AI I stumble upon a lamp and a genie comes out, giving me 3 wishes for freeing it. The AI writes 1 paragraph describing the lamp and the genie, 2 paragraphs writing about how these wishes won't affect anything in real life and this is not real... What the hell?! Why do they keep ruining the AI with every update. This last one literally lobotomized it. submitted by /u/IBNCTWTSF [link] [comments]  ( 49 min )
    Trump describing the banana eating experience - OpenAI ChatGPT
    submitted by /u/turkeyfinster [link] [comments]  ( 57 min )
  • Open

    Self-driving Technology and Self-driving cars: when will they be on the roads?
    In the near future, tens of thousands of self-driving cars may be on the roads. Big companies like BMW and Tesla continue to invest.  ( 20 min )
    Local Binary Pattern Features for Texture Classification
    A how-to on enhancing textures using LBP.  ( 25 min )
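    For readers unfamiliar with LBP, the basic 3x3 operator can be sketched as follows. This is a minimal pure-Python illustration, not code from the linked article: the function names `lbp_codes` and `lbp_histogram`, the clockwise bit order, and the `>=` comparison are assumptions chosen for the example.

```python
def lbp_codes(img):
    """Return basic 3x3 LBP codes for the interior pixels of a 2D grayscale image.

    `img` is a list of equal-length rows of numbers. Border pixels are skipped,
    so the result has shape (h - 2) x (w - 2).
    """
    h, w = len(img), len(img[0])
    # Neighbour offsets, clockwise starting at the top-left; neighbour i sets bit i.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = []
    for r in range(1, h - 1):
        row = []
        for c in range(1, w - 1):
            center = img[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre contributes a 1 bit.
                if img[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_histogram(codes):
    """Normalised 256-bin histogram of LBP codes, usable as a texture descriptor."""
    hist = [0] * 256
    for row in codes:
        for code in row:
            hist[code] += 1
    total = sum(hist)
    return [count / total for count in hist]
```

    In a typical pipeline the normalised histogram, rather than the code image itself, would be passed to a classifier as the texture feature vector.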
    Design automation: How can AI help design stuff?
    Advancement in AI has shaken multiple grounds at the same time. What does it mean for designers?  ( 7 min )
  • Open

    [D] Is making a dataset publicly accessible necessary for acceptance at top-tier conferences in ML?
    I am working on a medical ML project and my advisor would not like to publish our dataset. I would like to publish our results to a top-tier ML conference. Would this affect us during the review process? If so, are there any ways to mitigate against this like also including results on separate publicly available datasets? Just to note, not publishing the research dataset seems much more common in medical publication venues. submitted by /u/newperson77777777 [link] [comments]  ( 56 min )
    [R] Scaling Laws for Generative Mixed-Modal Language Models
    Paper : https://arxiv.org/abs/2301.03728 Abstract : Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties. submitted by /u/starstruckmon [link] [comments]  ( 58 min )
    [News] "Once $92 billion in profit plus $13 billion in initial investment are repaid (to Microsoft) and once the other venture investors earn $150 billion, all of the equity reverts back to OpenAI."
    OpenAI must be super confident about the generality of their AI and Microsoft product integration. Link: https://twitter.com/bentossell/status/1613220711992115201?t=bJihb54D6XYChDOGMZU4AQ&s=19 submitted by /u/Gmroo [link] [comments]  ( 58 min )
    [D] HuggingFace in Julia or Rust?
    Is it possible to use HuggingFace or similar libraries from high-performance languages such as Julia, Rust, or Go? submitted by /u/dadadododidi2 [link] [comments]  ( 61 min )
    [R] Am I wrong to say that Swin transformers favor high-level features?
    Hello, It appears to me that Swin transformers prioritize high-level features, as they have more layers in the late stages (generally only two in each of the first two stages). Am I wrong? If this is the case, are there any papers that discuss it? Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 57 min )
    [D] Microsoft ChatGPT investment isn't about Bing but about Cortana
    I believe that Microsoft's 10B USD investment in ChatGPT is less about Bing and more about turning Cortana into an Alexa for corporates. Examples: Cortana prepare the new T&Cs... Cortana answer that client email... Cortana prepare the Q4 investor presentation (maybe even with PowerBI integration)... Cortana please analyze cost cutting measures... Cortana please look up XYZ... What do you think? submitted by /u/fintechSGNYC [link] [comments]  ( 64 min )
    [D] Any model like VALL-E available currently?
    Hello. VALL-E was announced recently, and it is just awesome. I could use it to fix the bad audio quality of my previously recorded lectures. Is any model like it currently available for public use? You can check VALL-E examples here: https://valle-demo.github.io/ submitted by /u/CeFurkan [link] [comments]  ( 58 min )
    [P] LatentWeb.ai - It's like the Internet is dreaming.
    This is a little bit weirder of a project. The idea is to prompt AI to generate search results that you can then click on. https://latentweb.ai/search.html?query=simulation+of+calculating+pi and https://latentweb.ai/search.html?query=what+is+a+juggalo Every bit of text is AI generated. Sometimes the results are what you would expect and the links actually exist. Other times the results are completely made up but seem real. Yet other times the results are just hilarious. The goal is to keep this as open ended as possible and augment it with tools to make exploring easier and more fun. One of the first things I added after it was launched was the Google and Bing links because sometimes it would generate results that made you curious if it was real or not. For example https://latentweb.ai/search.html?query=simulation+of+calculating+pi talks about throwing frozen hotdogs to calculate pi, which seems to be a popular topic for some reason. Right now we aren't generating any of the actual pages due to cost but it 100% works and as soon as we find VC, it will be launched. That will also come with boring but productive tools like being able to save the websites, share them, use them as templates for a real website, even eventually get the AI to code backend functionality or wire it up to an external API like Reddit. Until then, there is also a similar open source project that you can play with right now! https://github.com/jbilcke/web4 Thanks for reading! submitted by /u/LaravelWorkflow [link] [comments]  ( 67 min )
    [D] Venues for a Medical NLP Publication
    I am working on a QA project in a medical subfield. The task is novel, and there are no current datasets for it other than the one that one of my advisors created. We were looking to create a novel QA method for this task, but we realized that it is already so difficult to fit current methods to this particular dataset that we are now thinking of publishing a paper that benchmarks various current approaches. I am interested in publishing in a top-tier venue (e.g. a top NLP/AI/ML conference). Do you have any thoughts on where I could publish this (perhaps a venue with a bias toward medical papers)? I was thinking about MICCAI, but NLP is not explicitly listed as a topic of interest, though I believe there are several MICCAI NLP papers. submitted by /u/newperson77777777 [link] [comments]  ( 62 min )
    [Discussion] [Research] How do you find ARR for *ACL conferences? Do you prefer it to direct submission?
    A new NLPer here. I am wondering if ARR increases the chance of acceptance. submitted by /u/Miserable_Coast [link] [comments]  ( 60 min )
  • Open

    Enriching real-time news streams with the Refinitiv Data Library, AWS services, and Amazon SageMaker
    This post is co-authored by Marios Skevofylakas, Jason Ramchandani and Haykaz Aramyan from Refinitiv, An LSEG Business. Financial service providers often need to identify relevant news, analyze it, extract insights, and take actions in real time, like trading specific instruments (such as commodities, shares, funds) based on additional information or context of the news item. […]  ( 10 min )
  • Open

    PPO Failed On Easiest Pursuit Task?
    Hello, I have literally spent my whole 2 weeks on the missile-target pursuit task. I implemented this paper's environment and decided to try the same goal with PPO. The article: https://arc.aiaa.org/doi/10.2514/1.I010970 In this task, the missile (the agent) creates lateral acceleration to intercept the target. The target does not create any acceleration and flies at a constant velocity. The environment is in 2-D. My States: - Continuous Values. My Action: Lateral Acceleration - Continuous Values - [-1,1]. My Rewards: {1000 if Relative Distance Target x-coordinates}, {1-sqrt(relative distance) if otherwise} I tried lots of things -> Lower Learning Rates, Adding Entropy Coefs, Different Batch Sizes, Different Gamma Values. If I run with the PNG control algorithm, the missile intercepts the target easily: PNG Controller And the max reward for this action series is "1170". I decided to drop this controller block and replace it with the StableBaselines-3 PPO algorithm. The results are like this: PPO Algorithm Outputs - 1 PPO Algorithm Outputs - 2 I really need some help with this problem. I don't know what I am doing wrong: the reward is not converging and periodically drops sharply. Thank You So Much For Your Help! submitted by /u/OpenToAdvices96 [link] [comments]  ( 56 min )
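    The PNG baseline mentioned in the post is presumably proportional navigation guidance. A minimal 2-D sketch of that control law is given below, assuming a constant-speed missile and a non-manoeuvring target. It is not the poster's environment or the paper's: the function name `simulate_pn`, the initial geometry, the speeds, the navigation gain of 3, and the Euler step size are all hypothetical values picked for illustration.

```python
import math


def simulate_pn(nav_gain=3.0, dt=0.001, max_steps=100_000):
    """Fly a constant-speed missile against a constant-velocity target in 2-D.

    Returns the smallest sampled missile-target range in metres.
    All numbers below are illustrative assumptions.
    """
    # Missile: position, speed, and heading angle (radians from the x-axis).
    mx, my, speed, heading = 0.0, 0.0, 300.0, 0.0
    # Target: position and constant velocity.
    tx, ty, tvx, tvy = 5000.0, 2000.0, -100.0, 0.0

    min_range = math.hypot(tx - mx, ty - my)
    for _ in range(max_steps):
        dx, dy = tx - mx, ty - my
        rng = math.hypot(dx, dy)
        min_range = min(min_range, rng)
        if rng < 1e-9:
            return 0.0  # direct hit on a sample point

        mvx, mvy = speed * math.cos(heading), speed * math.sin(heading)
        dvx, dvy = tvx - mvx, tvy - mvy  # relative velocity (target minus missile)

        closing_speed = -(dx * dvx + dy * dvy) / rng
        if closing_speed <= 0.0:
            break  # range is no longer shrinking: closest approach has passed

        # Line-of-sight rotation rate, positive counter-clockwise.
        los_rate = (dx * dvy - dy * dvx) / (rng * rng)
        # Proportional navigation: lateral acceleration = N * Vc * LOS rate.
        lateral_accel = nav_gain * closing_speed * los_rate

        # Lateral acceleration turns the velocity vector without changing speed.
        heading += lateral_accel / speed * dt
        mx += mvx * dt
        my += mvy * dt
        tx += tvx * dt
        ty += tvy * dt
    return min_range


if __name__ == "__main__":
    print("miss distance with PN:", simulate_pn())
    print("miss distance with no guidance:", simulate_pn(nav_gain=0.0))
```

    A hand-written controller like this can serve as a sanity check on the environment and on the reward scale before an RL agent replaces it.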
    Generative Meta-Learning vs Nevergrad (NGOpt4) on Schwefel-30
    Hello, Here is an open-source implementation for a comparison of both methods: https://github.com/kayuksel/genmeta-vs-nevergrad The best results obtained by both methods after 100K trials are as follows: gen-meta best_epoch: 99500 loss: 1.597656 time: 1.372849 ng-opt-4 best_epoch: 67590 loss: 476.789062 time: 63.584929 (the average ng-opt-4 loss after 10 repetitions was: 314.32099609375, and ng-opt-4 has been found to be the best optimizer in Nevergrad) I believe that the generative meta-learning method that I have proposed is a good alternative to black-box optimizers in nonconvex RL problems. It also easily scales to 100K+ dimensions, even on a desktop or laptop GPU, so it should be possible to train the neural network weights of RL agents. It can fit quite well where rewards can be calculated in parallel and the rewards of the individuals within the population depend on each other. It is also quite a good alternative for stochastic optimization (noisy rewards) due to the nature of the meta-learning in-place via deep generative models. I am sharing the code so that you can apply it to your own RL research; it would be amazing to see how it performs there. Happy new year all. Sincerely, Kamer submitted by /u/k_yuksel [link] [comments]  ( 61 min )
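    For context, the Schwefel benchmark referred to above is usually defined as in the sketch below. This is the standard textbook form, not code taken from the linked repository, and the function name `schwefel` is an assumption; the repository may scale or shift the function differently.

```python
import math


def schwefel(x):
    """Standard Schwefel function for a point x (a sequence of floats).

    Usually evaluated on the box [-500, 500]^d; the global minimum is close
    to zero and lies near x_i = 420.9687 in every coordinate.
    """
    d = len(x)
    return 418.9829 * d - sum(xi * math.sin(math.sqrt(abs(xi))) for xi in x)


if __name__ == "__main__":
    dim = 30  # "Schwefel-30" is taken here to mean the 30-dimensional problem
    print("value at the origin:", schwefel([0.0] * dim))
    print("value near the optimum:", schwefel([420.9687] * dim))
```

    The function has many local minima far from the global one, which is why it is a common stress test for black-box optimizers.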
    Is Stable Baselines 3 no longer compatible with PettingZoo?
    I am trying to implement a custom PettingZoo environment, and a shared policy with Stable Baselines 3. I am running into trouble with the action spaces not being compatible, since PettingZoo has started using gymnasium instead of gym. Does anyone know if these libraries no longer work together, and perhaps if there is a work-around? submitted by /u/Embarrassed-Print-13 [link] [comments]  ( 52 min )
    Policy for each of multi-agents in RL
    I would like to create a multi-agent environment in RL (using Stable Baselines 3), where every agent would have its own policy. They would interact in the same environment, each having its own state, but all sharing the same reward, which would be an effect of their combined actions. I did some research, but all I found were multiple agents training the same single policy, for instance in PettingZoo. My only idea now is to create one big environment in which several models (each representing one agent) are trained simultaneously, so that they learn to cooperate with each other. If you know of any methods, libraries or ideas that work in this direction, please let me know. Thanks submitted by /u/Apprehensive_Rush314 [link] [comments]  ( 55 min )
    "DreamerV3: Mastering Diverse Domains through World Models", Hafner et al 2023 {DM} (can collect Minecraft diamonds from scratch in 50 episodes/29m steps using 17 GPU-days; scales w/model-size to n=200m)
    submitted by /u/gwern [link] [comments]  ( 54 min )
  • Open

    Research Focus: Week of January 9, 2023
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. High-throughput ab initio reaction mechanism exploration in the cloud with automated multi-reference validation Jan P. Unsleber, Hongbin Liu, Leopold Talirz, Thomas Weymuth, Maximilian Mörchen, Adam Grofe, Dave […] The post Research Focus: Week of January 9, 2023 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Self-documenting software
    The electricity went out for a few hours recently, and because the power was out, the internet was out. I was trying to do a little work on my laptop, but I couldn’t do what I intended to do because I needed a network connection to access some documentation. I keep offline documentation for just […] Self-documenting software first appeared on John D. Cook.  ( 6 min )
  • Open

    3D Artist ‘CG Geek’ Builds Massive Sci-Fi World in Record Time This Week ‘In the NVIDIA Studio’
    3D and animation extraordinaire CG Geek completed an ambitious design challenge this week In the NVIDIA Studio — building a massive, sci-fi-inspired 3D world in only three days  ( 7 min )
  • Open

    Forecasting Potential Misuses of Language Models for Disinformation Campaigns—and How to Reduce Risk
    OpenAI researchers collaborated with Georgetown University’s Center for Security and Emerging Technology and the Stanford Internet Observatory to investigate how large language models might be misused for disinformation purposes. The collaboration included an October 2021 workshop bringing together 30 disinformation researchers, machine learning experts, and policy analysts, and  ( 5 min )
  • Open

    Program teaches US Air Force personnel the fundamentals of AI
    MIT researchers developed and studied a customized AI training program for users with varied backgrounds, which could be delivered across large organizations.  ( 11 min )
  • Open

    Constrained Langevin Algorithms with L-mixing External Random Variables. (arXiv:2205.14192v2 [cs.LG] UPDATED)
    Langevin algorithms are gradient descent methods augmented with additive noise, and are widely used in Markov Chain Monte Carlo (MCMC) sampling, optimization, and machine learning. In recent years, the non-asymptotic analysis of Langevin algorithms for non-convex learning has been extensively explored. For constrained problems with non-convex losses over a compact convex domain with IID data variables, the projected Langevin algorithm achieves a deviation of $O(T^{-1/4} (\log T)^{1/2})$ from its target distribution [27] in $1$-Wasserstein distance. In this paper, we obtain a deviation of $O(T^{-1/2} \log T)$ in $1$-Wasserstein distance for non-convex losses with $L$-mixing data variables and polyhedral constraints (which are not necessarily bounded). This improves on the previous bound for constrained problems and matches the best-known bound for unconstrained problems.
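    As a rough illustration of the kind of iteration the abstract analyses, a projected Langevin step takes a gradient step, adds Gaussian noise, and projects back onto the constraint set. The sketch below is a simplification under stated assumptions and is not the paper's algorithm: it uses an exact gradient with no external data variables, a toy double-well loss, a box as the polyhedral constraint, and hypothetical names and parameter values (`projected_langevin`, `eta`, `beta`).

```python
import math
import random


def grad_double_well(x):
    """Gradient of the toy non-convex loss f(x) = sum_i (x_i^2 - 0.25)^2."""
    return [4.0 * xi * (xi * xi - 0.25) for xi in x]


def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^d, a simple polyhedral set."""
    return [min(max(xi, lo), hi) for xi in x]


def projected_langevin(x0, steps=1000, eta=1e-2, beta=5.0, seed=0):
    """Run x <- Proj(x - eta * grad f(x) + sqrt(2 * eta / beta) * noise).

    eta is the step size and beta the inverse temperature; both values are
    illustrative. Returns the list of iterates after each projection.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * eta / beta)
    x = project_box(x0)
    samples = []
    for _ in range(steps):
        g = grad_double_well(x)
        # Gradient step plus isotropic Gaussian noise, coordinate by coordinate.
        x = [xi - eta * gi + noise_scale * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
        # Projection keeps every iterate feasible.
        x = project_box(x)
        samples.append(x)
    return samples


if __name__ == "__main__":
    chain = projected_langevin([0.9, -0.9], steps=2000)
    print("last iterate:", chain[-1])
```

    In the setting of the abstract, the gradient would additionally depend on a stream of dependent data variables; that dependence is omitted here for brevity.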
    A Semi-supervised Approach for Activity Recognition from Indoor Trajectory Data. (arXiv:2301.03134v1 [cs.LG])
    The increasingly wide usage of location-aware sensors has made it possible to collect large volumes of trajectory data in diverse application domains. Machine learning makes it possible to study the activities or behaviours of moving objects (e.g., people, vehicles, robots) using such trajectory data with rich spatiotemporal information to facilitate informed strategic and operational decision making. In this study, we consider the task of classifying the activities of moving objects from their noisy indoor trajectory data in a collaborative manufacturing environment. Activity recognition can help manufacturing companies to develop appropriate management policies, and optimise safety, productivity, and efficiency. We present a semi-supervised machine learning approach that first applies an information theoretic criterion to partition a long trajectory into a set of segments such that the object exhibits homogeneous behaviour within each segment. The segments are then labelled automatically based on a constrained hierarchical clustering method. Finally, a deep learning classification model based on convolutional neural networks is trained on trajectory segments and the generated pseudo labels. The proposed approach has been evaluated on a dataset containing indoor trajectories of multiple workers collected from a tricycle assembly workshop. The proposed approach is shown to achieve high classification accuracy (F-score varies between 0.81 and 0.95 for different trajectories) using only a small proportion of labelled trajectory segments.
    Batch Bayesian Optimization via Particle Gradient Flows. (arXiv:2209.04722v2 [stat.ML] UPDATED)
    Bayesian Optimisation (BO) methods seek to find global optima of objective functions which are only available as a black-box or are expensive to evaluate. Such methods construct a surrogate model for the objective function, quantifying the uncertainty in that surrogate through Bayesian inference. Objective evaluations are sequentially determined by maximising an acquisition function at each step. However, this ancillary optimisation problem can be highly non-trivial to solve, due to the non-convexity of the acquisition function, particularly in the case of batch Bayesian optimisation, where multiple points are selected in every step. In this work we reformulate batch BO as an optimisation problem over the space of probability measures. We construct a new acquisition function based on multipoint expected improvement which is convex over the space of probability measures. Practical schemes for solving this 'inner' optimisation problem arise naturally as gradient flows of this objective function. We demonstrate the efficacy of this new method on different benchmark functions and compare with state-of-the-art batch BO methods.
    Non-intrusive Water Usage Classification Considering Limited Training Data. (arXiv:2301.03457v1 [eess.SP])
    Smart metering of domestic water consumption to continuously monitor the usage of different appliances has been shown to have an impact on people's behavior towards water conservation. However, the installation of multiple sensors to monitor each appliance currently has a high initial cost and as a result, monitoring consumption from different appliances using sensors is not cost-effective. To address this challenge, studies have focused on analyzing measurements of the total domestic consumption using Machine Learning (ML) methods, to disaggregate water usage into each appliance. Identifying which appliances are in use through ML is challenging since their operation may be overlapping, while specific appliances may operate with intermittent flow, making individual consumption events hard to distinguish. Moreover, ML approaches require large amounts of labeled input data to train their models, which are typically not available for a single household, while usage characteristics may vary in different regions. In this work, we initially propose a data model that generates synthetic time series based on regional water usage characteristics and resolution to overcome the need for a large training dataset with real labeled data. The method requires a small number of real labeled data from the studied region. Following this, we propose a new algorithm for classifying single and overlapping household water usage events, using the total domestic consumption measurements.
    Non-inferiority of Deep Learning Acute Ischemic Stroke Segmentation on Non-Contrast CT Compared to Expert Neuroradiologists. (arXiv:2211.15341v2 [eess.IV] UPDATED)
    Purpose: To show a deep learning model that segments acute ischemic stroke on NCCT at a level comparable to neuroradiologists. Materials and Methods: This study included 227 head NCCT examinations from 200 patients enrolled in the multi-center DEFUSE 3 trial. Three experienced neuroradiologists independently segmented the acute infarct on each scan. The neuroradiologists were divided into training experts (A) and test experts (B and C). The dataset was randomly split, by patient, into 5 folds with training and validation cases. A 3D deep Convolutional Neural Network (CNN) architecture was trained and optimized to predict the segmentations of expert A from NCCT. The performance of the model was assessed using a set of volume, overlap, and distance metrics. The optimized model was compared to the test experts B and C. We used a one-sided Wilcoxon signed-rank test to test for the non-inferiority of the model-expert agreement compared to the inter-expert agreement. Results: The model-expert agreement was non-inferior to the inter-expert agreement as evaluated with a paired one-sided test procedure for differences in medians with lower boundaries of 10%, 2ml, and 5mm, p < 0.05, n=160. Conclusion: The 3D CNN trained on one neuroradiologist generalizes to acute ischemic stroke segmentation on NCCT of other neuroradiologists.
    Towards Less Constrained Macro-Neural Architecture Search. (arXiv:2203.05508v2 [cs.CV] UPDATED)
    Networks found with Neural Architecture Search (NAS) achieve state-of-the-art performance in a variety of tasks, outperforming human-designed networks. However, most NAS methods heavily rely on human-defined assumptions that constrain the search: architecture's outer-skeletons, number of layers, parameter heuristics and search spaces. Additionally, common search spaces consist of repeatable modules (cells) instead of fully exploring the architecture's search space by designing entire architectures (macro-search). Imposing such constraints requires deep human expertise and restricts the search to pre-defined settings. In this paper, we propose LCMNAS, a method that pushes NAS to less constrained search spaces by performing macro-search without relying on pre-defined heuristics or bounded search spaces. LCMNAS introduces three components for the NAS pipeline: i) a method that leverages information about well-known architectures to autonomously generate complex search spaces based on Weighted Directed Graphs with hidden properties, ii) an evolutionary search strategy that generates complete architectures from scratch, and iii) a mixed-performance estimation approach that combines information about architectures at initialization stage and lower fidelity estimates to infer their trainability and capacity to model complex functions. We present experiments on 13 different data sets showing that LCMNAS is capable of generating both cell and macro-based architectures with minimal GPU computation and state-of-the-art results. Moreover, we conduct extensive studies on the importance of different NAS components in both cell and macro-based settings. Code for reproducibility is public at https://github.com/VascoLopes/LCMNAS.
    Most Activation Functions Can Win the Lottery Without Excessive Depth. (arXiv:2205.02321v2 [cs.LG] UPDATED)
    The strong lottery ticket hypothesis has highlighted the potential for training deep neural networks by pruning, which has inspired interesting practical and theoretical insights into how neural networks can represent functions. For networks with ReLU activation functions, it has been proven that a target network with depth $L$ can be approximated by the subnetwork of a randomly initialized neural network that has double the target's depth $2L$ and is wider by a logarithmic factor. We show that a depth $L+1$ network is sufficient. This result indicates that we can expect to find lottery tickets at realistic, commonly used depths while only requiring logarithmic overparametrization. Our novel construction approach applies to a large class of activation functions and is not limited to ReLUs.
    pyKT: A Python Library to Benchmark Deep Learning based Knowledge Tracing Models. (arXiv:2206.11460v5 [cs.LG] UPDATED)
    Knowledge tracing (KT) is the task of using students' historical learning interaction data to model their knowledge mastery over time so as to make predictions on their future interaction performance. Recently, remarkable progress has been made in using various deep learning techniques to solve the KT problem. However, the success behind deep learning based knowledge tracing (DLKT) approaches is still left somewhat unknown and proper measurement and analysis of these DLKT approaches remain a challenge. First, data preprocessing procedures in existing works are often private and custom, which limits experimental standardization. Furthermore, existing DLKT studies often differ in terms of the evaluation protocol and are far from real-world educational contexts. To address these problems, we introduce a comprehensive Python-based benchmark platform, pyKT, to guarantee valid comparisons across DLKT methods via thorough evaluations. The pyKT library consists of a standardized set of integrated data preprocessing procedures on 7 popular datasets across different domains, and 10 frequently compared DLKT model implementations for transparent experiments. Results from our fine-grained and rigorous empirical KT studies yield a set of observations and suggestions for effective DLKT, e.g., a wrong evaluation setting may cause label leakage that generally leads to performance inflation; and the improvement of many DLKT approaches is minimal compared to the very first DLKT model proposed by Piech et al. We have open sourced pyKT and our experimental results at https://pykt.org/. We welcome contributions from other research groups and practitioners.
    Data-driven reduced order models using invariant foliations, manifolds and autoencoders. (arXiv:2206.12269v2 [math.DS] UPDATED)
    This paper explores how to identify a reduced order model (ROM) from a physical system. There are two distinct scenarios: the data collection and model identification either influence each other (closed-loop) or not (open-loop, off-line data). A ROM captures an invariant subset of the observed dynamics. We find that there are four ways a physical system can be related to a mathematical model: invariant foliations, invariant manifolds, autoencoders and equation-free models. Identification of invariant manifolds and equation-free models requires closed-loop manipulation of the system. Invariant foliations and autoencoders can also use off-line data. Only invariant foliations and invariant manifolds can identify ROMs, the rest identify complete models. Therefore, the common case of identifying a ROM from existing data can only be achieved using invariant foliations. Finding an invariant foliation requires approximating high-dimensional functions. For function approximation, we use polynomials with compressed tensor coefficients, whose complexity increases linearly with increasing dimensions. An invariant manifold can also be found as the fixed leaf of a foliation. This only requires us to resolve the foliation in a small neighbourhood of the invariant manifold, which greatly simplifies the process. Combining an invariant foliation with the corresponding invariant manifold provides an accurate ROM. We analyse the ROM in case of a focus type equilibrium, typical in mechanical systems. The nonlinear coordinate system defined by the invariant foliation or the invariant manifold distorts instantaneous frequencies and damping ratios, which we correct. Through examples we illustrate the calculation of invariant foliations and manifolds, and at the same time show that Koopman eigenfunctions and autoencoders fail to capture accurate ROMs under the same conditions.
    Expressing linear equality constraints in feedforward neural networks. (arXiv:2211.04395v2 [cs.LG] UPDATED)
    We seek to impose linear equality constraints in feedforward neural networks. As top-layer predictors are usually nonlinear, this is a difficult task if we seek to deploy standard convex optimization methods and strong duality. To overcome this, we introduce a new saddle-point Lagrangian with auxiliary predictor variables on which constraints are imposed. Elimination of the auxiliary variables leads to a dual minimization problem on the Lagrange multipliers introduced to satisfy the linear constraints. This minimization problem is combined with the standard learning problem on the weight matrices. From this theoretical line of development, we obtain the surprising interpretation of Lagrange parameters as additional, penultimate-layer hidden units with fixed weights stemming from the constraints. Consequently, standard minimization approaches can be used despite the inclusion of Lagrange parameters -- a very satisfying, albeit unexpected, discovery. Examples ranging from multi-label classification to constrained autoencoders are envisaged in the future. The code has been made available at https://github.com/anandrajan0/smartalec
    DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Generation Models. (arXiv:2210.06998v2 [cs.CR] UPDATED)
    Text-to-image generation models that generate images based on prompt descriptions have attracted an increasing amount of attention during the past few months. Despite their encouraging performance, these models raise concerns about the misuse of their generated fake images. To tackle this problem, we pioneer a systematic study on the detection and attribution of fake images generated by text-to-image generation models. Concretely, we first build a machine learning classifier to detect the fake images generated by various text-to-image generation models. We then attribute these fake images to their source models, such that model owners can be held responsible for their models' misuse. We further investigate how prompts that generate fake images affect detection and attribution. We conduct extensive experiments on four popular text-to-image generation models, including DALL·E 2, Stable Diffusion, GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical results show that (1) fake images generated by various models can be distinguished from real ones, as there exists a common artifact shared by fake images from different models; (2) fake images can be effectively attributed to their source models, as different models leave unique fingerprints in their generated images; (3) prompts with the "person" topic or a length between 25 and 75 enable models to generate fake images with higher authenticity. All findings contribute to the community's insight into the threats caused by text-to-image generation models. We call on the community to consider countermeasures, like ours, against rapidly evolving fake image generation.
    Rethinking Value Function Learning for Generalization in Reinforcement Learning. (arXiv:2210.09960v2 [cs.LG] UPDATED)
    Our work focuses on training RL agents on multiple visually diverse environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that a value network in the multi-environment setting is more challenging to optimize and prone to memorizing the training data than in the conventional single-environment setting. In addition, we find that appropriate regularization on the value network is necessary to improve both training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), a policy gradient algorithm that implicitly penalizes value estimates by optimizing the value network less frequently with more training data than the policy network. This can be implemented using a single unified network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency on the Procgen Benchmark.
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v2 [cs.LG] UPDATED)
    The ability to ensure that a classifier gives reliable confidence scores is essential to ensure informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.
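    As a rough illustration of the point that calibration alone is not enough, the toy sketch below builds a classifier that is perfectly calibrated yet has a non-zero grouping loss. It assumes the squared-error (Brier) form of the decomposition, with calibration loss E[(S - C)^2] and grouping loss E[(C - Q)^2], where S is the confidence score, Q the true posterior and C = E[Q | S]. The two-subgroup population and all variable names are hypothetical, and this is not the estimator proposed in the paper, which must work without access to Q.

```python
import numpy as np

# Hypothetical toy population: two equally likely subgroups whose true
# posterior P(Y=1 | X) is 0.2 and 0.8 respectively.
true_posterior = np.array([0.2, 0.8])
group_weight = np.array([0.5, 0.5])

# The classifier assigns the same confidence score to every sample.
score = np.array([0.5, 0.5])

# Calibrated score C = E[Q | S]. There is only one score level here,
# so C is the weighted mean of the true posteriors for both subgroups.
calibrated = np.full_like(score, np.sum(group_weight * true_posterior))

# Brier-form decomposition terms (an assumption of this sketch).
calibration_loss = np.sum(group_weight * (score - calibrated) ** 2)
grouping_loss = np.sum(group_weight * (calibrated - true_posterior) ** 2)

print("calibration loss:", calibration_loss)
print("grouping loss:", grouping_loss)
```

    In this toy case the calibration loss is zero while the grouping loss is about 0.09: the score 0.5 matches the average outcome frequency but is 0.3 away from the true posterior of every individual sample.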
    L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages. (arXiv:2211.11418v4 [cs.CL] UPDATED)
    The monolingual Hindi BERT models currently available on the model hub do not perform better than the multi-lingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on a Hindi monolingual corpus. Further, since the Indic languages Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT and DevBERT-based models show significant improvements over multi-lingual MuRIL, IndicBERT, and XLM-R. Based on these observations we also release monolingual BERT models for other Indic languages: Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi. These models are shared at https://huggingface.co/l3cube-pune.
    Fast Algorithm for Constrained Linear Inverse Problems. (arXiv:2212.01068v5 [math.OC] UPDATED)
    We consider the constrained Linear Inverse Problem (LIP), where a certain atomic norm (like the $\ell_1$ norm and the nuclear norm) is minimized subject to a quadratic constraint. Typically, such cost functions are non-differentiable, which makes them not amenable to the fast optimization methods existing in practice. We propose two equivalent reformulations of the constrained LIP with improved convex regularity: (i) a smooth convex minimization problem, and (ii) a strongly convex min-max problem. These problems could be solved by applying existing acceleration-based convex optimization methods, which provide a better $O \big( \frac{1}{k^2} \big)$ theoretical convergence guarantee. However, to fully exploit the utility of these reformulations, we also provide a novel algorithm, which we refer to as the Fast Linear Inverse Problem Solver (FLIPS), that is tailored to solve the reformulation of the LIP. We demonstrate the performance of FLIPS on the sparse coding problem arising in image processing tasks. In this setting, we observe that FLIPS consistently outperforms the Chambolle-Pock and C-SALSA algorithms--two of the current best methods in the literature.
    The mbsts package: Multivariate Bayesian Structural Time Series Models in R. (arXiv:2106.14045v2 [stat.ME] UPDATED)
    The multivariate Bayesian structural time series (MBSTS) model, a generalized version of many structural time series models, deals with inference and prediction for multiple correlated time series, where one also has the choice of using a different candidate pool of contemporaneous predictors for each target series. The MBSTS model has wide applications and is ideal for feature selection, time series forecasting, nowcasting, inferring causal impact, and others. This paper demonstrates how to use the R package mbsts for MBSTS modeling, establishing a bridge between user-friendly and developer-friendly functions in the package and the corresponding methodology. Object-oriented functions in the package are explained in a way that enables users to flexibly add or remove components, as well as to simplify or complicate some settings.
    Intelligence at the Extreme Edge: A Survey on Reformable TinyML. (arXiv:2204.00827v2 [cs.LG] UPDATED)
    Tiny Machine Learning (TinyML) is a fast-growing research field that proposes to democratize the use of Machine Learning and Deep Learning on highly energy-efficient frugal Microcontroller Units. Although TinyML is generally assumed to support only inference, growing interest in the domain has led to work that makes TinyML solutions reformable, i.e., solutions that permit models to improve once deployed. This work presents a survey on reformable TinyML solutions with the proposal of a novel taxonomy. Here, the suitability of each hierarchical layer for reformability is discussed. Furthermore, we explore the workflow of TinyML and analyze the identified deployment schemes, available tools and the scarcely available benchmarking tools. Finally, we discuss how reformable TinyML can impact a few selected industrial areas and discuss the challenges and future directions.
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v2 [q-bio.NC] UPDATED)
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.
    Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference. (arXiv:2207.11597v3 [cs.LG] UPDATED)
    We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound rather than a logarithmic lower bound, as shown by \cite{lattimore2017end}, in discrete (i.e., well-separated) action spaces. Furthermore, while the previous result is shown to hold only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue holding with high probability. We apply our result to two practical scenarios -- model selection and clustering in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent framework where we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.
    Benchmarking Graphormer on Large-Scale Molecular Modeling Datasets. (arXiv:2203.04810v2 [cs.LG] UPDATED)
    This technical note describes the recent updates of Graphormer, including architecture design modifications and the adaptation to 3D molecular dynamics simulation. With these simple modifications, Graphormer attains better results on large-scale molecular modeling datasets than the vanilla one, and the performance gain is obtained consistently on 2D and 3D molecular graph modeling tasks. In addition, we show that with a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs. Empirically, Graphormer achieves a much lower MAE than the originally reported results on the PCQM4M quantum chemistry dataset used in KDD Cup 2021. Meanwhile, it greatly outperforms the competitors in the recent Open Catalyst Challenge, a competition track at a NeurIPS 2021 workshop that aims to model the catalyst-adsorbate reaction system with advanced AI models. All code can be found at https://github.com/Microsoft/Graphormer.
    Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification. (arXiv:2205.13573v3 [cs.LG] UPDATED)
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to the high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called Spar-GW, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. The proposed Spar-GW method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $O(n^4)$ to $O(n^{2+\delta})$ for an arbitrarily small $\delta>0$. Theoretically, the convergence and consistency of the proposed estimation for GW distance are established under mild regularity conditions. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our Spar-GW to state-of-the-art methods in both synthetic and real-world tasks.
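    To make the $O(n^4)$ cost mentioned above concrete, here is a minimal sketch of the dense GW objective for a fixed coupling matrix. It assumes the squared difference as ground cost (the abstract allows an arbitrary one), and the function name `gw_objective` and the example points are hypothetical. It only evaluates the objective; it is not the Spar-GW sampling scheme, which avoids forming this dense four-index tensor.

```python
import numpy as np

def gw_objective(dx, dy, coupling):
    """Dense GW objective with squared ground cost (an assumption here):
    sum over i, j, k, l of (dx[i, k] - dy[j, l])**2 * T[i, j] * T[k, l]."""
    # Four-index tensor with axes (i, j, k, l); this is the O(n^4) part.
    diff = dx[:, None, :, None] - dy[None, :, None, :]
    return np.einsum("ijkl,ij,kl->", diff ** 2, coupling, coupling)

# Hypothetical example: a point cloud compared with itself.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

identity_coupling = np.eye(3) / 3.0        # matches each point to itself
uniform_coupling = np.full((3, 3), 1 / 9)  # spreads mass over all pairs

cost_identity = gw_objective(dist, dist, identity_coupling)
cost_uniform = gw_objective(dist, dist, uniform_coupling)
print(cost_identity, cost_uniform)
```

    The identity coupling gives zero cost because every pairwise distance is matched with itself, while the uniform coupling gives a strictly positive cost. Both couplings have the same uniform marginals, so they are comparable candidates for the same matching problem.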
    GSR: A Generalized Symbolic Regression Approach. (arXiv:2205.15569v2 [cs.LG] UPDATED)
    Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes, SR attempts to gain insight into the underlying relationships between the independent variables and the target variable of a given dataset by assembling analytical functions. In this paper, we present GSR, a Generalized Symbolic Regression approach, by modifying the conventional SR optimization problem formulation, while keeping the main SR objective intact. In GSR, we infer mathematical relationships between the independent variables and some transformation of the target variable. We constrain our search space to a weighted sum of basis functions, and propose a genetic programming approach with a matrix-based encoding scheme. We show that our GSR method is competitive with strong SR benchmark methods, achieving promising experimental performance on the well-known SR benchmark problem sets. Finally, we highlight the strengths of GSR by introducing SymSet, a new SR benchmark set which is more challenging relative to the existing benchmarks.
    A General Framework for Auditing Differentially Private Machine Learning. (arXiv:2210.08643v2 [cs.LG] UPDATED)
    We present a framework to statistically audit the privacy guarantee conferred by a differentially private machine learner in practice. While previous works have taken steps toward evaluating privacy loss through poisoning attacks or membership inference, they have been tailored to specific models or have demonstrated low statistical power. Our work develops a general methodology to empirically evaluate the privacy of differentially private machine learning implementations, combining improved privacy search and verification methods with a toolkit of influence-based poisoning attacks. We demonstrate significantly improved auditing power over previous approaches on a variety of models including logistic regression, Naive Bayes, and random forest. Our method can be used to detect privacy violations due to implementation errors or misuse. When violations are not present, it can aid in understanding the amount of information that can be leaked from a given dataset, algorithm, and privacy specification.
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v3 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a "simulator", we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while still maintaining the same order with respect to $T$.
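    The replacement of greedy selection by a soft-max policy can be illustrated with a small sketch. The function name `softmax_policy`, the `temperature` parameter and the example values are hypothetical, and the sketch takes plain state-action values as input; how the paper combines reward and utility values through the primal-dual variables is not reproduced here.

```python
import numpy as np

def softmax_policy(q_values, temperature):
    """Action probabilities proportional to exp(Q / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical state-action values for three actions in one state.
q_values = [1.0, 2.0, 0.5]

smooth = softmax_policy(q_values, temperature=1.0)   # spreads probability
sharp = softmax_policy(q_values, temperature=0.01)   # close to greedy
greedy_action = int(np.argmax(q_values))
print(smooth, sharp, greedy_action)
```

    A small temperature approximates the greedy choice closely, while a larger one makes the policy vary more smoothly with the value estimates. This is one way to read the approximation-smoothness trade-off mentioned in the abstract.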
    Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. (arXiv:2208.11580v2 [cs.LG] UPDATED)
    We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [LeCun, Denker, and Solla, 1990] extended to also cover weight quantization at the scale of modern DNNs. From the practical perspective, our experimental results show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods, and that it can enable the accurate compound application of both pruning and quantization in a post-training setting.
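    For readers unfamiliar with the classical OBS step that this framework builds on, the sketch below prunes a single weight under a local quadratic loss model with Hessian H. It uses the standard OBS formulas: the saliency of weight q is w_q^2 / (2 [H^{-1}]_{qq}), and the compensating update is w - (w_q / [H^{-1}]_{qq}) H^{-1} e_q. The function name `obs_prune_one` and the 2x2 example are hypothetical; the paper's efficient layer-wise realization and its extension to quantization are not shown.

```python
import numpy as np

def obs_prune_one(weights, hessian_inv):
    """Prune the weight with the smallest OBS saliency and adjust the rest."""
    diag = np.diag(hessian_inv)
    # Predicted loss increase from removing each weight individually.
    saliency = weights ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    # Compensating update of all weights; it drives weights[q] to zero.
    updated = weights - (weights[q] / diag[q]) * hessian_inv[:, q]
    updated[q] = 0.0  # zero in exact arithmetic; set explicitly against round-off
    return updated, q, saliency[q]

# Hypothetical quadratic loss model with a symmetric positive definite Hessian.
hessian = np.array([[2.0, 1.0], [1.0, 3.0]])
weights = np.array([1.0, 0.1])

new_weights, pruned_index, predicted_increase = obs_prune_one(
    weights, np.linalg.inv(hessian)
)

# Actual increase of the quadratic model: 0.5 * dw^T H dw.
delta = new_weights - weights
actual_increase = 0.5 * delta @ hessian @ delta
print(new_weights, pruned_index, predicted_increase, actual_increase)
```

    In this example the second weight is pruned, and the predicted and actual loss increases agree, as expected when the loss is exactly quadratic.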
    Representative Image Feature Extraction via Contrastive Learning Pretraining for Chest X-ray Report Generation. (arXiv:2209.01604v2 [cs.CV] UPDATED)
    Medical report generation is a challenging task since it is time-consuming and requires expertise from experienced radiologists. The goal of medical report generation is to accurately capture and describe the image findings. Previous works pretrain their visual encoding neural networks with large datasets in different domains, which cannot learn general visual representation in the specific medical domain. In this work, we propose a medical report generation framework that uses a contrastive learning approach to pretrain the visual encoder and requires no additional meta information. In addition, we adopt lung segmentation as an augmentation method in the contrastive learning framework. This segmentation guides the network to focus on encoding the visual feature within the lung region. Experimental results show that the proposed framework improves the performance and the quality of the generated medical reports both quantitatively and qualitatively.
    Accelerating Transfer Learning with Near-Data Computation on Cloud Object Stores. (arXiv:2210.08650v2 [cs.LG] UPDATED)
    Near-data computation techniques have been successfully deployed to mitigate the cloud network bottleneck between the storage and compute tiers. At Huawei, we are currently looking to get more value from these techniques by broadening their applicability. Machine learning (ML) applications are an appealing and timely target. This paper describes our experience applying near-data computation techniques to transfer learning (TL), a widely popular ML technique, in the context of disaggregated cloud object stores. Our techniques benefit both cloud providers and users. They improve our operational efficiency while providing users the performance improvements they demand from us. The main practical challenge to consider is that the storage-side computational resources are limited. Our approach is to split the TL deep neural network (DNN) during the feature extraction phase, before the training phase. This reduces the network transfers to the compute tier and further decouples the batch size of feature extraction from the training batch size. This facilitates our second technique, storage-side batch adaptation, which enables increased concurrency in the storage tier while avoiding out-of-memory errors. Guided by these insights, we present HAPI, our processing system for TL that spans the compute and storage tiers while remaining transparent to the user. Our evaluation with several state-of-the-art DNNs, such as ResNet, VGG, and Transformer, shows up to 11x improvement in application runtime and up to 8.3x reduction in the data transferred from the storage to the compute tier compared to running the computation entirely in the compute tier.
    Exoplanet atmosphere evolution: emulation with neural networks. (arXiv:2110.15162v3 [astro-ph.EP] UPDATED)
    Atmospheric mass-loss is known to play a leading role in sculpting the demographics of small, close-in exoplanets. Knowledge of how such planets evolve allows one to "rewind the clock" to infer the conditions in which they formed. Here, we explore the relationship between a planet's core mass and its atmospheric mass after protoplanetary disc dispersal by exploiting XUV photoevaporation as an evolutionary process. Historically, this style of inference problem would be computationally infeasible due to the large number of planet models required; however, we make use of a novel atmospheric evolution emulator which utilises neural networks to provide three orders of magnitude in speedup. First, we provide proof-of-concept for this emulator on a real problem, by inferring the initial atmospheric conditions of the TOI-270 multi-planet system. Using the emulator we find near-indistinguishable results when compared to the original model. We then apply the emulator to the more complex inference problem, which aims to find the initial conditions for a sample of Kepler, K2 and TESS planets with well-constrained masses and radii. We demonstrate there is a relationship between core masses and the atmospheric mass that they retain after disc dispersal, and this trend is consistent with the "boil-off" scenario, in which close-in planets undergo dramatic atmospheric escape during disc dispersal. Thus, it appears the exoplanet population is consistent with the idea that close-in exoplanets initially acquired large massive atmospheres, the majority of which are lost during disc dispersal, before the final population is sculpted by atmospheric loss over 100 Myr to Gyr timescales.
    GARNET: Reduced-Rank Topology Learning for Robust and Scalable Graph Neural Networks. (arXiv:2201.12741v6 [cs.LG] UPDATED)
    Graph neural networks (GNNs) have been increasingly deployed in various applications that involve learning on non-Euclidean data. However, recent studies show that GNNs are vulnerable to graph adversarial attacks. Although there are several defense methods to improve GNN robustness by eliminating adversarial components, they may also impair the underlying clean graph structure that contributes to GNN training. In addition, few of those defense models can scale to large graphs due to their high computational complexity and memory usage. In this paper, we propose GARNET, a scalable spectral method to boost the adversarial robustness of GNN models. GARNET first leverages weighted spectral embedding to construct a base graph, which is not only resistant to adversarial attacks but also contains critical (clean) graph structure for GNN training. Next, GARNET further refines the base graph by pruning additional uncritical edges based on a probabilistic graphical model. GARNET has been evaluated on various datasets, including a large graph with millions of nodes. Our extensive experimental results show that GARNET achieves adversarial accuracy improvement and runtime speedup over state-of-the-art GNN (defense) models by up to 13.27% and 14.7x, respectively.
    Verifying Learning-Based Robotic Navigation Systems. (arXiv:2205.13536v2 [cs.RO] UPDATED)
    Deep reinforcement learning (DRL) has become a dominant deep-learning paradigm for tasks where complex policies are learned within reactive systems. Unfortunately, these policies are known to be susceptible to bugs. Despite significant progress in DNN verification, there has been little work demonstrating the use of modern verification tools on real-world, DRL-controlled systems. In this case study, we attempt to begin bridging this gap, and focus on the important task of mapless robotic navigation -- a classic robotics problem, in which a robot, usually controlled by a DRL agent, needs to efficiently and safely navigate through an unknown arena towards a target. We demonstrate how modern verification engines can be used for effective model selection, i.e., selecting the best available policy for the robot in question from a pool of candidate policies. Specifically, we use verification to detect and rule out policies that may demonstrate suboptimal behavior, such as collisions and infinite loops. We also apply verification to identify models with overly conservative behavior, thus allowing users to choose superior policies, which might be better at finding shorter paths to a target. To validate our work, we conducted extensive experiments on an actual robot, and confirmed that the suboptimal policies detected by our method were indeed flawed. We also demonstrate the superiority of our verification-driven approach over state-of-the-art gradient attacks. Our work is the first to establish the usefulness of DNN verification in identifying and filtering out suboptimal DRL policies in real-world robots, and we believe that the methods presented here are applicable to a wide range of systems that incorporate deep-learning-based agents.
    OpenCon: Open-world Contrastive Learning. (arXiv:2208.02764v2 [cs.LG] UPDATED)
    Machine learning models deployed in the wild naturally encounter unlabeled samples from both known and novel classes. Challenges arise in learning from both the labeled and unlabeled data, in an open-world semi-supervised manner. In this paper, we introduce a new learning framework, open-world contrastive learning (OpenCon). OpenCon tackles the challenges of learning compact representations for both known and novel classes and facilitates novelty discovery along the way. We demonstrate the effectiveness of OpenCon on challenging benchmark datasets and establish competitive performance. On the ImageNet dataset, OpenCon significantly outperforms the current best method by 11.9% and 7.4% on novel and overall classification accuracy, respectively. Theoretically, OpenCon can be rigorously interpreted from an EM algorithm perspective--minimizing our contrastive loss partially maximizes the likelihood by clustering similar samples in the embedding space. The code is available at https://github.com/deeplearning-wisc/opencon.
    Unraveling the graph structure of tabular data through Bayesian and spectral analysis. (arXiv:2110.01421v2 [cs.LG] UPDATED)
    In the big-data age, tabular data are being generated and analyzed everywhere. As a consequence, finding and understanding the relationships between the features in these data are of great relevance. Here, to encompass these relationships, we propose a graph-based method that allows individual, group and multi-scale analyses. The method starts by mapping the tabular data into a weighted directed graph using the Shapley additive explanations technique. With this graph of relationships, we show that the inference of the hierarchical modular structure obtained by the Nested Stochastic Block Model (nSBM) as well as the study of the spectral space of the magnetic Laplacian can help us identify the classes of features and unravel non-trivial relationships. As a case study, we analyzed a socioeconomic survey conducted with students in Brazil: the PeNSE survey. The spectral embedding of the columns suggested that questions related to physical activities form a separate group. The application of the nSBM approach not only corroborated this but also allowed complementary findings about the modular structure: some groups of questions showed a high adherence to the divisions qualitatively defined by the designers of the survey. As opposed to the structure obtained by the spectrum, questions from the class Safety were partly grouped by our method in the class Drugs. Surprisingly, by inspecting these questions, we observed that they were related to both these topics, suggesting an alternative interpretation of these questions. These results show how our method can provide guidance for tabular data analysis as well as the design of future surveys.
    Investigations on convergence behaviour of Physics Informed Neural Networks across spectral ranges and derivative orders. (arXiv:2301.02790v1 [cs.LG])
    An important inference from Neural Tangent Kernel (NTK) theory is the existence of spectral bias (SB), that is, low frequency components of the target function of a fully connected Artificial Neural Network (ANN) being learnt significantly faster than the higher frequencies during training. This is established for Mean Square Error (MSE) loss functions with very low learning rate parameters. Physics Informed Neural Networks (PINNs) are designed to learn the solutions of differential equations (DE) of arbitrary orders; in PINNs the loss functions are obtained as the residues of the conservative form of the DEs and represent the degree of dissatisfaction of the equations. So there has been an open question whether (a) PINNs also exhibit SB and (b) if so, how does this bias vary across the orders of the DEs. In this work, a series of numerical experiments are conducted on simple sinusoidal functions of varying frequencies, compositions and equation orders to investigate these issues. It is firmly established that under normalized conditions, PINNs do exhibit strong spectral bias, and this increases with the order of the differential equation.
    Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling. (arXiv:2301.03580v1 [cs.CV])
    We identify and overcome two key obstacles in extending the success of BERT-style pre-training, or the masked image modeling, to convolutional networks (convnets): (i) convolution operation cannot handle irregular, random-masked input images; (ii) the single-scale nature of BERT pre-training is inconsistent with convnet's hierarchical structure. For (i), we treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode. This is the first use of sparse convolution for 2D masked modeling. For (ii), we develop a hierarchical decoder to reconstruct images from multi-scale encoded features. Our method called Sparse masKed modeling (SparK) is general: it can be used directly on any convolutional model without backbone modifications. We validate it on both classical (ResNet) and modern (ConvNeXt) models: on three downstream tasks, it surpasses both state-of-the-art contrastive learning and transformer-based masked modeling by similarly large margins (around +1.0%). Improvements on object detection and instance segmentation are more substantial (up to +3.5%), verifying the strong transferability of features learned. We also find its favorable scaling behavior by observing more gains on larger models. All this evidence reveals a promising future of generative pre-training on convnets. Codes and models are released at https://github.com/keyu-tian/SparK.
    Exponential Family Model-Based Reinforcement Learning via Score Matching. (arXiv:2112.14195v2 [cs.LG] UPDATED)
    We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters).
    Neuromorphic Wireless Cognition: Event-Driven Semantic Communications for Remote Inference. (arXiv:2206.06047v2 [cs.IT] UPDATED)
    Neuromorphic computing is an emerging computing paradigm that moves away from batched processing towards the online, event-driven processing of streaming data. Neuromorphic chips, when coupled with spike-based sensors, can inherently adapt to the "semantics" of the data distribution by consuming energy only when relevant events are recorded in the timing of spikes and by providing a low-latency response to changing conditions in the environment. This paper proposes an end-to-end design for a neuromorphic wireless Internet-of-Things system that integrates spike-based sensing, processing, and communication. In the proposed NeuroComm system, each sensing device is equipped with a neuromorphic sensor, a spiking neural network (SNN), and an impulse radio transmitter with multiple antennas. Transmission takes place over a shared fading channel to a receiver equipped with a multi-antenna impulse radio receiver and with an SNN. In order to enable adaptation of the receiver to the fading channel conditions, we introduce a hypernetwork to control the weights of the decoding SNN using pilots. Pilots, encoding SNNs, decoding SNN, and hypernetwork are jointly trained across multiple channel realizations. The proposed system is shown to significantly improve over conventional frame-based digital solutions, as well as over alternative non-adaptive training methods, in terms of time-to-accuracy and energy consumption metrics.
    IAN: Iterated Adaptive Neighborhoods for manifold learning and dimensionality estimation. (arXiv:2208.09123v3 [cs.LG] UPDATED)
    Invoking the manifold assumption in machine learning requires knowledge of the manifold's geometry and dimension, and theory dictates how many samples are required. However, in applications data are limited, sampling may not be uniform, and manifold properties are unknown and (possibly) non-pure; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting with a locally-conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. In each step, a linear program yields minimal neighborhoods globally and a volumetric statistic reveals neighbor outliers likely to violate manifold geometry. We apply our adaptive neighborhoods to non-linear dimensionality reduction, geodesic computation and dimension estimation. A comparison against standard algorithms using, e.g., k-nearest neighbors, demonstrates their usefulness. Code for our algorithm will be available at https://github.com/dyballa/IAN
    Reservoir Prediction by Machine Learning Methods on The Well Data and Seismic Attributes for Complex Coastal Conditions. (arXiv:2301.03216v1 [physics.geo-ph])
    The aim of this work was to predict the probability of the spread of rock formations with hydrocarbon-collecting properties in the studied coastal area using a stack of machine learning algorithms and data augmentation and modification methods. This research develops the direction of machine learning where training is conducted on well data and spatial attributes. Two methods for overcoming the limitations of this direction are shown, based on augmentation and modification of the well data sample: Spindle and Revers-Calibration. Considering the difficulties of seismic data interpretation in coastal area conditions, the proposed approach is a tool that is able to work with the totality of geological and geophysical data, extract knowledge from a 159-dimensional space of spatial attributes, and predict facies spreading with acceptable quality: an F1 measure of 0.798 on average for the reservoir class when evaluating "drilling" results under different geological conditions. It was shown that consistent application of the proposed augmentation methods in the implemented technology stack improves the quality of reservoir prediction by a factor of 1.56 relative to the original dataset.
    Topologically Regularized Data Embeddings. (arXiv:2301.03338v1 [cs.LG])
    Unsupervised representation learning methods are widely used for gaining insight into high-dimensional, unstructured, or structured data. In some cases, users may have prior topological knowledge about the data, such as a known cluster structure or the fact that the data is known to lie along a tree- or graph-structured topology. However, generic methods to ensure such structure is salient in the low-dimensional representations are lacking. This negatively impacts the interpretability of low-dimensional embeddings, and plausibly downstream learning tasks. To address this issue, we introduce topological regularization: a generic approach based on algebraic topology to incorporate topological prior knowledge into low-dimensional embeddings. We introduce a class of topological loss functions, and show that jointly optimizing an embedding loss with such a topological loss function as a regularizer yields embeddings that reflect not only local proximities but also the desired topological structure. We include a self-contained overview of the required foundational concepts in algebraic topology, and provide intuitive guidance on how to design topological loss functions for a variety of shapes, such as clusters, cycles, and bifurcations. We empirically evaluate the proposed approach on computational efficiency, robustness, and versatility in combination with linear and non-linear dimensionality reduction and graph embedding methods.
    UB3: Best Beam Identification in Millimeter Wave Systems via Pure Exploration Unimodal Bandits. (arXiv:2301.03456v1 [eess.SP])
    Millimeter wave (mmWave) communications have a broad spectrum and can support data rates on the order of gigabits per second, as envisioned in 5G systems. However, they cannot be used for long distances due to their sensitivity to attenuation loss. Enabling their use in the 5G network requires that the transmission energy be focused in sharp pencil beams. As any misalignment between the transmitter and receiver beam pair can reduce the data rate significantly, it is important that they are aligned as much as possible. To find the best transmit-receive beam pair, recent beam alignment (BA) techniques examine the entire beam space, which might result in a large amount of BA latency. Recent works propose to adaptively select the beams such that the cumulative reward measured in terms of received signal strength or throughput is maximized. In this paper, we develop an algorithm that exploits the unimodal structure of the received signal strengths of the beams to identify the best beam in a finite time using pure exploration strategies. Strategies that identify the best beam in a fixed time slot are more suitable for wireless network protocol design than cumulative reward maximization strategies that continuously perform exploration and exploitation. Our algorithm is named Unimodal Bandit for Best Beam (UB3) and identifies the best beam with a high probability in a few rounds. We prove that the error exponent in the probability does not depend on the number of beams and show that this is indeed the case by establishing a lower bound for the unimodal bandits. We demonstrate that UB3 outperforms the state-of-the-art algorithms through extensive simulations. Moreover, our algorithm is simple to implement and has lower computational complexity.
    A Domain-Theoretic Framework for Robustness Analysis of Neural Networks. (arXiv:2203.00295v3 [cs.LG] UPDATED)
    A domain-theoretic framework is presented for validated robustness analysis of neural networks. First, global robustness of a general class of networks is analyzed. Then, using the fact that Edalat's domain-theoretic L-derivative coincides with Clarke's generalized gradient, the framework is extended for attack-agnostic local robustness analysis. The proposed framework is ideal for designing algorithms which are correct by construction. This claim is exemplified by developing a validated algorithm for estimation of Lipschitz constant of feedforward regressors. The completeness of the algorithm is proved over differentiable networks, and also over general position ReLU networks. Computability results are obtained within the framework of effectively given domains. Using the proposed domain model, differentiable and non-differentiable networks can be analyzed uniformly. The validated algorithm is implemented using arbitrary-precision interval arithmetic, and the results of some experiments are presented. The software implementation is truly validated, as it handles floating-point errors as well.
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v3 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).
    Reinforcement Learning for Joint Optimization of Multiple Rewards. (arXiv:1909.02940v4 [cs.LG] UPDATED)
    Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aims to optimize some function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using fairness in cellular base-station scheduling and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.
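    The loss of the Markov property mentioned in this abstract can be seen in a toy example. The sketch below is not the paper's algorithm: the single-state two-user scheduling problem, the `log(1 + total)` fairness objective, and the function names are all assumptions chosen for illustration. The environment state never changes, yet the best next action depends on the rewards accumulated so far.

```python
import math


def fairness(totals):
    # Hypothetical concave objective: sum of log(1 + accumulated reward) per user.
    return sum(math.log(1.0 + t) for t in totals)


def best_next_action(totals):
    """Return the index of the user whose service gives the largest fairness gain.

    Serving user k adds one unit of reward to that user's running total.
    """
    gains = []
    for k in range(len(totals)):
        updated = list(totals)
        updated[k] += 1.0
        gains.append(fairness(updated) - fairness(totals))
    # max() returns the first index on ties.
    return max(range(len(totals)), key=lambda k: gains[k])


# Same environment state, different reward histories, different best actions.
print(best_next_action((3.0, 0.0)))
print(best_next_action((0.0, 3.0)))
```

    With history (3, 0) the larger gain comes from serving the second user, and with history (0, 3) from serving the first, so a policy that looks only at the current state cannot be optimal for this objective.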
    Hierarchical Federated Learning with Quantization: Convergence Analysis and System Design. (arXiv:2103.14272v2 [cs.LG] UPDATED)
    Federated learning (FL) is a powerful distributed machine learning framework where a server aggregates models trained by different clients without accessing their private data. Hierarchical FL, with a client-edge-cloud aggregation hierarchy, can effectively leverage both the cloud server's access to many clients' data and the edge servers' closeness to the clients to achieve a high communication efficiency. Neural network quantization can further reduce the communication overhead during model uploading. To fully exploit the advantages of hierarchical FL, an accurate convergence analysis with respect to the key system parameters is needed. Unfortunately, existing analysis is loose and does not consider model quantization. In this paper, we derive a tighter convergence bound for hierarchical FL with quantization. The convergence result leads to practical guidelines for important design problems such as the client-edge aggregation and edge-client association strategies. Based on the obtained analytical results, we optimize the two aggregation intervals and show that the client-edge aggregation interval should slowly decay while the edge-cloud aggregation interval needs to adapt to the ratio of the client-edge and edge-cloud propagation delay. Simulation results verify the design guidelines and demonstrate the effectiveness of the proposed aggregation strategy.
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v3 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Matérn covariance kernel. The smoothness parameter of a Matérn kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0 + d/2$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation finds the "correct" smoothness for a class of compactly supported self-similar functions. We also consider cross-validation and prove an asymptotic lower bound $\nu_0$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.
    Self-mentoring: a new deep learning pipeline to train a self-supervised U-net for few-shot learning of bio-artificial capsule segmentation. (arXiv:2205.10840v3 [cs.CV] UPDATED)
    Background: Accurate segmentation of microscopic structures such as bio-artificial capsules in microscopy imaging is a prerequisite to the computer-aided understanding of important biomechanical phenomena. State-of-the-art segmentation performance is achieved by deep neural networks and related data-driven approaches. Training these networks from only a few annotated examples is challenging, while producing manually annotated images that provide supervision is tedious. Method: Recently, self-supervision, i.e. designing a neural pipeline providing synthetic or indirect supervision, has proved to significantly increase the generalization performance of models trained on few shots. The objective of this paper is to introduce one such neural pipeline in the context of micro-capsule image segmentation. Our method leverages the rather simple content of these images so that a trainee network can be mentored by a referee network which has been previously trained on synthetically generated pairs of corrupted/correct region masks. Results: Challenging experimental setups are investigated. They involve only 3 to 10 annotated images along with moderately large amounts of unannotated images. In a bio-artificial capsule dataset, our approach consistently and drastically improves accuracy. We also show that the learnt referee network is transferable to another Glioblastoma cell dataset and that it can be efficiently coupled with data augmentation strategies. Conclusions: Experimental results show that very significant accuracy increments are obtained by the proposed pipeline, leading to the conclusion that the self-supervision mechanism introduced in this paper has the potential to replace human annotations.
    Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction. (arXiv:2301.03573v1 [cs.LG])
    Despite impressive performance on a wide variety of tasks, deep neural networks require significant memory and computation costs, prohibiting their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these costs; however, the sparsity constraints add difficulty to the optimization, resulting in an increase in training time and instability. In this work, we aim to overcome this problem and achieve space-time co-efficiency. To accelerate and stabilize the convergence of sparse training, we analyze the gradient changes and develop an adaptive gradient correction method. Specifically, we approximate the correlation between the current and previous gradients, which is used to balance the two gradients to obtain a corrected gradient. Our method can be used with most popular sparse training pipelines under both standard and adversarial setups. Theoretically, we prove that our method can accelerate the convergence rate of sparse training. Extensive experiments on multiple datasets, model architectures, and sparsities demonstrate that our method outperforms leading sparse training methods by up to 5.0% in accuracy given the same number of training epochs, and reduces the number of training epochs by up to 52.1% to achieve the same accuracy.
    A Comprehensive Taxonomy for Explainable Artificial Intelligence: A Systematic Survey of Surveys on Methods and Concepts. (arXiv:2105.07190v4 [cs.LG] UPDATED)
    In the meantime, a wide variety of terminologies, motivations, approaches, and evaluation criteria have been developed within the research field of explainable artificial intelligence (XAI). With the number of XAI methods vastly growing, a taxonomy of methods is needed by researchers as well as practitioners: to grasp the breadth of the topic, to compare methods, and to select the right XAI method based on traits required by a specific use-case context. Many taxonomies for XAI methods of varying level of detail and depth can be found in the literature. While they often have a different focus, they also exhibit many points of overlap. This paper unifies these efforts and provides a complete taxonomy of XAI methods with respect to notions present in the current state of research. In a structured literature analysis and meta-study, we identified and reviewed more than 50 of the most cited and current surveys on XAI methods, metrics, and method traits. After summarizing them in a survey of surveys, we merge terminologies and concepts of the articles into a unified structured taxonomy. Single concepts therein are illustrated by more than 50 diverse selected example methods in total, which we categorize accordingly. The taxonomy may serve beginners, researchers, and practitioners alike as a reference and wide-ranging overview of XAI method traits and aspects. Hence, it provides foundations for targeted, use-case-oriented, and context-sensitive future research.
    Robust Feature-Level Adversaries are Interpretability Tools. (arXiv:2110.03605v6 [cs.LG] UPDATED)
    The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
    SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain. (arXiv:2301.02966v1 [cs.CL])
    This paper introduces SpeeChain, an open-source PyTorch-based toolkit designed to develop the machine speech chain for large-scale use. This first release focuses on the TTS-to-ASR chain, a core component of the machine speech chain, which refers to TTS data augmentation by unspoken text for ASR. To build an efficient pipeline for the large-scale TTS-to-ASR chain, we implement easy-to-use multi-GPU batch-level model inference, multi-dataloader batch generation, and on-the-fly data selection techniques. In this paper, we first explain the overall procedure of the TTS-to-ASR chain and the difficulties of each step. Then, we present a detailed ablation study on different types of unlabeled data, data filtering thresholds, batch composition, and real-synthetic data ratios. Our experimental results on train_clean_460 of LibriSpeech demonstrate that our TTS-to-ASR chain can significantly improve WER in a semi-supervised setting.
    Nuclear Segmentation and Classification: On Color & Compression Generalization. (arXiv:2301.03418v1 [eess.IV])
    Since the introduction of digital and computational pathology as a field, one of the major problems in the clinical application of algorithms has been the struggle to generalize well to examples outside the distribution of the training data. Existing work to address this in both pathology and natural images has focused almost exclusively on classification tasks. We explore and evaluate the robustness of the 7 best performing nuclear segmentation and classification models from the largest computational pathology challenge for this problem to date, the CoNIC challenge. We demonstrate that existing state-of-the-art (SoTA) models are robust towards compression artifacts but suffer substantial performance reduction when subjected to shifts in the color domain. We find that using stain normalization to address the domain shift problem can be detrimental to the model performance. On the other hand, neural style transfer is more consistent in improving test performance when presented with large color variations in the wild.
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v4 [stat.ML] UPDATED)
    The recently developed matrix-based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semi-definite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property has led to the new information measurement being widely adopted in multiple statistical inference and learning tasks. However, the computation of such a quantity involves the trace operator on a PSD matrix $G$ to power $\alpha$ (i.e., $tr(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into a matrix-vector multiplication problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.
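    For orientation, the $O(n^3)$ baseline that this abstract aims to speed up can be sketched as follows. This is a minimal illustration of the exact eigenvalue route to $tr(G^\alpha)$, not the paper's Taylor, Chebyshev or Lanczos approximations. It assumes the usual convention that $G$ is scaled to unit trace and that the entropy is $\frac{1}{1-\alpha}\log_2 tr(G^\alpha)$; the function name is hypothetical.

```python
import numpy as np


def renyi_entropy_exact(G, alpha):
    """Exact matrix-based Renyi entropy via a full eigendecomposition (O(n^3))."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    G = np.asarray(G, dtype=float)
    G = G / np.trace(G)                      # assumed convention: unit trace
    eigvals = np.linalg.eigvalsh(G)          # real eigenvalues of a symmetric matrix
    eigvals = np.clip(eigvals, 0.0, None)    # clip tiny negative round-off values
    trace_power = np.sum(eigvals ** alpha)   # tr(G^alpha) from the eigenspectrum
    return np.log2(trace_power) / (1.0 - alpha)


# Sanity check: the identity matrix has a flat spectrum, so the
# entropy should be close to log2(n) for any admissible alpha.
n = 8
print(renyi_entropy_exact(np.eye(n), alpha=2.0), np.log2(n))
```

    The eigendecomposition is the cubic-cost step; the approximations described in the abstract replace it with matrix-vector products.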
    Diffusion models as plug-and-play priors. (arXiv:2206.09012v3 [cs.LG] UPDATED)
    We consider the problem of inferring high-dimensional data $\mathbf{x}$ in a model that consists of a prior $p(\mathbf{x})$ and an auxiliary differentiable constraint $c(\mathbf{x},\mathbf{y})$ on $\mathbf{x}$ given some additional information $\mathbf{y}$. In this paper, the prior is an independently trained denoising diffusion generative model. The auxiliary constraint is expected to have a differentiable form, but can come from diverse sources. The possibility of such inference turns diffusion models into plug-and-play modules, thereby allowing a range of potential applications in adapting models to new domains and tasks, such as conditional generation or image segmentation. The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise at each step. Considering many noised versions of $\mathbf{x}$ in evaluation of its fitness is a novel search mechanism that may lead to new algorithms for solving combinatorial optimization problems.
    Automatic Differentiation of Programs with Discrete Randomness. (arXiv:2210.08572v3 [cs.LG] UPDATED)
    Automatic differentiation (AD), a technique for constructing new programs which compute the derivative of an original program, has become ubiquitous throughout scientific computing and deep learning due to the improved performance afforded by gradient-based optimization. However, AD systems have been restricted to the subset of programs that have a continuous dependence on parameters. Programs that have discrete stochastic behaviors governed by distribution parameters, such as flipping a coin with probability $p$ of being heads, pose a challenge to these systems because the connection between the result (heads vs tails) and the parameters ($p$) is fundamentally discrete. In this paper we develop a new reparameterization-based methodology that allows for generating programs whose expectation is the derivative of the expectation of the original program. We showcase how this method gives an unbiased and low-variance estimator which is as automated as traditional AD mechanisms. We demonstrate unbiased forward-mode AD of discrete-time Markov chains, agent-based models such as Conway's Game of Life, and unbiased reverse-mode AD of a particle filter. Our code package is available at https://github.com/gaurav-arya/StochasticAD.jl.
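    The coin-flip difficulty described in this abstract can be made concrete. The sketch below does not reproduce the paper's reparameterization-based method or the StochasticAD.jl API. It uses the classical score-function (REINFORCE) estimator as a stand-in to show that an unbiased derivative estimate for a Bernoulli($p$) program exists even though each sample is discrete. All function names are hypothetical.

```python
import random


def exact_derivative(f):
    # E[f(X)] = p*f(1) + (1-p)*f(0), so d/dp E[f(X)] = f(1) - f(0).
    return f(1) - f(0)


def score_function_sample(f, p, rng):
    """One sample of f(X) * d/dp log P(X; p) for X ~ Bernoulli(p)."""
    x = 1 if rng.random() < p else 0
    score = x / p - (1 - x) / (1 - p)
    return f(x) * score


def expected_estimate(f, p):
    # Expectation of the estimator, computed by enumerating both outcomes.
    return p * (f(1) / p) + (1 - p) * (-f(0) / (1 - p))


def payoff(x):
    # Arbitrary illustrative function of the coin outcome.
    return 3 * x + 2


rng = random.Random(0)
samples = [score_function_sample(payoff, 0.3, rng) for _ in range(100_000)]
print(exact_derivative(payoff), expected_estimate(payoff, 0.3), sum(samples) / len(samples))
```

    The enumerated expectation matches the exact derivative, which is what unbiasedness means here. The score-function estimator is known for high variance, which is the kind of limitation the low-variance estimator in the abstract is meant to address.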
    ExcelFormer: A Neural Network Surpassing GBDTs on Tabular Data. (arXiv:2301.02819v1 [cs.LG])
    Though neural networks have achieved enormous breakthroughs in various fields (e.g., computer vision) in supervised learning, they have thus far trailed the performance of GBDTs on tabular data. Delving into this issue, we identify that a proper handling of feature interactions and feature embedding is crucial to the success of neural networks on tabular data. We develop a novel neural network called ExcelFormer, which alternates between two attention modules that respectively perform careful feature interactions and feature embedding updates. A bespoke training methodology is jointly introduced to improve model performance. By initializing parameters with minuscule values, these attention modules are attenuated when the training begins, and the effects of feature interactions and embedding updates progressively grow to optimum levels under the guidance of the proposed regularization approaches Swap-Mix and Hidden-Mix as the training proceeds. Experiments on 25 public tabular datasets show that our ExcelFormer is superior to extremely-tuned GBDTs, which is an unprecedented achievement of neural networks in supervised tabular learning.
    Learning Symbolic Representations for Reinforcement Learning of Non-Markovian Behavior. (arXiv:2301.02952v1 [cs.LG])
    Many real-world reinforcement learning (RL) problems necessitate learning complex, temporally extended behavior that may only receive reward signal when the behavior is completed. If the reward-worthy behavior is known, it can be specified in terms of a non-Markovian reward function - a function that depends on aspects of the state-action history, rather than just the current state and action. Such reward functions yield sparse rewards, necessitating an inordinate number of experiences to find a policy that captures the reward-worthy pattern of behavior. Recent work has leveraged Knowledge Representation (KR) to provide a symbolic abstraction of aspects of the state that summarize reward-relevant properties of the state-action history and support learning a Markovian decomposition of the problem in terms of an automaton over the KR. Providing such a decomposition has been shown to vastly improve learning rates, especially when coupled with algorithms that exploit automaton structure. Nevertheless, such techniques rely on a priori knowledge of the KR. In this work, we explore how to automatically discover useful state abstractions that support learning automata over the state-action history. The result is an end-to-end algorithm that can learn optimal policies with significantly fewer environment samples than state-of-the-art RL on simple non-Markovian domains.
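To illustrate what a non-Markovian reward and its automaton decomposition look like, here is a minimal sketch. The task ("key" must be observed before "door"), the state names, and the symbols are all hypothetical; the abstract does not specify any particular domain, and this hand-written automaton is exactly the a priori knowledge the paper aims to learn instead.

```python
# Hypothetical task: reward is given only if "key" is seen and later "door".
# The reward depends on history, but becomes Markovian in (automaton state).
TRANSITIONS = {
    ("start", "key"): "has_key",
    ("has_key", "door"): "done",
}


def step(state, symbol):
    # Unknown (state, symbol) pairs leave the automaton state unchanged.
    return TRANSITIONS.get((state, symbol), state)


def episode_reward(symbols):
    # Run the automaton over the observed symbol sequence.
    state = "start"
    for symbol in symbols:
        state = step(state, symbol)
    return 1.0 if state == "done" else 0.0
```

For example, `episode_reward(["door", "key", "door"])` yields a reward, while `episode_reward(["door", "key"])` does not, even though both episodes visit the same symbols.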
    An open unified deep graph learning framework for discovering drug leads. (arXiv:2301.03424v1 [q-bio.BM])
    Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.
    Generalized Kernel Regularized Least Squares. (arXiv:2209.14355v2 [stat.ML] UPDATED)
    Kernel Regularized Least Squares (KRLS) is a popular method for flexibly estimating models that may have complex relationships between variables. However, its usefulness to many researchers is limited for two reasons. First, existing approaches are inflexible and do not allow KRLS to be combined with theoretically-motivated extensions such as random effects, unregularized fixed effects, or non-Gaussian outcomes. Second, estimation is extremely computationally intensive for even modestly sized datasets. Our paper addresses both concerns by introducing generalized KRLS (gKRLS). We note that KRLS can be re-formulated as a hierarchical model thereby allowing easy inference and modular model construction where KRLS can be used alongside random effects, splines, and unregularized fixed effects. Computationally, we also implement random sketching to dramatically accelerate estimation while incurring a limited penalty in estimation quality. We demonstrate that gKRLS can be fit on datasets with tens of thousands of observations in under one minute. Further, state-of-the-art techniques that require fitting the model over a dozen times (e.g. meta-learners) can be estimated quickly.
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v2 [cs.LG] UPDATED)
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.
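As a point of reference for what a Wasserstein-2 barycenter is, the one-dimensional Gaussian case has a closed form: the barycenter's quantile function is the weighted average of the input quantile functions, so both the mean and the standard deviation are averaged. The sketch below covers only this special case; it is not the paper's generative algorithm, and the function name is hypothetical.

```python
def gaussian_w2_barycenter_1d(means, stds, weights):
    # Normalize the weights so they sum to one.
    total = sum(weights)
    lam = [w / total for w in weights]
    # In 1-D, quantile functions are averaged; for Gaussians this averages
    # the means and the standard deviations (not the variances).
    mean = sum(l * m for l, m in zip(lam, means))
    std = sum(l * s for l, s in zip(lam, stds))
    return mean, std
```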
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v5 [cs.LG] UPDATED)
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with $G$ a group, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of ``$G$-invariant neural architecture design'': What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or ``shallow'' neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$, and we prove a second theorem that characterizes the inclusion maps or ``network morphisms'' between the architectures that can be leveraged during neural architecture search (NAS). The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons; the classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. The $G$-SNN architectures corresponding to nontrivial cohomology classes have, to our knowledge, never been explicitly identified in the literature previously. Using a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. Finally, we prove that architectures corresponding to inequivalent cohomology classes coincide in function space only when their weight matrices are zero, and we discuss the implications of this for NAS.
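A small worked instance may help: take G = {I, -I} acting on the input by sign flip. Pairing each hidden ReLU neuron with weight w with a partner of weight -w that shares the same output weight gives a G-invariant shallow network, since the group action simply swaps the two neurons. This is only one simple architecture of the kind the classification covers; the names below are hypothetical and the sketch does not reproduce the paper's enumeration code.

```python
def relu(z):
    return max(0.0, z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def invariant_shallow_net(x, weights, out_weights):
    # Each hidden unit w is paired with its orbit partner -w, and both share
    # one output weight, so f(x) = sum_k a_k * |w_k . x| and f(-x) = f(x).
    total = 0.0
    for w, a in zip(weights, out_weights):
        z = dot(w, x)
        total += a * (relu(z) + relu(-z))
    return total
```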
    Deep Injective Prior for Inverse Scattering. (arXiv:2301.03092v1 [cs.LG])
    In electromagnetic inverse scattering, we aim to reconstruct object permittivity from scattered waves. Deep learning is a promising alternative to traditional iterative solvers, but it has been used mostly in a supervised framework to regress the permittivity patterns from scattered fields or back-projections. While such methods are fast at test time and achieve good results for specific data distributions, they are sensitive to distribution drift of the scattered fields, which is common in practice. If the distribution of the scattered fields changes due to changes in frequency, the number of transmitters and receivers, or any other real-world factor, an end-to-end neural network must be re-trained or fine-tuned on a new dataset. In this paper, we propose a new data-driven framework for inverse scattering based on deep generative models. We model the target permittivities by a low-dimensional manifold which acts as a regularizer and is learned from data. Unlike supervised methods, which require both scattered fields and target signals, we only need the target permittivities for training; the trained model can then be used with any experimental setup. We show that the proposed framework significantly outperforms traditional iterative methods, especially for strong scatterers, while having comparable reconstruction quality to state-of-the-art deep learning methods like U-Net.
    Joint Liver and Hepatic Lesion Segmentation in MRI using a Hybrid CNN with Transformer Layers. (arXiv:2201.10981v2 [eess.IV] UPDATED)
    Deep learning-based segmentation of the liver and hepatic lesions is steadily gaining relevance in clinical practice due to the increasing incidence of liver cancer each year. Although various network variants with overall promising results in medical image segmentation have been developed in recent years, almost all of them struggle to accurately segment hepatic lesions in magnetic resonance imaging (MRI). This led to the idea of combining elements of convolutional and transformer-based architectures to overcome the existing limitations. This work presents a hybrid network called SWTR-Unet, consisting of a pretrained ResNet, transformer blocks, and a common Unet-style decoder path. This network was primarily applied to single-modality non-contrast-enhanced liver MRI and additionally to the publicly available computed tomography (CT) data of the liver tumor segmentation (LiTS) challenge to verify its applicability to other modalities. For a broader evaluation, multiple state-of-the-art networks were implemented and applied, ensuring direct comparability. Furthermore, a correlation analysis and an ablation study were carried out to investigate various factors influencing the segmentation accuracy of the presented method. With average Dice scores of 98±2% for liver and 81±28% for lesion segmentation on the MRI dataset, and 97±2% and 79±25%, respectively, on the CT dataset, the proposed SWTR-Unet proved to be a precise approach for liver and hepatic lesion segmentation, with state-of-the-art results for MRI and competitive accuracy in CT imaging. The achieved segmentation accuracy was found to be on par with manually performed expert segmentations, as indicated by inter-observer variabilities for liver lesion segmentation. In conclusion, the presented method could save valuable time and resources in clinical practice.
    Deepfake CAPTCHA: A Method for Preventing Fake Calls. (arXiv:2301.03064v1 [cs.CR])
    Deep learning technology has made it possible to generate realistic content of specific individuals. These `deepfakes' can now be generated in real-time which enables attackers to impersonate people over audio and video calls. Moreover, some methods only need a few images or seconds of audio to steal an identity. Existing defenses perform passive analysis to detect fake content. However, with the rapid progress of deepfake quality, this may be a losing game. In this paper, we propose D-CAPTCHA: an active defense against real-time deepfakes. The approach is to force the adversary into the spotlight by challenging the deepfake model to generate content which exceeds its capabilities. By doing so, passive detection becomes easier since the content will be distorted. In contrast to existing CAPTCHAs, we challenge the AI's ability to create content as opposed to its ability to classify content. In this work we focus on real-time audio deepfakes and present preliminary results on video. In our evaluation we found that D-CAPTCHA outperforms state-of-the-art audio deepfake detectors with an accuracy of 91-100% depending on the challenge (compared to 71% without challenges). We also performed a study on 41 volunteers to understand how threatening current real-time deepfake attacks are. We found that the majority of the volunteers could not tell the difference between real and fake audio.
    Upward lightning at wind turbines: Risk assessment from larger-scale meteorology. (arXiv:2301.03360v1 [stat.ML])
    Upward lightning (UL) has become an increasingly important threat to wind turbines as ever more of them are being installed for renewably producing electricity. The taller the wind turbine, the higher the risk that the type of lightning striking the man-made structure is UL. UL can be much more destructive than downward lightning due to its long-lasting initial continuous current, which leads to a large charge transfer within the lightning discharge process. Current standards for the risk assessment of lightning at wind turbines mainly take the summer lightning activity into account, which is inferred from lightning location systems (LLS). Ground truth lightning current measurements reveal that less than 50% of UL might be detected by LLS. This leads to a large underestimation of the proportion of LLS-non-detectable UL at wind turbines, which is the dominant lightning type in the cold season. This study aims to assess the risk of LLS-detectable and LLS-non-detectable UL at wind turbines using direct UL measurements at the Gaisberg Tower (Austria) and Säntis Tower (Switzerland). Direct UL observations are linked to meteorological reanalysis data and joined by random forests, a powerful machine learning technique. The meteorological drivers for the occurrence and non-occurrence of LLS-detectable and LLS-non-detectable UL, respectively, are found from the random forest models trained at the towers, which have large predictive skill on independent data. In a second step, the results from the tower-trained models are extended to a larger study domain (Central and Northern Germany). The tower-trained model for LLS-detectable lightning is independently verified at wind turbine locations in that domain and found to reliably diagnose that type of UL. Risk maps based on case study events show that high diagnosed probabilities in the study domain coincide with actual UL events.
    $\mathcal{Y}$-Tuning: An Efficient Tuning Paradigm for Large-Scale Pre-Trained Models via Label Representation Learning. (arXiv:2202.09817v2 [cs.CL] UPDATED)
    With the success of large-scale pre-trained models (PTMs), how to efficiently adapt PTMs to downstream tasks has attracted tremendous attention, especially for PTMs with billions of parameters. Although some parameter-efficient tuning paradigms have been proposed to address this problem, they still require large resources to compute the gradients in the training phase. In this paper, we propose $\mathcal{Y}$-Tuning, an efficient yet effective paradigm to adapt frozen large-scale PTMs to specific downstream tasks. $\mathcal{Y}$-tuning learns dense representations for the labels $\mathcal{Y}$ defined in a given task and aligns them to fixed feature representations. Because it tunes neither the features of the input text nor the model parameters, $\mathcal{Y}$-tuning is both parameter-efficient and training-efficient. For $\text{DeBERTa}_\text{XXL}$ with 1.6 billion parameters, $\mathcal{Y}$-tuning achieves more than $96\%$ of the performance of full fine-tuning on the GLUE Benchmark with only $2\%$ tunable parameters and a much lower training cost.
    Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans. (arXiv:2209.13020v12 [cs.CY] UPDATED)
    We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda embedding legal knowledge and reasoning in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
    So3krates: Equivariant attention for interactions on arbitrary length-scales in molecular systems. (arXiv:2205.14276v3 [cs.LG] UPDATED)
    The application of machine learning methods in quantum chemistry has enabled the study of numerous chemical phenomena that are computationally intractable with traditional ab-initio methods. However, some quantum mechanical properties of molecules and materials depend on non-local electronic effects, which are often neglected due to the difficulty of modeling them efficiently. This work proposes a modified attention mechanism adapted to the underlying physics, which makes it possible to recover the relevant non-local effects. Namely, we introduce spherical harmonic coordinates (SPHCs) to reflect higher-order geometric information for each atom in a molecule, enabling a non-local formulation of attention in the SPHC space. Our proposed model So3krates - a self-attention based message passing neural network - uncouples geometric information from atomic features, making them independently amenable to attention mechanisms. Thereby we construct spherical filters, which extend the concept of continuous filters in Euclidean space to SPHC space and serve as the foundation for a spherical self-attention mechanism. We show that, in contrast to other published methods, So3krates is able to describe non-local quantum mechanical effects over arbitrary length scales. Further, we find evidence that the inclusion of higher-order geometric correlations increases data efficiency and improves generalization. So3krates matches or exceeds state-of-the-art performance on popular benchmarks while requiring a significantly lower number of parameters (0.25-0.4x) and giving a substantial speedup (6-14x for training and 2-11x for inference) compared to other models.
    Contrastive Trajectory Similarity Learning with Dual-Feature Attention. (arXiv:2210.05155v2 [cs.DB] UPDATED)
    Trajectory similarity measures act as query predicates in trajectory databases, making them the key player in determining the query results. They also have a heavy impact on the query efficiency. An ideal measure should have the capability to accurately evaluate the similarity between any two trajectories in a very short amount of time. Towards this aim, we propose a contrastive learning-based trajectory modeling method named TrajCL. We present four trajectory augmentation methods and a novel dual-feature self-attention-based trajectory backbone encoder. The resultant model can jointly learn both the spatial and the structural patterns of trajectories. Our model does not involve any recurrent structures and thus has a high efficiency. Besides, our pre-trained backbone encoder can be fine-tuned towards other computationally expensive measures with minimal supervision data. Experimental results show that TrajCL is consistently and significantly more accurate than the state-of-the-art trajectory similarity measures. After fine-tuning, i.e., to serve as an estimator for heuristic measures, TrajCL can even outperform the state-of-the-art supervised method by up to 56% in the accuracy for processing trajectory similarity queries.
    Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation. (arXiv:2301.03125v1 [stat.ML])
    Stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, offering strong convergence guarantees and superior robustness compared with classic stochastic gradient descent (SGD) methods at little to no added computational overhead. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds for M-SPP derived through the lens of algorithmic stability theory. In particular, under smoothness and quadratic growth conditions, we show that M-SPP with minibatch size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results for SPP-type approaches by revealing the impact of the noise level of the model on the convergence rate. In the complementary small-$T$-large-$n$ regime, we provide a two-phase extension of M-SPP to achieve comparable convergence rates. Moreover, we derive a near-tight high-probability (over the randomness of data) bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP. Numerical evidence is provided to support our theoretical predictions when specialized to Lasso and logistic regression models.
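To show what one minibatch proximal point update looks like, here is a minimal sketch for scalar least squares, where the proximal subproblem has a closed form. Each step minimizes (1/2n) sum_i (a_i w - b_i)^2 + (1/(2 gamma)) (w - w_prev)^2 over w. The scalar model, the step-size name `gamma`, and the function names are assumptions for illustration; the paper treats general convex composite problems, which this toy case does not cover.

```python
import random


def prox_step(w_prev, batch, gamma):
    # Closed-form minimizer of the minibatch loss plus the proximal term:
    # w = (mean(a*b) + w_prev/gamma) / (mean(a*a) + 1/gamma).
    n = len(batch)
    mean_ab = sum(a * b for a, b in batch) / n
    mean_aa = sum(a * a for a, _ in batch) / n
    return (mean_ab + w_prev / gamma) / (mean_aa + 1.0 / gamma)


def m_spp(data, batch_size, gamma, iterations, seed=0):
    # Draw a fresh minibatch (without replacement within the batch) per step.
    rng = random.Random(seed)
    w = 0.0
    for _ in range(iterations):
        batch = rng.sample(data, batch_size)
        w = prox_step(w, batch, gamma)
    return w
```

Unlike an SGD step, the update stays stable for any positive `gamma`, because the minibatch loss is minimized implicitly rather than linearized.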
    AnycostFL: Efficient On-Demand Federated Learning over Heterogeneous Edge Devices. (arXiv:2301.03062v1 [cs.LG])
    In this work, we investigate the challenging problem of on-demand federated learning (FL) over heterogeneous edge devices with diverse resource constraints. We propose a cost-adjustable FL framework, named AnycostFL, that enables diverse edge devices to efficiently perform local updates under a wide range of efficiency constraints. To this end, we design model shrinking to support local model training with elastic computation cost, and gradient compression to allow parameter transmission with dynamic communication overhead. An enhanced parameter aggregation is conducted in an element-wise manner to improve model performance. Focusing on AnycostFL, we further propose an optimization design to minimize the global training loss under personalized latency and energy constraints. Building on theoretical insights from the convergence analysis, personalized training strategies are derived for different devices to match their locally available resources. Experimental results indicate that, compared to state-of-the-art efficient FL algorithms, our learning framework can reduce the training latency and energy consumption by up to 1.9 times while reaching a reasonable global testing accuracy. Moreover, the results demonstrate that our approach significantly improves the converged global accuracy.
    Facial Misrecognition Systems: Simple Weight Manipulations Force DNNs to Err Only on Specific Persons. (arXiv:2301.03118v1 [cs.CR])
    In this paper we describe how to plant novel types of backdoors in any facial recognition model based on the popular architecture of deep Siamese neural networks, by mathematically changing a small fraction of its weights (i.e., without using any additional training or optimization). These backdoors force the system to err only on specific persons which are preselected by the attacker. For example, we show how such a backdoored system can take any two images of a particular person and decide that they represent different persons (an anonymity attack), or take any two images of a particular pair of persons and decide that they represent the same person (a confusion attack), with almost no effect on the correctness of its decisions for other persons. Uniquely, we show that multiple backdoors can be independently installed by multiple attackers who may not be aware of each other's existence with almost no interference. We have experimentally verified the attacks on a FaceNet-based facial recognition system, which achieves SOTA accuracy on the standard LFW dataset of $99.35\%$. When we tried to individually anonymize ten celebrities, the network failed to recognize two of their images as being the same person in $96.97\%$ to $98.29\%$ of the time. When we tried to confuse between the extremely different looking Morgan Freeman and Scarlett Johansson, for example, their images were declared to be the same person in $91.51 \%$ of the time. For each type of backdoor, we sequentially installed multiple backdoors with minimal effect on the performance of each one (for example, anonymizing all ten celebrities on the same model reduced the success rate for each celebrity by no more than $0.91\%$). In all of our experiments, the benign accuracy of the network on other persons was degraded by no more than $0.48\%$ (and in most cases, it remained above $99.30\%$).
    Generalized adaptive smoothing based neural network architecture for traffic state estimation. (arXiv:2301.03439v1 [eess.SY])
    The adaptive smoothing method (ASM) is a standard data-driven technique used in traffic state estimation. The ASM has free parameters which, in practice, are set to generally acceptable values based on intuition. However, we note that the heuristically chosen values often result in unphysical predictions by the ASM. In this work, we propose a neural network based on the ASM which tunes those parameters automatically by learning from sparse data from road sensors. We refer to it as the adaptive smoothing neural network (ASNN). We also propose a modified ASNN (MASNN), which makes it a strong learner by using ensemble averaging. The ASNN and MASNN are trained and tested on two real-world datasets. Our experiments reveal that the ASNN and the MASNN outperform the conventional ASM.
    L-SeqSleepNet: Whole-cycle Long Sequence Modelling for Automatic Sleep Staging. (arXiv:2301.03441v1 [eess.SP])
    Human sleep is cyclical with a period of approximately 90 minutes, implying long temporal dependency in the sleep data. Yet this long-term dependency has remained unexplored in the development of sleep staging models. In this work, we show that while encoding the logic of a whole sleep cycle is crucial to improving sleep staging performance, the sequential modelling approach in existing state-of-the-art deep learning models is inefficient for that purpose. We then introduce a method for efficient long sequence modelling and propose a new deep learning model, L-SeqSleepNet, incorporating this method to take into account whole-cycle sleep information for sleep staging. Evaluating L-SeqSleepNet on a set of four distinct databases of various sizes, we demonstrate state-of-the-art performance obtained by the model over three different EEG setups, including scalp EEG in conventional Polysomnography (PSG), in-ear EEG, and around-the-ear EEG (cEEGrid), even with a single-EEG channel input. Our analyses also show that L-SeqSleepNet is able to remedy the effect of N2 sleep (the major class in terms of classification) to bring down errors in other sleep stages and that the network largely reduces exceptionally high errors seen in many subjects. Finally, the computation time only grows at a sub-linear rate when the sequence length increases.
    Fair Clustering Under a Bounded Cost. (arXiv:2106.07239v2 [cs.LG] UPDATED)
    Clustering is a fundamental unsupervised learning problem where a dataset is partitioned into clusters that consist of nearby points in a metric space. A recent variant, fair clustering, associates a color with each point representing its group membership and requires that each color has (approximately) equal representation in each cluster to satisfy group fairness. In this model, the cost of the clustering objective increases due to enforcing fairness in the algorithm. The relative increase in the cost, the ''price of fairness,'' can indeed be unbounded. Therefore, in this paper we propose to treat an upper bound on the clustering objective as a constraint on the clustering problem, and to maximize equality of representation subject to it. We consider two fairness objectives: the group utilitarian objective and the group egalitarian objective, as well as the group leximin objective which generalizes the group egalitarian objective. We derive fundamental lower bounds on the approximation of the utilitarian and egalitarian objectives and introduce algorithms with provable guarantees for them. For the leximin objective we introduce an effective heuristic algorithm. We further derive impossibility results for other natural fairness objectives. We conclude with experimental results on real-world datasets that demonstrate the validity of our algorithms.
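The notion of color representation per cluster can be made concrete with a short sketch. The code computes, for each cluster, the fraction of points of each color, and reports the smallest such fraction over all clusters and colors. This is only a simple summary statistic for illustration; it is not claimed to match the paper's utilitarian, egalitarian, or leximin objectives, and the function names are hypothetical.

```python
from collections import Counter


def color_proportions(assignments, colors):
    # Count colors per cluster, then normalize to fractions.
    per_cluster = {}
    for cluster, color in zip(assignments, colors):
        per_cluster.setdefault(cluster, Counter())[color] += 1
    return {
        cluster: {col: n / sum(counts.values()) for col, n in counts.items()}
        for cluster, counts in per_cluster.items()
    }


def worst_representation(assignments, colors):
    # Smallest fraction any color attains in any cluster (0.0 if absent).
    all_colors = set(colors)
    props = color_proportions(assignments, colors)
    return min(p.get(col, 0.0) for p in props.values() for col in all_colors)
```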
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v1 [math.ST])
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.
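As a worked instance of a point (T p, T q) in the joint range set, the sketch below uses binary randomized response, a standard channel that satisfies epsilon-LDP by reporting the true bit with probability e^eps / (1 + e^eps). It applies the channel to two binary distributions and measures how the total variation distance shrinks. Whether this channel is optimal for a given pair is not claimed here; the function names are hypothetical.

```python
import math


def randomized_response_channel(eps):
    # T[out][in]: report the true bit with probability e^eps / (1 + e^eps).
    keep = math.exp(eps) / (1.0 + math.exp(eps))
    return [[keep, 1.0 - keep], [1.0 - keep, keep]]


def apply_channel(T, p):
    # Output distribution T p for an input distribution p.
    return [sum(T[o][i] * p[i] for i in range(len(p))) for o in range(len(T))]


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

For this channel the total variation distance between the privatized distributions equals tanh(eps / 2) times the original distance, which makes the cost of privacy in distinguishing p from q explicit.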
    Improved Training of Physics-Informed Neural Networks with Model Ensembles. (arXiv:2204.05108v2 [cs.LG] UPDATED)
    Learning the solution of partial differential equations (PDEs) with a neural network (known in the literature as a physics-informed neural network, PINN) is an attractive alternative to traditional solvers due to its elegancy, greater flexibility and the ease of incorporating observed data. However, training PINNs is notoriously difficult in practice. One problem is the existence of multiple simple (but wrong) solutions which are attractive for PINNs when the solution interval is too large. In this paper, we propose to expand the solution interval gradually to make the PINN converge to the correct solution. To find a good schedule for the solution interval expansion, we train an ensemble of PINNs. The idea is that all ensemble members converge to the same solution in the vicinity of observed data (e.g., initial conditions) while they may be pulled towards different wrong solutions farther away from the observations. Therefore, we use the ensemble agreement as the criterion for including new points for computing the loss derived from PDEs. We show experimentally that the proposed method can improve the accuracy of the found solution.
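The ensemble-agreement criterion can be sketched as follows: a candidate point is admitted to the PDE loss only if the ensemble members' predictions at that point are close to one another. Using the population variance as the agreement measure and a fixed `threshold` are assumptions made for illustration; the abstract does not state the exact criterion, and the ensemble members below are plain functions standing in for trained networks.

```python
from statistics import pvariance


def points_in_agreement(candidate_points, ensemble, threshold):
    # Keep the points where the ensemble predictions have low variance.
    selected = []
    for x in candidate_points:
        predictions = [model(x) for model in ensemble]
        if pvariance(predictions) <= threshold:
            selected.append(x)
    return selected


# Stand-in "models" that agree near x = 0 and drift apart farther away.
toy_ensemble = [
    lambda x: x,
    lambda x: x + 0.1 * x * x,
    lambda x: x - 0.1 * x * x,
]
```

With these stand-ins, points near the origin pass the check while distant points are deferred, mirroring the gradual expansion of the solution interval described above.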
    Discovering and Explaining the Representation Bottleneck of Graph Neural Networks from Multi-order Interactions. (arXiv:2205.07266v4 [cs.LG] UPDATED)
    Graph neural networks (GNNs) mainly rely on the message-passing paradigm to propagate node features and build interactions, and different graph learning tasks require different ranges of node interactions. In this work, we explore the capacity of GNNs to capture interactions between nodes under contexts with different complexities. We discover that GNNs are usually unable to capture the most informative kinds of interaction styles for diverse graph learning tasks, and thus name this phenomenon as GNNs' representation bottleneck. As a response, we demonstrate that the inductive bias introduced by existing graph construction mechanisms can prevent GNNs from learning interactions of the most appropriate complexity, i.e., resulting in the representation bottleneck. To address that limitation, we propose a novel graph rewiring approach based on interaction patterns learned by GNNs to adjust the receptive fields of each node dynamically. Extensive experiments on both real-world and synthetic datasets prove the effectiveness of our algorithm to alleviate the representation bottleneck and its superiority to enhance the performance of GNNs over state-of-the-art graph rewiring baselines.
    Can Foundation Models Help Us Achieve Perfect Secrecy? (arXiv:2205.13722v2 [cs.LG] UPDATED)
    A key promise of machine learning is the ability to assist users with personal tasks. Because the personal context required to make accurate predictions is often sensitive, we require systems that protect privacy. A gold standard privacy-preserving system will satisfy perfect secrecy, meaning that interactions with the system provably reveal no private information. However, privacy and quality appear to be in tension in existing systems for personal tasks. Neural models typically require copious amounts of training to perform well, while individual users typically hold a limited scale of data, so federated learning (FL) systems propose to learn from the aggregate data of multiple users. FL does not provide perfect secrecy, but rather practitioners apply statistical notions of privacy -- i.e., the probability of learning private information about a user should be reasonably low. The strength of the privacy guarantee is governed by privacy parameters. Numerous privacy attacks have been demonstrated on FL systems and it can be challenging to reason about the appropriate privacy parameters for a privacy-sensitive use case. Therefore our work proposes a simple baseline for FL, which both provides the stronger perfect secrecy guarantee and does not require setting any privacy parameters. We initiate the study of when and where an emerging tool in ML -- the in-context learning abilities of recent pretrained models -- can be an effective baseline alongside FL. We find in-context learning is competitive with strong FL baselines on 6 of 7 popular benchmarks from the privacy literature and a real-world case study, which is disjoint from the pretraining data. We release our code here: https://github.com/simran-arora/focus
    Kantorovich Strikes Back! Wasserstein GANs are not Optimal Transport? (arXiv:2206.07767v2 [cs.LG] UPDATED)
    Wasserstein Generative Adversarial Networks (WGANs) are the popular generative models built on the theory of Optimal Transport (OT) and the Kantorovich duality. Despite the success of WGANs, it is still unclear how well the underlying OT dual solvers approximate the OT cost (Wasserstein-1 distance, $\mathbb{W}_{1}$) and the OT gradient needed to update the generator. In this paper, we address these questions. We construct 1-Lipschitz functions and use them to build ray monotone transport plans. This strategy yields pairs of continuous benchmark distributions with the analytically known OT plan, OT cost and OT gradient in high-dimensional spaces such as spaces of images. We thoroughly evaluate popular WGAN dual form solvers (gradient penalty, spectral normalization, entropic regularization, etc.) using these benchmark pairs. Even though these solvers perform well in WGANs, none of them faithfully compute $\mathbb{W}_{1}$ in high dimensions. Nevertheless, many provide a meaningful approximation of the OT gradient. These observations suggest that these solvers should not be treated as good estimators of $\mathbb{W}_{1}$, but to some extent they indeed can be used in variational problems requiring the minimization of $\mathbb{W}_{1}$.
    Making Decisions under Outcome Performativity. (arXiv:2210.01745v2 [cs.LG] UPDATED)
    Decision-makers often act in response to data-driven predictions, with the goal of achieving favorable outcomes. In such settings, predictions don't passively forecast the future; instead, predictions actively shape the distribution of outcomes they are meant to predict. This performative prediction setting raises new challenges for learning "optimal" decision rules. In particular, existing solution concepts do not address the apparent tension between the goals of forecasting outcomes accurately and steering individuals to achieve desirable outcomes. To contend with this concern, we introduce a new optimality concept -- performative omniprediction -- adapted from the supervised (non-performative) learning setting. A performative omnipredictor is a single predictor that simultaneously encodes the optimal decision rule with respect to many possibly-competing objectives. Our main result demonstrates that efficient performative omnipredictors exist, under a natural restriction of performative prediction, which we call outcome performativity. On a technical level, our results follow by carefully generalizing the notion of outcome indistinguishability to the outcome performative setting. From an appropriate notion of Performative OI, we recover many consequences known to hold in the supervised setting, such as omniprediction and universal adaptability.
    Annealed Score-Based Diffusion Model for MR Motion Artifact Reduction. (arXiv:2301.03027v1 [eess.IV])
    Motion artifact reduction is one of the important research topics in MR imaging, as the motion artifact degrades image quality and makes diagnosis difficult. Recently, many deep learning approaches have been studied for motion artifact reduction. Unfortunately, most existing models are trained in a supervised manner, requiring paired motion-corrupted and motion-free images, or are based on a strict motion-corruption model, which limits their use for real-world situations. To address this issue, here we present an annealed score-based diffusion model for MRI motion artifact reduction. Specifically, we train a score-based model using only motion-free images, and then motion artifacts are removed by applying forward and reverse diffusion processes repeatedly to gradually impose a low-frequency data consistency. Experimental results verify that the proposed method successfully reduces both simulated and in vivo motion artifacts, outperforming the state-of-the-art deep learning methods.
    Machine-Learning Prediction of the Computed Band Gaps of Double Perovskite Materials. (arXiv:2301.03372v1 [cond-mat.mtrl-sci])
    Prediction of the electronic structure of functional materials is essential for the engineering of new devices. Conventional electronic structure prediction methods based on density functional theory (DFT) suffer from not only high computational cost, but also limited accuracy arising from the approximations of the exchange-correlation functional. Surrogate methods based on machine learning have garnered much attention as a viable alternative to bypass these limitations, especially in the prediction of solid-state band gaps, which motivated this research study. Herein, we construct a random forest regression model for band gaps of double perovskite materials, using a dataset of 1306 band gaps computed with the GLLBSC (Gritsenko, van Leeuwen, van Lenthe, and Baerends solid correlation) functional. Among the 20 physical features employed, we find that the bulk modulus, superconductivity temperature, and cation electronegativity exhibit the highest importance scores, consistent with the physics of the underlying electronic structure. Using the top 10 features, a model accuracy of 85.6% with a root mean square error of 0.64 eV is obtained, comparable to previous studies. Our results are significant in the sense that they attest to the potential of machine learning regressions for the rapid screening of promising candidate functional materials.
    Optimization-based Causal Estimation from Heterogenous Environments. (arXiv:2109.11990v2 [stat.ME] UPDATED)
    This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.
    Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions. (arXiv:2203.09436v4 [math.OC] UPDATED)
    We study stochastic monotone inclusion problems, which widely appear in machine learning applications, including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction. In the cocoercive -- and more generally Lipschitz-monotone -- setup, our algorithm attains $\epsilon$ norm of the operator with $\mathcal{O}(\frac{1}{\epsilon^3})$ stochastic operator evaluations, which significantly improves over state of the art $\mathcal{O}(\frac{1}{\epsilon^4})$ stochastic operator evaluations required for existing monotone inclusion solvers applied to the same problem classes. We further show how to couple one of the proposed variants of stochastic Halpern iteration with a scheduled restart scheme to solve stochastic monotone inclusion problems with ${\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})$ stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.
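    For orientation, the sketch below shows the classical deterministic Halpern iteration $x_{k+1} = \lambda_k x_0 + (1 - \lambda_k) T(x_k)$ with $\lambda_k = 1/(k+2)$ for a nonexpansive map $T$. It is not the stochastic, variance-reduced variant the paper proposes; the test map (a 90-degree rotation, whose only fixed point is the origin) and the function names are assumptions chosen to keep the example small. For a cocoercive operator $F$, a map of the form $T = I - \eta F$ with a suitably small step $\eta$ is nonexpansive, and its fixed points are the zeros of $F$.

```python
import math


def halpern_iteration(T, x0, num_steps):
    """Run x_{k+1} = lam_k * x0 + (1 - lam_k) * T(x_k) with lam_k = 1 / (k + 2)."""
    x = list(x0)
    for k in range(num_steps):
        lam = 1.0 / (k + 2)
        Tx = T(x)
        # Every step is anchored to the starting point x0.
        x = [lam * a + (1.0 - lam) * b for a, b in zip(x0, Tx)]
    return x


def rotate_90(x):
    """Nonexpansive test map: rotate a 2D point by 90 degrees about the origin."""
    return [-x[1], x[0]]


def residual_norm(T, x):
    """Fixed-point residual ||x - T(x)||, which is zero exactly at fixed points."""
    Tx = T(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, Tx)))


x0 = [1.0, 0.0]
x_final = halpern_iteration(rotate_90, x0, 1000)
residual = residual_norm(rotate_90, x_final)
```

    Plain fixed-point iteration $x_{k+1} = T(x_k)$ would cycle forever on this rotation; the anchoring term is what drives the residual toward zero.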
    Differentiable Safe Controller Design through Control Barrier Functions. (arXiv:2209.10034v2 [eess.SY] UPDATED)
    Learning-based controllers, such as neural network (NN) controllers, can show high empirical performance but lack formal safety guarantees. To address this issue, control barrier functions (CBFs) have been applied as a safety filter to monitor and modify the outputs of learning-based controllers in order to guarantee the safety of the closed-loop system. However, such modification can be myopic with unpredictable long-term effects. In this work, we propose a safe-by-construction NN controller which employs differentiable CBF-based safety layers, and investigate the performance of safe-by-construction NN controllers in learning-based control. Specifically, two formulations of controllers are compared: one is projection-based and the other relies on our proposed set-theoretic parameterization. Both methods demonstrate improved closed-loop performance over using CBF as a separate safety filter in numerical experiments.
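    The baseline the abstract compares against, a CBF used as a separate safety filter, can be sketched for a toy system. The example assumes scalar dynamics $\dot x = u$, a safe set $h(x) = x_{\max} - x \ge 0$, and the usual CBF condition $\dot h \ge -\alpha h$; in this case the filter's quadratic program has the closed-form solution $u = \min(u_{\text{nom}}, \alpha (x_{\max} - x))$. The system, the gains, and the function names are illustrative assumptions, not the paper's setup.

```python
def cbf_safety_filter(u_nominal, x, x_max, alpha):
    """Minimally modify u_nominal so that h(x) = x_max - x obeys dh/dt >= -alpha * h.

    For dx/dt = u we have dh/dt = -u, so the condition reduces to
    u <= alpha * (x_max - x); clipping is the closest feasible input.
    """
    u_upper = alpha * (x_max - x)
    return min(u_nominal, u_upper)


def simulate(x0, u_nominal, x_max, alpha, dt, steps):
    """Forward-Euler rollout of dx/dt = u with the filter applied at every step."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = cbf_safety_filter(u_nominal, x, x_max, alpha)
        x = x + dt * u
        trajectory.append(x)
    return trajectory


# A nominal controller that always pushes toward the boundary at x_max = 1.
trajectory = simulate(x0=0.0, u_nominal=1.0, x_max=1.0, alpha=2.0, dt=0.01, steps=500)
```

    The filter only looks one step ahead, which is the myopia the abstract refers to: the nominal controller is unaware that its commands will be clipped near the boundary.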
    Check Your Other Door! Creating Backdoor Attacks in the Frequency Domain. (arXiv:2109.05507v3 [cs.CR] UPDATED)
    Deep Neural Networks (DNNs) are ubiquitous and span a variety of applications ranging from image classification to real-time object detection. As DNN models become more sophisticated, the computational cost of training these models becomes a burden. For this reason, outsourcing the training process has been the go-to option for many DNN users. Unfortunately, this comes at the cost of vulnerability to backdoor attacks. These attacks aim to establish hidden backdoors in the DNN so that it performs well on clean samples, but outputs a particular target label when a trigger is applied to the input. Existing backdoor attacks either generate triggers in the spatial domain or naively poison frequencies in the Fourier domain. In this work, we propose a pipeline based on Fourier heatmaps to generate a spatially dynamic and invisible backdoor attack in the frequency domain. The proposed attack is extensively evaluated on various datasets and network architectures. Unlike most existing backdoor attacks, the proposed attack can achieve high attack success rates with low poisoning rates and little to no drop in performance while remaining imperceptible to the human eye. Moreover, we show that the models poisoned by our attack are resistant to various state-of-the-art (SOTA) defenses, so we contribute two possible defenses that can evade the attack.
    Community detection in multiplex networks based on orthogonal nonnegative matrix tri-factorization. (arXiv:2205.00626v2 [cs.SI] UPDATED)
    Networks are commonly used to model complex systems. The different entities in the system are represented by nodes of the network and their interactions by edges. In most real life systems, the different entities may interact in different ways necessitating the use of multiplex networks where multiple links are used to model the interactions. One of the major tools for inferring network topology is community detection. Although there are numerous works on community detection in single-layer networks, existing community detection methods for multiplex networks mostly learn a common community structure across layers and do not take the heterogeneity across layers into account. In this paper, we introduce a new multiplex community detection method that identifies communities that are common across layers as well as those that are unique to each layer. The proposed method, Multiplex Orthogonal Nonnegative Matrix Tri-Factorization, represents the adjacency matrix of each layer as the sum of two low-rank matrix factorizations corresponding to the common and private communities, respectively. Unlike most of the existing methods, which require the number of communities to be pre-determined, the proposed method also introduces a two stage method to determine the number of common and private communities. The proposed algorithm is evaluated on synthetic and real multiplex networks, as well as for multiview clustering applications, and compared to state-of-the-art techniques.
    Learning Program Representations with a Tree-Structured Transformer. (arXiv:2208.08643v2 [cs.SE] UPDATED)
    Learning vector representations for programs is a critical step in applying deep learning techniques for program understanding tasks. Various neural network models are proposed to learn from tree-structured program representations, e.g., abstract syntax tree (AST) and concrete syntax tree (CST). However, most neural architectures either fail to capture long-range dependencies which are ubiquitous in programs, or cannot learn effective representations for syntax tree nodes, making them incapable of performing the node-level prediction tasks, e.g., bug localization. In this paper, we propose Tree-Transformer, a novel recursive tree-structured neural network to learn the vector representations for source codes. We propose a multi-head attention mechanism to model the dependency between siblings and parent-children node pairs. Moreover, we propose a bi-directional propagation strategy to allow node information passing in two directions, bottom-up and top-down along trees. In this way, Tree-Transformer can learn the information of the node features as well as the global contextual information. The extensive experimental results show that our Tree-Transformer significantly outperforms the existing tree-based and graph-based program representation learning approaches in both the tree-level and node-level prediction tasks.
    VQNet 2.0: A New Generation Machine Learning Framework that Unifies Classical and Quantum. (arXiv:2301.03251v1 [quant-ph])
    With the rapid development of classical and quantum machine learning, a large number of machine learning frameworks have been proposed. However, existing machine learning frameworks usually focus on either classical or quantum machine learning, rather than both. Therefore, based on VQNet 1.0, we further propose VQNet 2.0, a new generation of unified classical and quantum machine learning framework that supports hybrid optimization. The core library of the framework is implemented in C++, the user level is implemented in Python, and the framework supports deployment on quantum and classical hardware. In this article, we analyze the development trend of the new generation of machine learning frameworks and introduce the design principles of VQNet 2.0 in detail: unity, practicality, efficiency, and compatibility, as well as full particulars of implementation. We illustrate the functions of VQNet 2.0 through several basic applications, including classical convolutional neural networks, quantum autoencoders, hybrid classical-quantum networks, etc. Through extensive experiments, we then demonstrate that VQNet 2.0 runs faster than the comparison method. Finally, we demonstrate that VQNet 2.0 can be deployed on different hardware platforms, where its overall calculation speed is also faster than that of the comparison method. It can also be mixed and optimized with quantum circuits composed of multiple quantum computing libraries.
    Deep Insights of Deepfake Technology : A Review. (arXiv:2105.00192v2 [cs.LG] UPDATED)
    Under the aegis of computer vision and deep learning, a new technique has emerged that lets anyone make highly realistic but fake videos and images, and even manipulate voices. This technology is widely known as Deepfake technology. Although it may seem an interesting technique for making fake videos or images of things or individuals, the results can spread as misinformation via the internet. Deepfake content could be dangerous for individuals as well as for communities, organizations, countries, religions, etc. Because Deepfake content creation involves a high level of expertise and combines several deep learning algorithms, the content seems almost real and genuine and is difficult to distinguish from authentic material. In this paper, a wide range of articles has been examined to understand Deepfake technology more extensively. We have examined several articles to find insights such as what Deepfake is, who is responsible for it, whether Deepfake has any benefits, and what the challenges of this technology are. We have also examined several creation and detection techniques. Our study revealed that although Deepfake is a threat to our societies, proper measures and strict regulations could prevent this.
    Locally-symplectic neural networks for learning volume-preserving dynamics. (arXiv:2109.09151v2 [math-ph] UPDATED)
    We propose locally-symplectic neural networks LocSympNets for learning the flow of phase volume-preserving dynamics. The construction of LocSympNets stems from the theorem of the local Hamiltonian description of the divergence-free vector field and the splitting methods based on symplectic integrators. Symplectic gradient modules of the recently proposed symplecticity-preserving neural networks SympNets are used to construct invertible locally-symplectic modules. To further preserve properties of the flow of a dynamical system LocSympNets are extended to symmetric locally-symplectic neural networks SymLocSympNets, such that the inverse of SymLocSympNets is equal to the feed-forward propagation of SymLocSympNets with the negative time step, which is a general property of the flow of a dynamical system. LocSympNets and SymLocSympNets are studied numerically considering learning linear and nonlinear volume-preserving dynamics. We demonstrate learning of linear traveling wave solutions to the semi-discretized advection equation, periodic trajectories of the Euler equations of the motion of a free rigid body, and quasi-periodic solutions of the charged particle motion in an electromagnetic field. LocSympNets and SymLocSympNets can learn linear and nonlinear dynamics to a high degree of accuracy even when random noise is added to the training data. When learning a single trajectory of the rigid body dynamics locally-symplectic neural networks can learn both quadratic invariants of the system with absolute relative errors below 1%. In addition, SymLocSympNets produce qualitatively good long-time predictions, when the learning of the whole system from randomly sampled data is considered. LocSympNets and SymLocSympNets can produce accurate short-time predictions of quasi-periodic solutions, which is illustrated in the example of the charged particle motion in an electromagnetic field.
    Neural parameter calibration for large-scale multi-agent models. (arXiv:2209.13565v2 [math.OC] UPDATED)
    Computational models have become a powerful tool in the quantitative sciences to understand the behaviour of complex systems that evolve in time. However, they often contain a potentially large number of free parameters whose values cannot be obtained from theory but need to be inferred from data. This is especially the case for models in the social sciences, economics, or computational epidemiology. Yet many current parameter estimation methods are mathematically involved and computationally slow to run. In this paper we present a computationally simple and fast method to retrieve accurate probability densities for model parameters using neural differential equations. We present a pipeline comprising multi-agent models acting as forward solvers for systems of ordinary or stochastic differential equations, and a neural network to then extract parameters from the data generated by the model. The two combined create a powerful tool that can quickly estimate densities on model parameters, even for very large systems. We demonstrate the method on synthetic time series data of the SIR model of the spread of infection, and perform an in-depth analysis of the Harris-Wilson model of economic activity on a network, representing a non-convex problem. For the latter, we apply our method both to synthetic data and to data of economic activity across Greater London. We find that our method calibrates the model orders of magnitude more accurately than a previous study of the same dataset using classical techniques, while running between 195 and 390 times faster.
    Retrieving Users' Opinions on Social Media with Multimodal Aspect-Based Sentiment Analysis. (arXiv:2210.15377v2 [cs.IR] UPDATED)
    People post their opinions and experiences on social media, yielding rich databases of end-users' sentiments. This paper shows to what extent machine learning can analyze and structure these databases. An automated data analysis pipeline is deployed to provide insights into user-generated content for researchers in other domains. First, the domain expert can select an image and a term of interest. Then, the pipeline uses image retrieval to find all images showing similar content and applies aspect-based sentiment analysis to outline users' opinions about the selected term. As part of an interdisciplinary project between architecture and computer science researchers, an empirical study of Hamburg's Elbphilharmonie was conducted. To this end, we selected 300 thousand posts with the hashtag "hamburg" from the platform Flickr. Image retrieval methods generated a subset of slightly more than 1.5 thousand images displaying the Elbphilharmonie. We found that these posts mainly convey a neutral or positive sentiment towards it. With this pipeline, we suggest a new semantic computing method that offers novel insights into end-users' opinions, e.g., for architecture domain experts.
    FedDebug: Systematic Debugging for Federated Learning Applications. (arXiv:2301.03553v1 [cs.SE])
    In Federated Learning (FL), clients train a model locally and share it with a central aggregator to build a global model. The inability to access clients' data, together with collaborative training, makes FL appealing for applications with data-privacy concerns such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, finding the round and the clients responsible is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the accuracy or let future FL rounds retune the model, both of which are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances FL debugging on two novel fronts. First, FedDebug enables interactive debugging of real-time collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's breakpoint can help inspect an FL state (round, client, and global model) and seamlessly move between rounds and clients' models, enabling a fine-grained step-by-step inspection. Second, FedDebug automatically identifies the client responsible for lowering the global model's performance without any testing data or labels, both of which are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neuron activations to determine the precise client deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug's interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round's training time. With FedDebug, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.
    Differentially private inference via noisy optimization. (arXiv:2103.11003v3 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a nearly optimal neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.
    MixCycle: Unsupervised Speech Separation via Cyclic Mixture Permutation Invariant Training. (arXiv:2202.03875v2 [eess.AS] UPDATED)
    We introduce two unsupervised source separation methods, which involve self-supervised training from single-channel two-source speech mixtures. Our first method, mixture permutation invariant training (MixPIT), enables learning a neural network model which separates the underlying sources via a challenging proxy task without supervision from the reference sources. Our second method, cyclic mixture permutation invariant training (MixCycle), uses MixPIT as a building block in a cyclic fashion for continuous learning. MixCycle gradually converts the problem from separating mixtures of mixtures into separating single mixtures. We compare our methods to common supervised and unsupervised baselines: permutation invariant training with dynamic mixing (PIT-DM) and mixture invariant training (MixIT). We show that MixCycle outperforms MixIT and reaches a performance level very close to the supervised baseline (PIT-DM) while circumventing the over-separation issue of MixIT. Also, we propose a self-evaluation technique inspired by MixCycle that estimates model performance without utilizing any reference sources. We show that it yields results consistent with an evaluation on reference sources (LibriMix) and also with an informal listening test conducted on a real-life mixtures dataset (REAL-M).
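    The methods above build on permutation invariant training (PIT), the supervised objective behind the PIT-DM baseline. A minimal sketch of a PIT loss follows, assuming mean squared error as the per-source distance and tiny made-up signals; it is not MixPIT or MixCycle, whose proxy tasks are not detailed in the abstract, and real systems often use a different signal-level metric.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pit_loss(estimates, references):
    """Return (loss, permutation) for the best assignment of estimates to references.

    perm[i] is the index of the reference matched to estimates[i]. The loss is
    the average per-source MSE under that assignment, so the separator is not
    penalized for emitting the sources in a different order.
    """
    best_loss, best_perm = None, None
    for perm in permutations(range(len(references))):
        loss = sum(mse(estimates[i], references[j]) for i, j in enumerate(perm))
        loss /= len(references)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two hypothetical reference sources and a separator output that recovers
# them perfectly but in swapped order.
references = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
estimates = [[0.5, 0.5, 0.5], [1.0, 0.0, -1.0]]
loss, perm = pit_loss(estimates, references)
```

    Enumerating all permutations is fine for two or three sources; the cost grows factorially with the number of sources.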
    Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance. (arXiv:2301.03136v1 [cs.CL])
    Extraction of sentiment signals from news text, stock message boards, and business reports, for stock movement prediction, has been a rising field of interest in finance. Building upon past literature, the most recent works attempt to better capture sentiment from sentences with complex syntactic structures by introducing aspect-level sentiment classification (ASC). Despite the growing interest, however, fine-grained sentiment analysis has not been fully explored in non-English literature due to the shortage of annotated finance-specific data. Accordingly, it is necessary for non-English languages to leverage datasets and pre-trained language models (PLMs) of different domains, languages, and tasks to improve their performance. To facilitate finance-specific ASC research in the Korean language, we build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples, and explore methods of intermediate transfer learning. Our experiments indicate that past research has overlooked the potentially wrong knowledge of financial entities encoded during the training phase, which has led to overestimating the predictive power of PLMs. In our work, we use the term "non-stationary knowledge" to refer to information that was previously correct but is likely to change, and present "TGT-Masking", a novel masking pattern to keep PLMs from speculating on knowledge of this kind. Finally, through a series of transfer learning steps with TGT-Masking applied, we improve classification accuracy by 22.63% compared to standalone models on KorFinASC.
    A review of clustering models in educational data science towards fairness-aware learning. (arXiv:2301.03421v1 [cs.LG])
    Ensuring fairness is essential for every education system. Machine learning is increasingly supporting the education system and educational data science (EDS) domain, from decision support to educational activities and learning analytics. However, the machine learning-based decisions can be biased because the algorithms may generate the results based on students' protected attributes such as race or gender. Clustering is an important machine learning technique to explore student data in order to support the decision-maker, as well as support educational activities, such as group assignments. Therefore, ensuring high-quality clustering models along with satisfying fairness constraints are important requirements. This chapter comprehensively surveys clustering models and their fairness in EDS. We especially focus on investigating the fair clustering models applied in educational activities. These models are believed to be practical tools for analyzing students' data and ensuring fairness in EDS.
    Federated Learning with Domain Generalization. (arXiv:2111.10487v2 [cs.LG] UPDATED)
    Federated Learning (FL) enables a group of clients to jointly train a machine learning model with the help of a centralized server. Clients do not need to submit their local data to the server during training, and hence the local training data of clients is protected. In FL, distributed clients collect their local data independently, so the dataset of each client may naturally form a distinct source domain. In practice, the model trained over multiple source domains may have poor generalization performance on unseen target domains. To address this issue, we propose FedADG to equip federated learning with domain generalization capability. FedADG employs the federated adversarial learning approach to measure and align the distributions among different source domains via matching each distribution to a reference distribution. The reference distribution is adaptively generated (by accommodating all source domains) to minimize the domain shift distance during alignment. In FedADG, the alignment is fine-grained since each class is aligned independently. In this way, the learned feature representation is supposed to be universal, so it can generalize well on the unseen domains. Intensive experiments on various datasets demonstrate that FedADG has comparable performance with the state-of-the-art.
    Why Batch Normalization Damage Federated Learning on Non-IID Data?. (arXiv:2301.02982v1 [cs.LG])
    As a promising distributed learning paradigm, federated learning (FL) involves training deep neural network (DNN) models at the network edge while protecting the privacy of the edge clients. To train a large-scale DNN model, batch normalization (BN) has been regarded as a simple and effective means to accelerate the training and improve the generalization capability. However, recent findings indicate that BN can significantly impair the performance of FL in the presence of non-i.i.d. data. While several FL algorithms have been proposed to address this issue, their performance still falls significantly when compared to the centralized scheme. Furthermore, none of them have provided a theoretical explanation of how the BN damages the FL convergence. In this paper, we present the first convergence analysis to show that under the non-i.i.d. data, the mismatch between the local and global statistical parameters in BN causes the gradient deviation between the local and global models, which, as a result, slows down and biases the FL convergence. In view of this, we develop a new FL algorithm that is tailored to BN, called FedTAN, which is capable of achieving robust FL performance under a variety of data distributions via iterative layer-wise parameter aggregation. Comprehensive experimental results demonstrate the superiority of the proposed FedTAN over existing baselines for training BN-based DNN models.  ( 2 min )
    Physics-Informed Kernel Embeddings: Integrating Prior System Knowledge with Data-Driven Control. (arXiv:2301.03565v1 [eess.SY])
    Data-driven control algorithms use observations of system dynamics to construct an implicit model for the purpose of control. However, in practice, data-driven techniques often require excessive sample sizes, which may be infeasible in real-world scenarios where only limited observations of the system are available. Furthermore, purely data-driven methods often neglect useful a priori knowledge, such as approximate models of the system dynamics. We present a method to incorporate such prior knowledge into data-driven control algorithms using kernel embeddings, a nonparametric machine learning technique based on the theory of reproducing kernel Hilbert spaces. Our proposed approach incorporates prior knowledge of the system dynamics as a bias term in the kernel learning problem. We formulate the biased learning problem as a least-squares problem with a regularization term that is informed by the dynamics and that has an efficiently computable, closed-form solution. Through numerical experiments, we empirically demonstrate the improved sample efficiency and out-of-sample generalization of our approach over a purely data-driven baseline. We demonstrate an application of our method to control through a target tracking problem with nonholonomic dynamics, and on spring-mass-damper and F-16 aircraft state prediction tasks.
    Equivariant and Steerable Neural Networks: A review with special emphasis on the symmetric group. (arXiv:2301.03019v1 [cs.LG])
    Convolutional neural networks revolutionized computer vision and natural language processing. Their efficiency, as compared to fully connected neural networks, has its origin in the architecture, where convolutions reflect the translation invariance in space and time in pattern or speech recognition tasks. Recently, Cohen and Welling have put this in the broader perspective of invariance under symmetry groups, which leads to the concept of group equivariant neural networks and more generally steerable neural networks. In this article, we review the architecture of such networks including equivariant layers and filter banks, activation with capsules and group pooling. We apply this formalism to the symmetric group, for which we work out a number of details on representations and capsules that are not found in the literature.  ( 2 min )
    Systems for Parallel and Distributed Large-Model Deep Learning Training. (arXiv:2301.02691v1 [cs.DC])
    Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore increasingly large neural architectures, with some recent Transformer models spanning hundreds of billions of learnable parameters. These designs have introduced new scale-driven systems challenges for the DL space, such as memory bottlenecks, poor runtime efficiency, and high costs of model development. Efforts to address these issues have explored techniques such as parallelization of neural architectures, spilling data across the memory hierarchy, and memory-efficient data representations. This survey will explore the large-model training systems landscape, highlighting key challenges and the various techniques that have been used to address them.  ( 2 min )
    Exploration in Model-based Reinforcement Learning with Randomized Reward. (arXiv:2301.03142v1 [stat.ML])
    Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analysis typically requires optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such deviation between theory and empirical study leads us to question whether a randomized model ensemble guarantees optimism, and hence the optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored direction of exploration with MBRL. We show that under the kernelized nonlinear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis on randomized MBRL with function approximation.  ( 2 min )
    Machine Learning to Estimate Gross Loss of Jewelry for Wax Patterns. (arXiv:2301.02872v1 [cs.LG])
    In mass manufacturing of jewellery, the gross loss is estimated before manufacturing to calculate the wax weight of the pattern that would be investment cast to make multiple identical pieces of jewellery. Machine learning is a part of AI that helps create a model with decision-making capabilities based on a large set of user-defined data. In this paper, the authors found a way to use machine learning in the jewellery industry to estimate this crucial gross loss. Using a small data set of manufactured rings and regression analysis, it was found that there is potential to reduce the estimation error from ±2-3 to ±0.5 using ML algorithms on historic data and attributes collected from the CAD file during the design phase itself. To evaluate the approach's viability, additional study must be undertaken with a larger data set.  ( 2 min )
    DebiasedDTA: A Framework for Improving the Generalizability of Drug-Target Affinity Prediction Models. (arXiv:2107.05556v5 [q-bio.QM] UPDATED)
    Computational models that accurately predict the binding affinity of an input protein-chemical pair can accelerate drug discovery studies. These models are trained on available protein-chemical interaction datasets, which may contain dataset biases that can lead the model to learn dataset-specific patterns instead of generalizable relationships. As a result, the prediction performance of models drops for previously unseen biomolecules, i.e., the prediction models cannot generalize to biomolecules outside of the dataset. The latest approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading prediction performance. Here, we present DebiasedDTA, a novel drug-target affinity (DTA) prediction model training framework that addresses dataset biases to improve the generalizability of affinity prediction models. DebiasedDTA reweights the training samples to mitigate the effect of dataset biases and is applicable to most DTA prediction models. The results suggest that models trained in the DebiasedDTA framework can achieve improved generalizability in predicting the interactions of the previously unseen biomolecules, as well as performance improvements on those previously seen. Extensive experiments with different biomolecule representations, model architectures, and datasets demonstrate that DebiasedDTA can upgrade DTA prediction models irrespective of the biomolecule representation, model architecture, and training dataset. Last but not least, we release DebiasedDTA as an open-source Python library to enable other researchers to debias their own predictors and/or develop their own debiasing methods. We believe that this Python library will corroborate and foster research to develop more generalizable DTA prediction models.
    FlexShard: Flexible Sharding for Industry-Scale Sequence Recommendation Models. (arXiv:2301.02959v1 [cs.LG])
    Sequence-based deep learning recommendation models (DLRMs) are an emerging class of DLRMs showing great improvements over their prior sum-pooling based counterparts at capturing users' long term interests. These improvements come at immense system cost however, with sequence-based DLRMs requiring substantial amounts of data to be dynamically materialized and communicated by each accelerator during a single iteration. To address this rapidly growing bottleneck, we present FlexShard, a new tiered sequence embedding table sharding algorithm which operates at a per-row granularity by exploiting the insight that not every row is equal. Through precise replication of embedding rows based on their underlying probability distribution, along with the introduction of a new sharding strategy adapted to the heterogeneous, skewed performance of real-world cluster network topologies, FlexShard is able to significantly reduce communication demand while using no additional memory compared to the prior state-of-the-art. When evaluated on production-scale sequence DLRMs, FlexShard was able to reduce overall global all-to-all communication traffic by over 85%, resulting in end-to-end training communication latency improvements of almost 6x over the prior state-of-the-art approach.
    Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition. (arXiv:2301.02736v1 [eess.AS])
    Despite improvements to the generalization performance of automated speech recognition (ASR) models, specializing ASR models for downstream tasks remains challenging, primarily due to reduced data availability (necessitating increased data collection) and rapidly shifting data distributions (requiring more frequent model fine-tuning). In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods, to allow for flexible post-training adaptation to new data distributions. In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR via an approximate k-nearest-neighbor (KNN) based attentive fusion step. Our experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours while providing up to 3% WER improvement compared to a fine-tuning baseline, suggesting a promising approach for adapting production ASR systems in challenging zero and few-shot scenarios.  ( 2 min )
    The Optimal Input-Independent Baseline for Binary Classification: The Dutch Draw. (arXiv:2301.03318v1 [cs.LG])
    Before any binary classification model is put into practice, it is important to validate its performance on a proper test set. Without a frame of reference given by a baseline method, it is impossible to determine if a score is 'good' or 'bad'. The goal of this paper is to examine all baseline methods that are independent of feature values and determine which model is the 'best' and why. By identifying which baseline models are optimal, a crucial selection decision in the evaluation process is simplified. We prove that the recently proposed Dutch Draw baseline is the best input-independent classifier (independent of feature values) for all positional-invariant measures (independent of sequence order) assuming that the samples are randomly shuffled. This means that the Dutch Draw baseline is the optimal baseline under these intuitive requirements and should therefore be used in practice.  ( 2 min )
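To make the notion of an input-independent baseline concrete, here is a minimal sketch. It assumes the baseline labels a randomly drawn fraction `theta` of the shuffled samples as positive and that plain accuracy is the measure of interest; the function names and the grid over `theta` are illustrative choices, not taken from the paper.

```python
def expected_accuracy(theta, positives, total):
    """Expected accuracy when a random fraction theta of samples is labelled positive.

    Assumption: labels are assigned without looking at any feature values,
    so the expected true positives are theta * positives and the expected
    true negatives are (1 - theta) * negatives.
    """
    negatives = total - positives
    return (theta * positives + (1 - theta) * negatives) / total


def best_input_independent_accuracy(positives, total):
    """Search the grid theta = k / total for the best expected accuracy."""
    candidates = [k / total for k in range(total + 1)]
    best_theta = max(candidates, key=lambda t: expected_accuracy(t, positives, total))
    return best_theta, expected_accuracy(best_theta, positives, total)


if __name__ == "__main__":
    # Hypothetical test set: 30 positives out of 100 samples.
    print(best_input_independent_accuracy(30, 100))
```

For accuracy, the search ends up at one of the extremes (label everything negative or everything positive), which is the familiar majority-class baseline. Other measures would need their own expected-value formula.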
    PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks. (arXiv:2006.07794v2 [cs.LG] UPDATED)
    Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks.  ( 2 min )
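The block-level mixing described above can be pictured with a small sketch. It assumes a single-channel feature map stored as a nested list and a plain swap of one square block between two samples; the helper name `swap_block` and its parameters are hypothetical, and the abstract does not specify how blocks are selected or how labels are mixed.

```python
import copy


def swap_block(fmap_a, fmap_b, top, left, size):
    """Swap one contiguous size x size block between two 2D feature maps.

    Returns new maps and leaves the inputs untouched. This is only an
    illustration of mixing in feature space, not the authors' implementation.
    """
    out_a = copy.deepcopy(fmap_a)
    out_b = copy.deepcopy(fmap_b)
    for r in range(top, top + size):
        for c in range(left, left + size):
            out_a[r][c] = fmap_b[r][c]
            out_b[r][c] = fmap_a[r][c]
    return out_a, out_b


if __name__ == "__main__":
    ones = [[1, 1, 1] for _ in range(3)]
    zeros = [[0, 0, 0] for _ in range(3)]
    mixed_a, mixed_b = swap_block(ones, zeros, top=0, left=1, size=2)
    for row in mixed_a:
        print(row)
```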
    A Newton-CG based augmented Lagrangian method for finding a second-order stationary point of nonconvex equality constrained optimization with complexity guarantees. (arXiv:2301.03139v1 [math.OC])
    In this paper we consider finding a second-order stationary point (SOSP) of nonconvex equality constrained optimization when a nearly feasible point is known. In particular, we first propose a new Newton-CG method for finding an approximate SOSP of unconstrained optimization and show that it enjoys a substantially better complexity than the Newton-CG method [56]. We then propose a Newton-CG based augmented Lagrangian (AL) method for finding an approximate SOSP of nonconvex equality constrained optimization, in which the proposed Newton-CG method is used as a subproblem solver. We show that under a generalized linear independence constraint qualification (GLICQ), our AL method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-7/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of nonconvex equality constrained optimization with high probability, which are significantly better than the ones achieved by the proximal AL method [60]. Besides, we show that it has a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ when the GLICQ does not hold. To the best of our knowledge, all the complexity results obtained in this paper are new for finding an approximate SOSP of nonconvex equality constrained optimization with high probability. Preliminary numerical results also demonstrate the superiority of our proposed methods over the ones in [56,60].  ( 2 min )
    Fast and Correct Gradient-Based Optimisation for Probabilistic Programming via Smoothing. (arXiv:2301.03415v1 [cs.PL])
    We study the foundations of variational inference, which frames posterior inference as an optimisation problem, for probabilistic programming. The dominant approach for optimisation in practice is stochastic gradient descent. In particular, a variant using the so-called reparameterisation gradient estimator exhibits fast convergence in a traditional statistics setting. Unfortunately, discontinuities, which are readily expressible in programming languages, can compromise the correctness of this approach. We consider a simple (higher-order, probabilistic) programming language with conditionals, and we endow our language with both a measurable and a smoothed (approximate) value semantics. We present type systems which establish technical pre-conditions. Thus we can prove stochastic gradient descent with the reparameterisation gradient estimator to be correct when applied to the smoothed problem. Besides, we can solve the original problem up to any error tolerance by choosing an accuracy coefficient suitably. Empirically we demonstrate that our approach has a similar convergence as a key competitor, but is simpler, faster, and attains orders of magnitude reduction in work-normalised variance.
    XDQN: Inherently Interpretable DQN through Mimicking. (arXiv:2301.03043v1 [cs.LG])
    Although deep reinforcement learning (DRL) methods have been successfully applied in challenging tasks, their application in real-world operational settings is challenged by methods' limited ability to provide explanations. Among the paradigms for explainability in DRL is the interpretable box design paradigm, where interpretable models substitute inner constituent models of the DRL method, thus making the DRL method "inherently" interpretable. In this paper we explore this paradigm and we propose XDQN, an explainable variation of DQN, which uses an interpretable policy model trained through mimicking. XDQN is challenged in a complex, real-world operational multi-agent problem, where agents are independent learners solving congestion problems. Specifically, XDQN is evaluated in three MARL scenarios, pertaining to the demand-capacity balancing problem of air traffic management. XDQN achieves performance similar to that of DQN, while its abilities to provide global models' interpretations and interpretations of local decisions are demonstrated.  ( 2 min )
    Differentiable Simulations for Enhanced Sampling of Rare Events. (arXiv:2301.03480v1 [physics.chem-ph])
    We develop a novel approach to enhanced sampling of chemically reactive events using differentiable simulations. We merge the reaction path discovery and biasing potential computation into one end-to-end problem and solve it by path-integral optimization. The techniques developed contribute directly to the understanding and usability of differentiable simulations as we introduce new approaches and prove the stability properties of our method.
    Modeling Label Semantics Improves Activity Recognition. (arXiv:2301.03462v1 [cs.LG])
    Human activity recognition (HAR) aims to classify sensory time series into different activities, with wide applications in activity tracking, healthcare, human computer interaction, etc. Existing HAR works improve recognition performance by designing more complicated feature extraction methods, but they neglect the label semantics by simply treating labels as integer IDs. We find that many activities in the current HAR datasets have shared label names, e.g., "open door" and "open fridge", "walk upstairs" and "walk downstairs". Through some exploratory analysis, we find that such shared structure in activity names also maps to similarity in the input features. To this end, we design a sequence-to-sequence framework to decode the label name semantics rather than classifying labels as integer IDs. Our proposed method decomposes learning activities into learning shared tokens ("open", "walk"), which is easier than learning the joint distribution ("open fridge", "walk upstairs") and helps transfer learning to activities with insufficient data samples. For datasets originally without shared tokens in label names, we also offer an automated method, using OpenAI's ChatGPT, to generate shared actions and objects. Extensive experiments on seven HAR benchmark datasets demonstrate the state-of-the-art performance of our method. We also show better performance in the long-tail activity distribution settings and few-shot settings.
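The observation about shared label names can be illustrated in a few lines. This sketch only splits label names on whitespace and counts tokens that occur in more than one label; the function name `shared_tokens` is hypothetical and it is not the sequence-to-sequence model itself.

```python
from collections import Counter


def shared_tokens(label_names):
    """Return tokens that appear in more than one activity label name."""
    counts = Counter(
        token
        for name in label_names
        for token in set(name.split())  # count each token once per label
    )
    return sorted(token for token, n in counts.items() if n > 1)


if __name__ == "__main__":
    labels = ["open door", "open fridge", "walk upstairs", "walk downstairs"]
    print(shared_tokens(labels))
```

With the label names quoted in the abstract, the shared tokens are "open" and "walk", which are the units a decoder could learn once and reuse across activities.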
    Unsupervised Multivariate Time-Series Transformers for Seizure Identification on EEG. (arXiv:2301.03470v1 [eess.SP])
    Epilepsy is one of the most common neurological disorders, typically observed via seizure episodes. Epileptic seizures are commonly monitored through electroencephalogram (EEG) recordings due to their routine and low expense collection. The stochastic nature of EEG makes seizure identification via manual inspections performed by highly-trained experts a tedious endeavor, motivating the use of automated identification. The literature on automated identification focuses mostly on supervised learning methods requiring expert labels of EEG segments that contain seizures, which are difficult to obtain. Motivated by these observations, we pose seizure identification as an unsupervised anomaly detection problem. To this end, we employ the first unsupervised transformer-based model for seizure identification on raw EEG. We train an autoencoder involving a transformer encoder via an unsupervised loss function, incorporating a novel masking strategy uniquely designed for multivariate time-series data such as EEG. Training employs EEG recordings that do not contain any seizures, while seizures are identified with respect to reconstruction errors at inference time. We evaluate our method on three publicly available benchmark EEG datasets for distinguishing seizure vs. non-seizure windows. Our method leads to significantly better seizure identification performance than supervised learning counterparts, by up to 16% recall, 9% accuracy, and 9% Area under the Receiver Operating Characteristics Curve (AUC), establishing particular benefits on highly imbalanced data. Through accurate seizure identification, our method could facilitate widely accessible and early detection of epilepsy development, without needing expensive label collection or manual feature extraction.
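Identifying seizures from reconstruction errors amounts to thresholding an anomaly score. The sketch below assumes per-window reconstruction errors are already available and uses a mean-plus-k-standard-deviations threshold fitted on seizure-free data; that threshold rule, the function names, and the value of `k` are assumptions for illustration, since the abstract does not state how the decision is made.

```python
import statistics


def fit_threshold(normal_errors, k=3.0):
    """Derive a threshold from reconstruction errors of seizure-free windows.

    Assumption: a window is anomalous if its error exceeds mean + k * std
    of the errors observed on normal (non-seizure) data.
    """
    mean = statistics.fmean(normal_errors)
    std = statistics.pstdev(normal_errors)
    return mean + k * std


def flag_windows(errors, threshold):
    """Mark each window as a seizure candidate (True) or not (False)."""
    return [error > threshold for error in errors]


if __name__ == "__main__":
    # Hypothetical reconstruction errors, not data from the paper.
    normal = [0.10, 0.12, 0.11, 0.09, 0.10]
    threshold = fit_threshold(normal)
    print(flag_windows([0.11, 0.50, 0.13], threshold))
```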
    MAQA: A Multimodal QA Benchmark for Negation. (arXiv:2301.03238v1 [cs.CL])
    Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer-based LLMs often ignore negations in natural language, and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017) with the goal of systematically evaluating whether multimodal transformers can perform complex reasoning to recognize new concepts as negation of previously learned concepts. We show that with a standard fine-tuning approach, multimodal transformers are still incapable of correctly interpreting negation irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allows the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns, and the gains compared to a template-based task augmentation approach are significant.
    Deep Learning for Short-Latency Epileptic Seizure Detection with Probabilistic Classification. (arXiv:2301.03465v1 [eess.SP])
    In this manuscript, we propose a novel deep learning (DL)-based framework intended for obtaining short latency in real-time electroencephalogram-based epileptic seizure detection using multiscale 3D convolutional neural networks. We pioneer converting the seizure detection task from traditional binary classification of samples from ictal and interictal periods to probabilistic classification of samples from interictal, ictal, and crossing periods. We introduce a crossing period from seizure-oriented EEG recording and propose a labelling rule using soft labels for samples from the crossing period to build a probabilistic classification task. A novel multiscale short-time Fourier transform feature extraction method and a 3D convolutional neural network architecture are proposed to accurately capture predictive probabilities of samples. Furthermore, we propose a rectified weighting strategy to enhance predictive probabilities, and an accumulative decision-making rule to achieve short detection latency. We implement leave-one-seizure-out cross validation on two prevalent datasets -- CHB-MIT scalp EEG dataset and SWEC-ETHZ intracranial EEG dataset. On the CHB-MIT dataset, the proposed algorithm detected 94 out of 99 seizures during the crossing period, with an average rectified predictive ictal probability (RPIP) error of 14.84% on crossing samples, a detection latency of 2.3 s, and a false detection rate (FDR) of 0.32/h; on the SWEC-ETHZ dataset, it detected 84 out of 89 seizures, with 16.17% RPIP error, 4.7 s detection latency, and 0.75/h FDR. The obtained detection latencies are at least 50% faster than state-of-the-art results reported in previous studies.
    Safer Together: Machine Learning Models Trained on Shared Accident Datasets Predict Construction Injuries Better than Company-Specific Models. (arXiv:2301.03567v1 [cs.LG])
    In this study, we capitalized on a collective dataset repository of 57k accidents from 9 companies belonging to 3 domains and tested whether models trained on multiple datasets (generic models) predicted safety outcomes better than the company-specific models. We experimented with full generic models (trained on all data), per-domain generic models (construction, electric T&D, oil & gas), and with ensembles of generic and specific models. Results are very positive, with generic models outperforming the company-specific models in most cases while also generating finer-grained, hence more useful, forecasts. Successful generic models remove the need for training company-specific models, saving a lot of time and resources, and give small companies, whose accident datasets are too limited to train their own models, access to safety outcome predictions. It may, however, still be advantageous to train specific models to get an extra boost in performance through ensembling with the generic models. Overall, by learning lessons from a pool of datasets whose accumulated experience far exceeds that of any single company, and making these lessons easily accessible in the form of simple forecasts, generic models tackle the holy grail of safety cross-organizational learning and dissemination in the construction industry.
    Cursive Caption Text Detection in Videos. (arXiv:2301.03164v1 [cs.CV])
    Textual content appearing in videos represents an interesting index for semantic retrieval of videos (from archives), generation of alerts (live streams), as well as high-level applications like opinion mining and content summarization. One of the key components of such systems is the detection of textual content in video frames, which is the subject of the present study. This paper presents a robust technique for detection of textual content appearing in video frames. More specifically, we target text in cursive script, taking Urdu text as a case study. Detection of textual regions in video frames is carried out by fine-tuning object detectors based on deep convolutional neural networks for the specific case of text detection. Since it is common to have videos with caption text in multiple scripts, cursive text is distinguished from Latin text using a script-identification module. Finally, detection and script identification are combined in a single end-to-end trainable system. Experiments on a comprehensive dataset of around 11,000 video frames report an F-measure of 0.91.
    eFIN: Enhanced Fourier Imager Network for generalizable autofocusing and pixel super-resolution in holographic imaging. (arXiv:2301.03162v1 [physics.optics])
    The application of deep learning techniques has greatly enhanced holographic imaging capabilities, leading to improved phase recovery and image reconstruction. Here, we introduce a deep neural network termed enhanced Fourier Imager Network (eFIN) as a highly generalizable framework for hologram reconstruction with pixel super-resolution and image autofocusing. Through holographic microscopy experiments involving lung, prostate and salivary gland tissue sections and Papanicolaou (Pap) smears, we demonstrate that eFIN has a superior image reconstruction quality and exhibits external generalization to new types of samples never seen during the training phase. This network achieves a wide autofocusing axial range of 0.35 mm, with the capability to accurately predict the hologram axial distances by physics-informed learning. eFIN enables 3x pixel super-resolution imaging and increases the space-bandwidth product of the reconstructed images by 9-fold with almost no performance loss, which allows for significant time savings in holographic imaging and data processing steps. Our results showcase the advancements of eFIN in pushing the boundaries of holographic imaging for various applications such as quantitative phase imaging and label-free microscopy.  ( 2 min )
    KIDS: kinematics-based (in)activity detection and segmentation in a sleep case study. (arXiv:2301.03469v1 [eess.SP])
    Sleep behaviour and in-bed movements contain rich information on the neurophysiological health of people, and have a direct link to the general well-being and quality of life. Standard clinical practices rely on polysomnography for sleep assessment; however, it is intrusive, performed in unfamiliar environments and requires trained personnel. Progress has been made on less invasive sensor technologies, such as actigraphy, but clinical validation raises concerns over their reliability and precision. Additionally, the field lacks a widely acceptable algorithm, with proposed approaches ranging from raw signal or feature thresholding to data-hungry classification models, many of which are unfamiliar to medical staff. This paper proposes an online Bayesian probabilistic framework for objective (in)activity detection and segmentation based on clinically meaningful joint kinematics, measured by a custom-made wearable sensor. Intuitive three-dimensional visualisations of kinematic timeseries were accomplished through dimension reduction based preprocessing, offering out-of-the-box framework explainability potentially useful for clinical monitoring and diagnosis. The proposed framework attained up to 99.2% $F_1$-score and 0.96 Pearson's correlation coefficient in, respectively, the posture change detection and inactivity segmentation tasks. The work paves the way for a reliable home-based analysis of movements during sleep which would serve patient-centred longitudinal care plans.
    Network Slicing via Transfer Learning aided Distributed Deep Reinforcement Learning. (arXiv:2301.03262v1 [cs.NI])
    Deep reinforcement learning (DRL) has been increasingly employed to handle the dynamic and complex resource management in network slicing. The deployment of DRL policies in real networks, however, is complicated by heterogeneous cell conditions. In this paper, we propose a novel transfer learning (TL) aided multi-agent deep reinforcement learning (MADRL) approach with inter-agent similarity analysis for inter-cell inter-slice resource partitioning. First, we design a coordinated MADRL method with information sharing to intelligently partition resources among slices and manage inter-cell interference. Second, we propose an integrated TL method to transfer the learned DRL policies among different local agents for accelerating the policy deployment. The method is composed of a new domain and task similarity measurement approach and a new knowledge transfer approach, which together resolve the questions of from whom to transfer and how to transfer. We evaluate the proposed solution with extensive simulations in a system-level simulator and show that our approach outperforms the state-of-the-art solutions in terms of performance, convergence speed and sample efficiency. Moreover, by applying TL, we achieve an additional gain of over 27% compared with the coordinated MADRL approach without TL.
    On the challenges to learn from Natural Data Streams. (arXiv:2301.03495v1 [cs.CV])
    In real-world contexts, data are sometimes available in the form of Natural Data Streams, i.e. data characterized by a streaming nature, unbalanced distribution, data drift over a long time frame and strong correlation of samples in short time ranges. Moreover, a clear separation between the traditional training and deployment phases is usually lacking. This data organization and fruition represents an interesting and challenging scenario for both traditional Machine and Deep Learning algorithms and incremental learning agents, i.e. agents that have the ability to incrementally improve their knowledge through past experience. In this paper, we investigate the classification performance of a variety of algorithms that belong to various research fields, i.e. Continual, Streaming and Online Learning, and that receive Natural Data Streams as training input. The experimental validation is carried out on three different datasets, expressly organized to replicate this challenging setting.
    Deep Learning for Mean Field Games with non-separable Hamiltonians. (arXiv:2301.02877v1 [cs.LG])
    This paper introduces a new method based on Deep Galerkin Methods (DGMs) for solving high-dimensional stochastic Mean Field Games (MFGs). We achieve this by using two neural networks to approximate the unknown solutions of the MFG system and forward-backward conditions. Our method is efficient, even with a small number of iterations, and is capable of handling up to 300 dimensions with a single layer, which makes it faster than other approaches. In contrast, methods based on Generative Adversarial Networks (GANs) cannot solve MFGs with non-separable Hamiltonians. We demonstrate the effectiveness of our approach by applying it to a traffic flow problem, which was previously solved using the Newton iteration method only in the deterministic case. We compare the results of our method to analytical solutions and previous approaches, showing its efficiency. We also prove the convergence of our neural network approximation with a single hidden layer using the universal approximation theorem.  ( 2 min )
    Stochastic Langevin Monte Carlo for (weakly) log-concave posterior distributions. (arXiv:2301.03077v1 [stat.ML])
    In this paper, we investigate a continuous time version of the Stochastic Langevin Monte Carlo method, introduced in [WT11], that incorporates a stochastic sampling step inside the traditional over-damped Langevin diffusion. This method is popular in machine learning for sampling posterior distributions. We pay specific attention in our work to the computational cost in terms of $n$ (the number of observations that produces the posterior distribution), and $d$ (the dimension of the ambient space where the parameter of interest is living). We derive our analysis in the weakly convex framework, which is parameterized with the help of the Kurdyka-Łojasiewicz (KL) inequality, which permits handling vanishing-curvature settings and is far less restrictive than the simple strongly convex case. We establish that the final horizon of simulation to obtain an $\varepsilon$ approximation (in terms of entropy) is of the order $( d \log(n)^2 )^{(1+r)^2} [\log^2(\varepsilon^{-1}) + n^2 d^{2(1+r)} \log^{4(1+r)}(n) ]$ with a Poissonian subsampling of parameter $\left(n ( d \log^2(n))^{1+r}\right)^{-1}$, where the parameter $r$ is involved in the KL inequality and varies between $0$ (strongly convex case) and $1$ (limiting Laplace situation).  ( 2 min )
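    As a rough illustration of the idea behind stochastic Langevin sampling, the sketch below runs the familiar discrete-time update (a minibatch gradient step plus injected Gaussian noise) on a toy problem. It is not the continuous-time scheme analysed in the abstract: the Gaussian-mean model with flat prior, the step size, the fixed-size minibatches (instead of Poissonian subsampling) and the function name `sgld_gaussian_mean` are all assumptions made for illustration.

```python
import math
import random


def sgld_gaussian_mean(data, step=1e-3, batch_size=20, n_steps=6000, burn_in=1000, seed=0):
    """Sample the posterior of a Gaussian mean (unit variance, flat prior).

    Toy discrete-time stochastic gradient Langevin update; all defaults are
    illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(data)
    theta = 0.0
    samples = []
    for t in range(n_steps):
        batch = rng.sample(data, batch_size)
        # Unbiased minibatch estimate of the gradient of the log-posterior.
        grad = (n / batch_size) * sum(x - theta for x in batch)
        # Langevin update: half a step along the gradient plus Gaussian noise.
        theta += 0.5 * step * grad + math.sqrt(step) * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            samples.append(theta)
    return samples


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
samples = sgld_gaussian_mean(data)
posterior_mean = sum(samples) / len(samples)
data_mean = sum(data) / len(data)
```

    In this toy setting the exact posterior is centred on the sample mean of the data, so `posterior_mean` should land close to `data_mean`; each step touches only `batch_size` of the $n$ observations, which is the source of the cost savings in $n$.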
    LS-DYNA Machine Learning-based Multiscale Method for Nonlinear Modeling of Short Fiber-Reinforced Composites. (arXiv:2301.02738v1 [cs.CE])
    Short-fiber-reinforced composites (SFRC) are high-performance engineering materials for lightweight structural applications in the automotive and electronics industries. Typically, SFRC structures are manufactured by injection molding, which induces heterogeneous microstructures, and the resulting nonlinear anisotropic behaviors are challenging to predict by conventional micromechanical analyses. In this work, we present a machine learning-based multiscale method by integrating injection molding-induced microstructures, material homogenization, and Deep Material Network (DMN) in the finite element simulation software LS-DYNA for structural analysis of SFRC. DMN is a physics-embedded machine learning model that learns the microscale material morphologies hidden in representative volume elements of composites through offline training. By coupling DMN with finite elements, we have developed a highly accurate and efficient data-driven approach, which predicts nonlinear behaviors of composite materials and structures at a computational speed orders-of-magnitude faster than the high-fidelity direct numerical simulation. To model industrial-scale SFRC products, transfer learning is utilized to generate a unified DMN database, which effectively captures the effects of injection molding-induced fiber orientations and volume fractions on the overall composite properties. Numerical examples are presented to demonstrate the promising performance of this LS-DYNA machine learning-based multiscale method for SFRC modeling.  ( 2 min )
    GAN-Based Content Generation of Maps for Strategy Games. (arXiv:2301.02874v1 [cs.LG])
    Maps are a very important component of strategy games, and creating them by hand is a time-consuming task. Maps generated by traditional PCG techniques such as Perlin noise or tile-based PCG techniques look unnatural and unappealing, thus not providing the best user experience for the players. However, it is possible to have a generator that can create realistic and natural images of maps, given that it is trained to do so. We propose a model for the generation of maps based on Generative Adversarial Networks (GAN). In our implementation we tested different variants of GAN-based networks on a dataset of heightmaps. We conducted extensive empirical evaluation to determine the advantages and properties of each approach. The results obtained are promising, showing that it is indeed possible to generate realistic looking maps using this type of approach.  ( 2 min )
    Sublinear Time Algorithms for Several Geometric Optimization (With Outliers) Problems In Machine Learning. (arXiv:2301.02870v1 [cs.DS])
    In this paper, we study several important geometric optimization problems arising in machine learning. First, we revisit the Minimum Enclosing Ball (MEB) problem in Euclidean space $\mathbb{R}^d$. The problem has been extensively studied before, but real-world machine learning tasks often need to handle datasets so large that even linear time algorithms are unaffordable. Motivated by the recent studies on beyond worst-case analysis, we introduce the notion of stability for MEB, which is natural and easy to understand. Roughly speaking, an instance of MEB is stable if the radius of the resulting ball cannot be significantly reduced by removing a small fraction of the input points. Under the stability assumption, we present two sampling algorithms for computing radius-approximate MEB with sample complexities independent of the number of input points $n$. In particular, the sample complexity of the second algorithm is also independent of the dimensionality $d$. We also consider the general case without the stability assumption. We present a hybrid algorithm that can output either a radius-approximate MEB or a covering-approximate MEB. Our algorithm improves the running time and the number of passes for the previous sublinear MEB algorithms. Our method relies on two novel techniques, the Uniform-Adaptive Sampling method and Sandwich Lemma. Furthermore, we observe that these two techniques can be generalized to design sublinear time algorithms for a broader range of geometric optimization problems with outliers in high dimensions, including MEB with outliers, one-class and two-class linear SVMs with outliers, $k$-center clustering with outliers, and flat fitting with outliers. Our proposed algorithms also work for kernels.  ( 2 min )
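    To make the "sample, then solve on the sample" idea concrete, here is a minimal sketch that draws a uniform random sample and runs the classic farthest-point iteration of Bădoiu and Clarkson on it. This is a generic stand-in, not the paper's Uniform-Adaptive Sampling method or its guarantees; the function name `approx_meb`, the sample size and the iteration count are assumptions chosen for illustration.

```python
import math
import random


def approx_meb(points, sample_size=64, iterations=1000, seed=0):
    """Approximate a minimum enclosing ball from a uniform random sample.

    Illustrative stand-in only: the returned ball encloses the sample,
    not necessarily every input point.
    """
    rng = random.Random(seed)
    sample = points if len(points) <= sample_size else rng.sample(points, sample_size)
    center = list(sample[0])
    for i in range(1, iterations + 1):
        # Move the center a shrinking step toward the farthest sampled point.
        far = max(sample, key=lambda p: math.dist(p, center))
        center = [c + (f - c) / (i + 1) for c, f in zip(center, far)]
    radius = max(math.dist(p, center) for p in sample)
    return center, radius


square = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
center, radius = approx_meb(square)
```

    The work per iteration depends on `sample_size` rather than on the number of input points, which is the property that makes sampling attractive; how large the sample must be for a guaranteed approximation is exactly what the stability analysis in the abstract addresses.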
    A Survey on Transformers in Reinforcement Learning. (arXiv:2301.03044v1 [cs.LG])
    Transformer has been considered the dominating neural architecture in NLP and CV, mostly under a supervised setting. Recently, a similar surge of using Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. Hence, in this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy on existing works, discuss each sub-field, and summarize future prospects.
    Self-Supervised Time-to-Event Modeling with Structured Medical Records. (arXiv:2301.03150v1 [cs.LG])
    Time-to-event models (also known as survival models) are used in medicine and other fields for estimating the probability distribution of the time until a particular event occurs. While providing many advantages over traditional classification models, such as naturally handling censoring, time-to-event models require more parameters and are challenging to learn in settings with limited labeled training data. High censoring rates, common in events with long time horizons, further limit available training data and exacerbate the risk of overfitting. Existing methods, such as proportional hazard or accelerated failure time-based approaches, employ distributional assumptions to reduce parameter size, but they are vulnerable to model misspecification. In this work, we address these challenges with MOTOR, a self-supervised model that leverages temporal structure found in large-scale collections of timestamped, but largely unlabeled events, typical of electronic health record data. MOTOR defines a time-to-event pretraining task that naturally captures the probability distribution of event times, making it well-suited to applications in medicine. After pretraining on 8,192 tasks auto-generated from 2.7M patients (2.4B clinical events), we evaluate the performance of our pretrained model after fine-tuning to unseen time-to-event tasks. MOTOR-derived models improve upon current state-of-the-art C statistic performance by 6.6% and decrease training time (in wall time) by up to 8.2 times. We further improve sample efficiency, with adapted models matching current state-of-the-art performance using 95% less training data.
    AI2: The next leap toward native language based and explainable machine learning framework. (arXiv:2301.03391v1 [cs.LG])
    Machine learning frameworks have flourished in recent decades, allowing artificial intelligence to move out of academic circles and into enterprise domains. The field has advanced significantly, but meaningful improvements are still needed to meet growing expectations. The proposed framework, named AI$^{2}$, uses a natural language interface that allows a non-specialist to benefit from machine learning algorithms without necessarily knowing how to program. The primary contribution of the AI$^{2}$ framework is that it allows a user to call machine learning algorithms in English, making its interface easier to use. The second contribution is greenhouse gas (GHG) awareness: the framework includes strategies to evaluate the GHG generated by the algorithm to be called and to propose alternatives that find a solution without executing the energy-intensive algorithm. Another contribution is a preprocessing module that helps to describe and load data properly. Using an English text-based chatbot, this module guides the user to define every dataset so that it can be described, normalized, loaded and divided appropriately. The last contribution of this paper concerns explainability. For decades, the scientific community has known that machine learning algorithms suffer from the well-known black-box problem. Traditional machine learning methods convert an input into an output without being able to justify the result. The proposed framework explains the algorithm's process with appropriate texts, graphics and tables. The results, presented across five cases, show usage applications from the user's English command to the explained output. Ultimately, the AI$^{2}$ framework represents the next leap toward a native language-based machine learning framework that addresses human-oriented concerns.
    Fair Multi-Exit Framework for Facial Attribute Classification. (arXiv:2301.02989v1 [cs.CV])
    Fairness has become increasingly pivotal in facial recognition. Without bias mitigation, deploying unfair AI would harm the interests of the underprivileged population. In this paper, we observe that although features from the deeper layers of a neural network generally offer higher accuracy, fairness conditions deteriorate as we extract features from deeper layers. This phenomenon motivates us to extend the concept of the multi-exit framework. Unlike existing works that mainly focus on accuracy, our multi-exit framework is fairness-oriented, where the internal classifiers are trained to be more accurate and fairer. During inference, any instance with high confidence from an internal classifier is allowed to exit early. Moreover, our framework can be applied to most existing fairness-aware frameworks. Experimental results show that the proposed framework can largely improve the fairness condition over the state-of-the-art on the CelebA and UTK Face datasets.
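    The early-exit rule described above can be sketched in a few lines. This is a minimal illustration of confidence-thresholded exiting only; the function name `early_exit_predict`, the 0.9 threshold and the use of the maximum class probability as the confidence score are assumptions, and the fairness-oriented training of the internal classifiers is not modelled.

```python
def early_exit_predict(exit_probs, threshold=0.9):
    """Return (exit_index, predicted_class) for one instance.

    exit_probs: per-exit class-probability lists, ordered shallow to deep.
    The threshold value is a hypothetical choice for illustration.
    """
    for idx, probs in enumerate(exit_probs):
        confidence = max(probs)
        # Leave at the first internal classifier that is confident enough.
        if confidence >= threshold:
            return idx, probs.index(confidence)
    # No exit was confident: fall back to the final (deepest) classifier.
    final = exit_probs[-1]
    return len(exit_probs) - 1, final.index(max(final))


easy = [[0.95, 0.05], [0.99, 0.01], [0.99, 0.01]]
hard = [[0.55, 0.45], [0.40, 0.60], [0.30, 0.70]]
easy_result = early_exit_predict(easy)
hard_result = early_exit_predict(hard)
```

    Here the `easy` instance leaves at the first exit, while the `hard` instance never reaches the threshold and is classified by the last exit.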
    Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets. (arXiv:2301.03373v1 [cs.LG])
    The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology, including email, internet browsing, robotics, and malicious software. The original contribution of this study was to examine half of the most significant code advances in the last 50 years and, in some cases, to provide notable improvements in clarity or performance. The AI-driven code assistant could provide insights into obfuscated code or software lacking explanatory commentary in all cases examined. We generated additional sample problems based on bug corrections and code optimizations requiring much deeper reasoning than a traditional Google search might provide. Future work focuses on adding automated documentation and code commentary and translating select large code bases into more modern versions with multiple new application programming interfaces (APIs) and chained multi-tasks. The AI-driven code assistant offers a valuable tool for software engineering, particularly in its ability to provide human-level expertise and assist in refactoring legacy code or simplifying the explanation or functionality of high-value repositories.
    CaSpeR: Latent Spectral Regularization for Continual Learning. (arXiv:2301.03345v1 [cs.LG])
    While biological intelligence grows organically as new knowledge is gathered throughout life, Artificial Neural Networks forget catastrophically whenever they face a changing training data distribution. Rehearsal-based Continual Learning (CL) approaches have been established as a versatile and reliable solution to overcome this limitation; however, sudden input disruptions and memory constraints are known to alter the consistency of their predictions. We study this phenomenon by investigating the geometric characteristics of the learner's latent space and find that replayed data points of different classes increasingly mix up, interfering with classification. Hence, we propose a geometric regularizer that enforces weak requirements on the Laplacian spectrum of the latent space, promoting a partitioning behavior. We show that our proposal, called Continual Spectral Regularizer (CaSpeR), can be easily combined with any rehearsal-based CL approach and improves the performance of SOTA methods on standard benchmarks. Finally, we conduct additional analysis to provide insights into CaSpeR's effects and applicability.
    Neighbor Auto-Grouping Graph Neural Networks for Handover Parameter Configuration in Cellular Network. (arXiv:2301.03412v1 [cs.NI])
    The mobile communication enabled by cellular networks is one of the main foundations of our modern society. Optimizing the performance of cellular networks and providing massive connectivity with improved coverage and user experience has a considerable social and economic impact on our daily life. This performance relies heavily on the configuration of the network parameters. However, with the massive increase in both the size and complexity of cellular networks, network management, especially parameter configuration, is becoming complicated. The current practice, which relies largely on experts' prior knowledge, is not adequate and will require many domain experts and high maintenance costs. In this work, we propose a learning-based framework for handover parameter configuration. The key challenge, in this case, is to tackle the complicated dependencies between neighboring cells and jointly optimize the whole network. Our framework addresses this challenge in two ways. First, we introduce a novel approach to imitate how the network responds to different network states and parameter values, called auto-grouping graph convolutional network (AG-GCN). Second, during the parameter configuration stage, instead of solving the global optimization problem, we design a local multi-objective optimization strategy where each cell considers several local performance metrics to balance its own performance and that of its neighbors. We evaluate our proposed algorithm via a simulator constructed using real network data. We demonstrate that the handover parameters found by our model achieve better average network throughput than those recommended by experts as well as alternative baselines, which can bring better network quality and stability. The approach has the potential to massively reduce costs arising from human expert intervention and maintenance.
    Towards an AI-enabled Connected Industry: AGV Communication and Sensor Measurement Datasets. (arXiv:2301.03364v1 [cs.NI])
    This paper presents two wireless measurement campaigns in industrial testbeds: industrial Vehicle-to-vehicle (iV2V) and industrial Vehicle-to-infrastructure plus Sensor (iV2I+). Detailed information about the two captured datasets is provided as well. iV2V covers sidelink communication scenarios between Automated Guided Vehicles (AGVs), while iV2I+ is conducted at an industrial setting where an autonomous cleaning robot is connected to a private cellular network. The combination of different communication technologies, together with a common measurement methodology, provides insights that can be exploited by Machine Learning (ML) for tasks such as fingerprinting, line-of-sight detection, prediction of quality of service or link selection. Moreover, the datasets are labelled and pre-filtered for fast on-boarding and applicability. The corresponding testbeds and measurements are also presented in detail for both datasets.
    Fully Dynamic Online Selection through Online Contention Resolution Schemes. (arXiv:2301.03099v1 [cs.AI])
    We study fully dynamic online selection problems in an adversarial/stochastic setting that includes Bayesian online selection, prophet inequalities, posted price mechanisms, and stochastic probing problems subject to combinatorial constraints. In the classical "incremental" version of the problem, selected elements remain active until the end of the input sequence. On the other hand, in the fully dynamic version of the problem, elements stay active for a limited time interval, and then leave. This models, for example, the online matching of tasks to workers with task/worker-dependent working times, and sequential posted pricing of perishable goods. A successful approach to online selection problems in the adversarial setting is given by the notion of Online Contention Resolution Scheme (OCRS), which uses a priori information to formulate a linear relaxation of the underlying optimization problem, whose optimal fractional solution is rounded online for any adversarial order of the input sequence. Our main contribution is providing a general method for constructing an OCRS for fully dynamic online selection problems. Then, we show how to employ such OCRS to construct no-regret algorithms in a partial information model with semi-bandit feedback and adversarial inputs.
    Emotion Recognition from Microblog Managing Emoticon with Text and Classifying using 1D CNN. (arXiv:2301.02971v1 [cs.LG])
    Microblogs, an online broadcast medium, are a widely used forum for people to share their thoughts and opinions. Recently, Emotion Recognition (ER) from microblogs has become an inspiring research topic in diverse areas. In the machine learning domain, automatic emotion recognition from microblogs is a challenging task, especially when better outcomes are sought across diverse content. Emoticons have become very common in microblog text because they reinforce the meaning of the content. This study proposes an emotion recognition scheme considering both the texts and emoticons from microblog data. Emoticons are considered unique expressions of the users' emotions and can be replaced by the appropriate emotional words. The order of emoticons appearing in the microblog data is preserved and a 1D Convolutional Neural Network (CNN) is employed for emotion classification. The experimental results show that the proposed emotion recognition scheme outperforms other existing methods when tested on Twitter data.
    Learning Optimal Phase-Shifts of Holographic Metasurface Transceivers. (arXiv:2301.03371v1 [eess.SP])
    Holographic metasurface transceivers (HMTs) are an emerging technology for enhancing the coverage and rate of wireless communication systems. However, acquiring accurate channel state information in HMT-assisted wireless communication systems is critical for achieving these goals. In this paper, we propose an algorithm for learning the optimal phase-shifts at an HMT for the far-field channel model. Our proposed algorithm exploits the structure of the channel gains in the far-field regions and learns the optimal phase-shifts in the presence of noise in the received signals. We prove that the probability that the optimal phase-shifts estimated by our proposed algorithm deviate from the true values decays exponentially in the number of pilot signals. Extensive numerical simulations validate the theoretical guarantees and also demonstrate significant gains as compared to the state-of-the-art policies.
    Machine Learning for Large-Scale Optimization in 6G Wireless Networks. (arXiv:2301.03377v1 [eess.SP])
    The sixth generation (6G) wireless systems are envisioned to enable the paradigm shift from "connected things" to "connected intelligence", characterized by ultra high density, large scale, dynamic heterogeneity, diversified functional requirements and machine learning capabilities, which leads to a growing need for highly efficient intelligent algorithms. The classic optimization-based algorithms usually require highly precise mathematical models of data links and suffer from poor performance with high computational cost in realistic 6G applications. Based on domain knowledge (e.g., optimization models and theoretical tools), machine learning (ML) stands out as a promising and viable methodology for many complex large-scale optimization problems in 6G, due to its superior performance, generalizability, computational efficiency and robustness. In this paper, we systematically review the most representative "learning to optimize" techniques in diverse domains of 6G wireless networks by identifying the inherent features of the underlying optimization problems and investigating the specifically designed ML frameworks from the perspective of optimization. In particular, we cover algorithm unrolling, learning to branch-and-bound, graph neural networks for structured optimization, deep reinforcement learning for stochastic optimization, end-to-end learning for semantic optimization, as well as federated learning for distributed optimization, for solving challenging large-scale optimization problems arising from various important wireless applications. Through the in-depth discussion, we shed light on the excellent performance of ML-based optimization algorithms with respect to the classical methods, and provide insightful guidance to develop advanced ML techniques in 6G networks.
    A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding. (arXiv:2301.03403v1 [cs.CL])
    We provide a literature review about Automatic Text Summarization (ATS) systems. We consider a citation-based approach. We start with some popular and well-known papers that we have in hand about each topic we want to cover and we have tracked the "backward citations" (papers that are cited by the set of papers we knew beforehand) and the "forward citations" (newer papers that cite the set of papers we knew beforehand). In order to organize the different methods, we present the diverse approaches to ATS guided by the mechanisms they use to generate a summary. Besides presenting the methods, we also present an extensive review of the datasets available for summarization tasks and the methods used to evaluate the quality of the summaries. Finally, we present an empirical exploration of these methods using the CNN Corpus dataset that provides golden summaries for extractive and abstractive methods.
    Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology. (arXiv:2301.02916v1 [q-bio.GN])
    Recent genome-wide association studies (GWAS) have been successful in identifying associations between genetic variants and simple cardiac parameters derived from cardiac magnetic resonance (CMR) images. However, the emergence of big databases including genetic data linked to CMR facilitates investigation of more nuanced patterns of shape variability. Here, we propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles (UPE). UPE builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner, using deep learning models trained with different hyperparameters. These phenotypes are then analyzed via GWAS, retaining only highly confident and stable associations across the ensemble. We apply our approach to the UK Biobank database to extract left-ventricular (LV) geometric features from image-derived three-dimensional meshes. We demonstrate that our approach greatly improves the discoverability of genes influencing LV shape, identifying 11 loci with study-wide significance and 8 with suggestive significance. We argue that our approach would enable more extensive discovery of gene associations with image-derived phenotypes for other organs or image modalities.
    Finding Lookalike Customers for E-Commerce Marketing. (arXiv:2301.03147v1 [cs.LG])
    Customer-centric marketing campaigns generate a large portion of e-commerce website traffic for Walmart. As the scale of customer data grows larger, expanding the marketing audience to reach more customers is becoming more critical for e-commerce companies to drive business growth and bring more value to customers. In this paper, we present a scalable and efficient system to expand the targeted audience of marketing campaigns, which can handle hundreds of millions of customers. We use a deep learning based embedding model to represent customers and an approximate nearest neighbor search method to quickly find lookalike customers of interest. The model can deal with various business interests by constructing interpretable and meaningful customer similarity metrics. We conduct extensive experiments to demonstrate the strong performance of our system and customer embedding model.
    "It's a Match!" -- A Benchmark of Task Affinity Scores for Joint Learning. (arXiv:2301.02873v1 [cs.LG])
    While the promises of Multi-Task Learning (MTL) are attractive, characterizing the conditions of its success is still an open problem in Deep Learning. Some tasks may benefit from being learned together while others may be detrimental to one another. From a task perspective, grouping cooperative tasks while separating competing tasks is paramount to reap the benefits of MTL, i.e., reducing training and inference costs. Therefore, estimating task affinity for joint learning is a key endeavor. Recent work suggests that the training conditions themselves have a significant impact on the outcomes of MTL. Yet, the literature lacks a benchmark to assess the effectiveness of task affinity estimation techniques and their relation with actual MTL performance. In this paper, we take a first step in closing this gap by (i) defining a set of affinity scores by both revisiting contributions from previous literature as well as presenting new ones and (ii) benchmarking them on the Taskonomy dataset. Our empirical campaign reveals how, even in a small-scale scenario, task affinity scoring does not correlate well with actual MTL performance. Yet, some metrics can be more indicative than others.  ( 2 min )
    Attention-LSTM for Multivariate Traffic State Prediction on Rural Roads. (arXiv:2301.02731v1 [cs.LG])
    Accurate traffic volume and speed prediction has a wide range of applications in transportation. It can provide useful and timely information for both travellers and transportation decision-makers. In this study, an Attention-based Long Short-Term Memory model (A-LSTM) is proposed to simultaneously predict traffic volume and speed on a critical rural road segment that connects Tehran to Chalus, the most popular tourist destination city in Iran. Moreover, this study compares the results of the A-LSTM model with the Long Short-Term Memory (LSTM) model. Both models show acceptable performance in predicting speed and flow. However, the A-LSTM model outperforms the LSTM in 5 and 15-minute intervals. In contrast, there is no meaningful difference between the two models for the 30-minute time interval. By comparing the performance of the models based on different time horizons, the 15-minute horizon model outperforms the others by reaching the lowest Mean Square Error (MSE) loss of 0.0032, followed by the 30 and 5-minute horizons with 0.004 and 0.0051, respectively. In addition, this study compares the results of the models based on two transformations of temporal categorical input variables, one-hot or cyclic, for the 15-minute time interval. The results demonstrate that both LSTM and A-LSTM with cyclic feature encoding outperform those with one-hot feature encoding.  ( 2 min )
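    The difference between the two encodings of temporal categorical variables can be seen on a small example. The sketch below encodes the hour of day both ways and compares distances between encodings; the helper names `one_hot` and `cyclic` and the choice of hour-of-day with period 24 are assumptions for illustration, since the abstract does not list the exact variables used.

```python
import math


def one_hot(value, period):
    """One-hot encoding: a length-`period` vector with a single 1."""
    vec = [0.0] * period
    vec[value % period] = 1.0
    return vec


def cyclic(value, period):
    """Cyclic encoding: map the value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return [math.sin(angle), math.cos(angle)]


# Distance between 23:00 and 00:00 (adjacent hours) versus 12:00 and 00:00.
d_cyc_adjacent = math.dist(cyclic(23, 24), cyclic(0, 24))
d_cyc_opposite = math.dist(cyclic(12, 24), cyclic(0, 24))
d_hot_adjacent = math.dist(one_hot(23, 24), one_hot(0, 24))
d_hot_opposite = math.dist(one_hot(12, 24), one_hot(0, 24))
```

    Under the cyclic encoding, 23:00 sits close to 00:00 and far from 12:00, whereas one-hot vectors place every pair of distinct hours at the same distance. That preserved wrap-around structure is one plausible reason a model might benefit from cyclic features, although the abstract itself reports only the empirical comparison.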
    Optimistic Meta-Gradients. (arXiv:2301.03236v1 [cs.LG])
    We study the connection between gradient-based meta-learning and convex optimisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to a constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through Bootstrapped Meta-Gradients (Flennerhag et al., 2022), providing deeper insight into its underlying mechanics.
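    For readers who want the baseline update in front of them, the following is a minimal sketch of gradient descent with (heavy-ball) momentum on a one-dimensional objective. It only illustrates the momentum update the abstract refers to; it does not implement meta-gradients or the paper's analysis, and the function name, step size, and momentum coefficient are assumptions chosen for illustration.

```python
def gd_momentum(grad, x0, lr=0.1, beta=0.9, steps=500):
    """Heavy-ball momentum: v <- beta * v - lr * grad(x); x <- x + v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v - lr * grad(x)  # decayed velocity plus a gradient step
        x = x + v                    # move along the velocity
    return x

if __name__ == "__main__":
    # Hypothetical objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
    print(gd_momentum(lambda x: 2.0 * (x - 3.0), x0=0.0))
```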
    Deep Learning in Random Neural Fields: Numerical Experiments via Neural Tangent Kernel. (arXiv:2202.05254v2 [cs.LG] UPDATED)
    A biological neural network in the cortex forms a neural field. Neurons in the field have their own receptive fields, and connection weights between two neurons are random but highly correlated when they are in close proximity in receptive fields. In this paper, we investigate such neural fields in a multilayer architecture to investigate the supervised learning of the fields. We empirically compare the performances of our field model with those of randomly connected deep networks. The behavior of a randomly connected network is investigated on the basis of the key idea of the neural tangent kernel regime, a recent development in the machine learning theory of over-parameterized networks; for most randomly connected neural networks, it is shown that global minima always exist in their small neighborhoods. We numerically show that this claim also holds for our neural fields. In more detail, our model has two structures: i) each neuron in a field has a continuously distributed receptive field, and ii) the initial connection weights are random but not independent, having correlations when the positions of neurons are close in each layer. We show that such a multilayer neural field is more robust than conventional models when input patterns are deformed by noise disturbances. Moreover, its generalization ability can be slightly superior to that of conventional models.
    Introducing Model Inversion Attacks on Automatic Speaker Recognition. (arXiv:2301.03206v1 [cs.SD])
    Model inversion (MI) attacks make it possible to reconstruct average per-class representations of a machine learning (ML) model's training data. It has been shown that in scenarios where each class corresponds to a different individual, such as face classifiers, this represents a severe privacy risk. In this work, we explore a new application for MI: the extraction of speakers' voices from a speaker recognition system. We present an approach to (1) reconstruct audio samples from a trained ML model and (2) extract intermediate voice feature representations which provide valuable insights into the speakers' biometrics. To this end, we propose an extension of MI attacks which we call sliding model inversion. Our sliding MI extends standard MI by iteratively inverting overlapping chunks of the audio samples, thereby leveraging the sequential properties of audio data for enhanced inversion performance. We show that one can use the inverted audio data to generate spoofed audio samples to impersonate a speaker and execute voice-protected commands for highly secured systems on their behalf. To the best of our knowledge, our work is the first to extend MI attacks to audio data, and our results highlight the security risks resulting from the extraction of biometric data in that setup.
    Randomized Greedy Algorithms and Composable Coreset for k-Center Clustering with Outliers. (arXiv:2301.02814v1 [cs.LG])
    In this paper, we study the problem of $k$-center clustering with outliers. The problem has many important real-world applications, but the presence of outliers can significantly increase the computational complexity. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality-guaranteed algorithms with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez's algorithm, that was developed for solving the ordinary $k$-center clustering problem. Based on some novel observations, we show that a simple randomized version of this greedy strategy can actually handle outliers efficiently. We further show that this randomized greedy approach also yields a small coreset for the problem in doubling metrics (even if the doubling dimension is not given), which can greatly reduce the computational complexity. Moreover, together with the partial clustering framework proposed in arXiv:1703.01539, we prove that our coreset method can be applied to distributed data with low communication complexity. The experimental results suggest that our algorithms can achieve near-optimal solutions and yield lower complexities compared with the existing methods.  ( 2 min )
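    As background for the abstract above, here is a minimal sketch of the classical Gonzalez farthest-point greedy for ordinary $k$-center (without outliers), which is known to give a 2-approximation. It is not the paper's randomized variant or its coreset construction; the function name, the `first` parameter, and the sample points are illustrative assumptions, and the sketch assumes `k` does not exceed the number of points.

```python
import math

def gonzalez_k_center(points, k, first=0):
    """Classical farthest-point greedy for k-center (no outlier handling).

    Returns the indices of the chosen centers and the resulting radius,
    i.e. the largest distance from any point to its nearest center.
    """
    centers = [first]
    # dist[i] is the distance from point i to its nearest chosen center.
    dist = [math.dist(p, points[first]) for p in points]
    while len(centers) < k:
        # Pick the point that is currently farthest from all chosen centers.
        nxt = max(range(len(points)), key=lambda i: dist[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return centers, max(dist)

if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (10, 0), (11, 0)]
    print(gonzalez_k_center(pts, k=2))
    # Adding a single far-away point changes which centers are picked.
    print(gonzalez_k_center(pts + [(100, 0)], k=2))
```

    Because the greedy always picks the farthest remaining point, a single outlier is selected as a center early, which is the sensitivity that motivates outlier-aware variants.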
    Markov Chain Concentration with an Application in Reinforcement Learning. (arXiv:2301.02926v1 [cs.LG])
    Given random variables $X_1,\dots,X_N$ whose joint distribution is $\mu$, we use the martingale method to show that any Lipschitz function $f$ of these random variables is subgaussian. The variance parameter can, however, have a simple expression under certain conditions, for example under the assumption that the random variables follow a Markov chain and that the function is Lipschitz under a weighted Hamming metric. We conclude with certain well-known techniques from concentration of suprema of random processes, with applications in Reinforcement Learning.
    Minimax Weight Learning for Absorbing MDPs. (arXiv:2301.03183v1 [cs.LG])
    Reinforcement learning policy evaluation problems are often modeled as finite or discounted/averaged infinite-horizon MDPs. In this paper, we study undiscounted off-policy policy evaluation for absorbing MDPs. Given a dataset consisting of i.i.d. episodes with a given truncation level, we propose a so-called MWLA algorithm to directly estimate the expected return via the importance ratio of the state-action occupancy measure. The Mean Square Error (MSE) bound for the MWLA method is investigated, and the dependence of the statistical errors on the data size and the truncation level is analyzed. Computational experiments in an episodic taxi environment illustrate the performance of the MWLA algorithm.
    How to Allocate your Label Budget? Choosing between Active Learning and Learning to Reject in Anomaly Detection. (arXiv:2301.02909v1 [cs.LG])
    Anomaly detection aims to find examples that deviate from the expected behaviour. Usually, anomaly detection is tackled from an unsupervised perspective because anomalous labels are rare and difficult to acquire. However, the lack of labels makes the anomaly detector have high uncertainty in some regions, which usually results in poor predictive performance or low user trust in the predictions. One can reduce such uncertainty by collecting specific labels using Active Learning (AL), which targets examples close to the detector's decision boundary. Alternatively, one can increase the user trust by allowing the detector to abstain from making highly uncertain predictions, which is called Learning to Reject (LR). One way to do this is by thresholding the detector's uncertainty based on where its performance is low, which requires labels to be evaluated. Although both AL and LR need labels, they work with different types of labels: AL seeks strategic labels, which are evidently biased, while LR requires i.i.d. labels to evaluate the detector's performance and set the rejection threshold. Because one usually has a single label budget, deciding how to optimally allocate it is challenging. In this paper, we propose a mixed strategy that, given a budget of labels, decides in multiple rounds whether to use the budget to collect AL labels or LR labels. The strategy is based on a reward function that measures the expected gain when allocating the budget to either side. We evaluate our strategy on 18 benchmark datasets and compare it to some baselines.  ( 2 min )
    Active Deep Learning Guided by Efficient Gaussian Process Surrogates. (arXiv:2301.02761v1 [cs.LG])
    The success of active learning relies on the exploration of the underlying data-generating distributions, populating sparsely labeled data areas, and exploitation of the information about the task gained by the baseline (neural network) learners. In this paper, we present a new algorithm that combines these two active learning modes. Our algorithm adopts a Bayesian surrogate for the baseline learner, and it optimizes the exploration process by maximizing the gain of information caused by new labels. Further, by instantly updating the surrogate learner for each new data instance, our model can faithfully simulate and exploit the continuous learning behavior of the learner without having to actually retrain it per label. In experiments with four benchmark classification datasets, our method demonstrated significant performance gains over state-of-the-art methods.
    A Characterization of Multilabel Learnability. (arXiv:2301.02729v1 [cs.LG])
    We consider the problem of multilabel classification and investigate learnability in batch and online settings. In both settings, we show that a multilabel function class is learnable if and only if each single-label restriction of the function class is learnable. As extensions, we also study multioutput regression in the batch setting and bandit feedback in the online setting. For the former, we characterize learnability w.r.t. $L_p$ losses. For the latter, we show a similar characterization as in the full-feedback setting.
    Faithful and Consistent Graph Neural Network Explanations with Rationale Alignment. (arXiv:2301.02791v1 [cs.LG])
    Uncovering rationales behind predictions of graph neural networks (GNNs) has received increasing attention over recent years. Instance-level GNN explanation aims to discover critical input elements, like nodes or edges, that the target GNN relies upon for making predictions. These identified sub-structures can provide interpretations of GNN's behavior. Though various algorithms are proposed, most of them formalize this task by searching for the minimal subgraph which can preserve original predictions. However, an inductive bias is deep-rooted in this framework: several subgraphs can result in the same or similar outputs as the original graphs. Consequently, they have the danger of providing spurious explanations and failing to provide consistent explanations. Applying them to explain weakly performing GNNs would further amplify these issues. To address this problem, we theoretically examine the predictions of GNNs from the causality perspective. Two typical reasons for spurious explanations are identified: confounding effect of latent variables like distribution shift, and causal factors distinct from the original input. Observing that both confounding effects and diverse causal rationales are encoded in internal representations, we propose a new explanation framework with an auxiliary alignment loss, which is theoretically proven to be optimizing a more faithful explanation objective intrinsically. Concretely for this alignment loss, a set of different perspectives are explored: anchor-based alignment, distributional alignment based on Gaussian mixture models, mutual-information-based alignment, etc. A comprehensive study is conducted both on the effectiveness of this new framework in terms of explanation faithfulness/consistency and on the advantages of these variants.
    Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement. (arXiv:2301.03028v1 [cs.LG])
    Time series forecasting has been a widely explored task of great importance in many applications. However, it is common that real-world time series data are recorded in a short time period, which results in a big gap between the deep model and the limited and noisy time series. In this work, we propose to address the time series forecasting problem with generative modeling and propose a bidirectional variational auto-encoder (BVAE) equipped with diffusion, denoise, and disentanglement, namely D3VAE. Specifically, a coupled diffusion probabilistic model is proposed to augment the time series data without increasing the aleatoric uncertainty and implement a more tractable inference process with BVAE. To ensure the generated series move toward the true target, we further propose to adapt and integrate the multiscale denoising score matching into the diffusion process for time series forecasting. In addition, to enhance the interpretability and stability of the prediction, we treat the latent variable in a multivariate manner and disentangle them on top of minimizing total correlation. Extensive experiments on synthetic and real-world data show that D3VAE outperforms competitive algorithms with remarkable margins. Our implementation is available at https://github.com/PaddlePaddle/PaddleSpatial/tree/main/research/D3VAE.
    Prognosis and Treatment Prediction of Type-2 Diabetes Using Deep Neural Network and Machine Learning Classifiers. (arXiv:2301.03093v1 [cs.LG])
    Type 2 Diabetes is a fast-growing, chronic metabolic disorder due to imbalanced insulin activity. The aim of this research is a comparative study of seven machine learning classifiers and an artificial neural network method to predict the detection and treatment of diabetes with high accuracy, in order to identify and treat diabetes patients at an early age. Our training and test dataset is an accumulation of information on 9483 diabetes patients. The training dataset is large enough to negate overfitting and provide for highly accurate test performance. We use performance measures such as accuracy and precision to identify the best algorithm, a deep ANN, which outperforms all other tested machine learning classifiers with 95.14% accuracy. We hope our high-performing model can be used by hospitals to predict diabetes and drive research into more accurate prediction models.
    REaaS: Enabling Adversarially Robust Downstream Classifiers via Robust Encoder as a Service. (arXiv:2301.02905v1 [cs.CR])
    Encoder as a service is an emerging cloud service. Specifically, a service provider first pre-trains an encoder (i.e., a general-purpose feature extractor) via either supervised learning or self-supervised learning and then deploys it as a cloud service API. A client queries the cloud service API to obtain feature vectors for its training/testing inputs when training/testing its classifier (called downstream classifier). A downstream classifier is vulnerable to adversarial examples, which are testing inputs with carefully crafted perturbation that the downstream classifier misclassifies. Therefore, in safety and security critical applications, a client aims to build a robust downstream classifier and certify its robustness guarantees against adversarial examples. What APIs should the cloud service provide, such that a client can use any certification method to certify the robustness of its downstream classifier against adversarial examples while minimizing the number of queries to the APIs? How can a service provider pre-train an encoder such that clients can build more certifiably robust downstream classifiers? We aim to answer the two questions in this work. For the first question, we show that the cloud service only needs to provide two APIs, which we carefully design, to enable a client to certify the robustness of its downstream classifier with a minimal number of queries to the APIs. For the second question, we show that an encoder pre-trained using a spectral-norm regularization term enables clients to build more robust downstream classifiers.
    Transfer learning for non-intrusive load monitoring and appliance identification in a smart home. (arXiv:2301.03018v1 [eess.SP])
    Non-intrusive load monitoring (NILM) or energy disaggregation is an inverse problem whereby the goal is to extract the load profiles of individual appliances, given an aggregate load profile of the mains of a home. NILM could help identify the power usage patterns of individual appliances in a home, and thus could help realize novel energy conservation schemes for smart homes. Against this backdrop, this work proposes a novel deep-learning approach to solve the NILM problem and a few related problems as follows. 1) We build upon the reputed seq2-point convolutional neural network (CNN) model to come up with the proposed seq2-[3]-point CNN model to solve the (home) NILM problem and site-NILM problem (basically, NILM at a smaller scale). 2) We solve the related problem of appliance identification by building upon the state-of-the-art (pre-trained) 2D-CNN models, i.e., AlexNet, ResNet-18, and DenseNet-121, which are trained upon two custom datasets that consist of wavelet and short-time Fourier transform (STFT)-based 2D electrical signatures of the appliances. 3) Finally, we do some basic qualitative inference about an individual appliance's health by comparing the power consumption of the same appliance across multiple homes. The low-frequency REDD dataset is used to train and test the proposed deep learning models for all problems, except site-NILM, where the REFIT dataset has been used. As for the results, we achieve a maximum accuracy of 94.6% for home-NILM, 81% for site-NILM, and 88.9% for appliance identification (with the ResNet-based model).
    k-Means SubClustering: A Differentially Private Algorithm with Improved Clustering Quality. (arXiv:2301.02896v1 [cs.LG])
    In today's data-driven world, the sensitivity of information has been a significant concern. With this data and additional information on the person's background, one can easily infer an individual's private data. Many differentially private iterative algorithms have been proposed in interactive settings to protect an individual's privacy from these inference attacks. The existing approaches compute differentially private (DP) centroids by running the iterative Lloyd's algorithm and perturbing the centroids with various DP mechanisms. These DP mechanisms do not guarantee convergence of differentially private iterative algorithms and degrade the quality of the clusters. Thus, in this work, we further extend the previous work on 'Differentially Private k-Means Clustering With Convergence Guarantee' by taking it as our baseline. The novelty of our approach is to sub-cluster the clusters and then select the centroid which has a higher probability of moving in the direction of the future centroid. At every Lloyd's step, noise is injected into the centroids using the exponential DP mechanism. The results of the experiments indicate that our approach outperforms the current state-of-the-art method, i.e., the baseline algorithm, in terms of clustering quality while maintaining the same differential privacy requirements. The clustering quality improved significantly, by factors of 4.13 and 2.83 over the baseline for the Wine and Breast_Cancer datasets, respectively.
    Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching. (arXiv:2301.02903v1 [cs.LG])
    Despite surprising performance on zero-shot transfer, pre-training a large-scale multimodal model is often prohibitive as it requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
    Traditional Readability Formulas Compared for English. (arXiv:2301.02975v1 [cs.CL])
    Traditional English readability formulas, or equations, were largely developed in the 20th century. Nonetheless, many researchers still rely on them for various NLP applications. Such a phenomenon is presumably due to the convenience and straightforwardness of readability formulas. In this work, we contribute to the NLP community by 1. introducing New English Readability Formula (NERF), 2. recalibrating the coefficients of old readability formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluating the readability formulas, for use in text simplification studies and medical texts, and 4. developing a Python-based program for the wide application to various NLP projects.
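    To show the kind of equation the abstract is talking about, here is a small sketch of two of the listed formulas with their traditional published coefficients: Flesch-Kincaid Grade Level and the Automated Readability Index. These are not the recalibrated coefficients or the NERF formula from the paper, the function names are illustrative, and the inputs are precomputed counts because syllable and sentence counting are deliberately left out of the sketch.

```python
def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level with the traditional coefficients."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def automated_readability_index(characters, words, sentences):
    """Automated Readability Index with the traditional coefficients."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43

if __name__ == "__main__":
    # Hypothetical counts for a 100-word, 5-sentence passage.
    print(flesch_kincaid_grade(words=100, sentences=5, syllables=150))
    print(automated_readability_index(characters=450, words=100, sentences=5))
```

    Recalibrating a formula, in this framing, amounts to refitting the three numeric constants in each function against a graded corpus while keeping the same input features.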
    Online Centralized Non-parametric Change-point Detection via Graph-based Likelihood-ratio Estimation. (arXiv:2301.03011v1 [stat.ML])
    Consider each node of a graph to be generating a data stream that is synchronized and observed in near real time. At a change-point $\tau$, a change occurs at a subset of nodes $C$, which affects the probability distribution of their associated node streams. In this paper, we propose a novel kernel-based method to both detect $\tau$ and localize $C$, based on the direct estimation of the likelihood-ratio between the post-change and the pre-change distributions of the node streams. Our main working hypothesis is the smoothness of the likelihood-ratio estimates over the graph, i.e., connected nodes are expected to have similar likelihood-ratios. The quality of the proposed method is demonstrated in extensive experiments on synthetic scenarios.
    Learning the Relation between Similarity Loss and Clustering Loss in Self-Supervised Learning. (arXiv:2301.03041v1 [cs.CV])
    Self-supervised learning enables networks to learn discriminative features from massive data itself. Most state-of-the-art methods maximize the similarity between two augmentations of one image based on contrastive learning. By utilizing the consistency of two augmentations, the burden of manual annotation can be removed. Contrastive learning exploits instance-level information to learn robust features. However, the learned information is probably confined to different views of the same instance. In this paper, we attempt to leverage the similarity between two distinct images to boost representation in self-supervised learning. In contrast to instance-level information, the similarity between two distinct images may provide more useful information. In addition, we analyze the relation between similarity loss and feature-level cross-entropy loss. These two losses are essential for most deep learning methods. However, the relation between these two losses is not clear. Similarity loss helps obtain instance-level representation, while feature-level cross-entropy loss helps mine the similarity between two distinct images. We provide theoretical analyses and experiments to show that a suitable combination of these two losses can achieve state-of-the-art results.
    Unsupervised Learning for Combinatorial Optimization Needs Meta-Learning. (arXiv:2301.03116v1 [cs.LG])
    A general framework of unsupervised learning for combinatorial optimization (CO) is to train a neural network (NN) whose output gives a problem solution by directly optimizing the CO objective. Although it has some advantages over traditional solvers, the current framework optimizes an averaged performance over the distribution of historical problem instances, which is misaligned with the actual goal of CO, namely finding a good solution to every future encountered instance. With this observation, we propose a new objective of unsupervised learning for CO where the goal of learning is to search for a good initialization for future problem instances rather than give direct solutions. We propose a meta-learning-based training pipeline for this new objective. Our method achieves good empirical performance. We observe that even just the initial solution given by our model before fine-tuning can significantly outperform the baselines under various evaluation settings, including evaluation across multiple datasets and the case with big shifts in the problem scale. We conjecture that the reason is that meta-learning-based training lets the model be only loosely tied to each local optimum for a training instance while being more adaptive to the changes of optimization landscapes across instances.
    Explaining Graph Neural Networks via Non-parametric Subgraph Matching. (arXiv:2301.02780v1 [cs.LG])
    The great success of graph neural networks (GNNs) provokes the question of explainability: which fraction of the input graph is the most determinant of the prediction? In particular, parametric explainers prevail in existing approaches because of their stronger capability to decipher the black box (i.e., the target GNN). In this paper, based on the observation that graphs typically share some joint motif patterns, we propose a novel non-parametric subgraph matching framework, dubbed MatchExplainer, to explore explanatory subgraphs. It couples the target graph with other counterpart instances and identifies the most crucial joint substructure by minimizing the node-correspondence-based distance. Moreover, we note that present graph sampling or node-dropping methods usually suffer from the false positive sampling problem. To ameliorate that issue, we design a new augmentation paradigm named MatchDrop. It takes advantage of MatchExplainer to fix the most informative portion of the graph and applies graph augmentations only to the remaining, less informative part. We conduct extensive experiments on both synthetic and real-world datasets and show the effectiveness of our MatchExplainer by outperforming all parametric baselines with significant margins. Additional results also demonstrate that our MatchDrop is a general scheme that can be equipped with GNNs for enhanced performance.
    The 3D Structural Phenotype of the Glaucomatous Optic Nerve Head and its Relationship with The Severity of Visual Field Damage. (arXiv:2301.02837v1 [cs.LG])
    Purpose: To describe the 3D structural changes in both connective and neural tissues of the optic nerve head (ONH) that occur concurrently at different stages of glaucoma using traditional and AI-driven approaches. Methods: We included 213 normal, 204 mild glaucoma (mean deviation [MD] ≥ -6.00 dB), 118 moderate glaucoma (MD of -6.01 to -12.00 dB), and 118 advanced glaucoma patients (MD < -12.00 dB). All subjects had their ONHs imaged in 3D with Spectralis optical coherence tomography. To describe the 3D structural phenotype of glaucoma as a function of severity, we used two different approaches: (1) We extracted human-defined 3D structural parameters of the ONH including retinal nerve fiber layer (RNFL) thickness, lamina cribrosa (LC) shape and depth at different stages of glaucoma; (2) we also employed a geometric deep learning method (i.e., PointNet) to identify the most important 3D structural features that differentiate ONHs from different glaucoma severity groups without any human input. Results: We observed that the majority of ONH structural changes occurred in the early glaucoma stage, followed by a plateau effect in the later stages. Using PointNet, we also found that 3D ONH structural changes were present in both neural and connective tissues. In both approaches, we observed that structural changes were more prominent in the superior and inferior quadrants of the ONH, particularly in the RNFL, the prelamina, and the LC. As the severity of glaucoma increased, these changes became more diffuse (i.e., widespread), particularly in the LC. Conclusions: In this study, we were able to uncover complex 3D structural changes of the ONH in both neural and connective tissues as a function of glaucoma severity. We hope to provide new insights into the complex pathophysiology of glaucoma that might help clinicians in their daily clinical care.
    Reducing Over-smoothing in Graph Neural Networks Using Relational Embeddings. (arXiv:2301.02924v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved a lot of success with graph-structured data. However, it is observed that the performance of GNNs does not improve (or even worsens) as the number of layers increases. This effect is known as over-smoothing, which means that the representations of graph nodes of different classes become indistinguishable when stacking multiple layers. In this work, we propose a new, simple, and efficient method to alleviate the effect of the over-smoothing problem in GNNs by explicitly using relations between node embeddings. Experiments on real-world datasets demonstrate that utilizing node embedding relations makes GNN models such as Graph Attention Network more robust to over-smoothing and achieves better performance with deeper GNNs. Our method can be used in combination with other methods to give the best performance. GNN applications are endless and depend on the user's objective and the type of data that they possess. Solving over-smoothing issues can potentially improve the performance of models on all these tasks.
    AutoAC: Towards Automated Attribute Completion for Heterogeneous Graph Neural Network. (arXiv:2301.03049v1 [cs.LG])
    Much real-world data can be modeled as heterogeneous graphs that contain multiple types of nodes and edges. Meanwhile, due to their excellent performance, heterogeneous graph neural networks (GNNs) have received more and more attention. However, the existing work mainly focuses on the design of novel GNN models, while ignoring another important issue that also has a large impact on the model performance, namely the missing attributes of some node types. Handcrafted attribute completion requires extensive expert experience and domain knowledge. Also, considering the differences in semantic characteristics between nodes, the attribute completion should be fine-grained, i.e., the attribute completion operation should be node-specific. Moreover, to improve the performance of the downstream graph learning task, attribute completion and the training of the heterogeneous GNN should be jointly optimized rather than viewed as two separate processes. To address the above challenges, we propose a differentiable attribute completion framework called AutoAC for automated completion operation search in heterogeneous GNNs. We first propose an expressive completion operation search space, including topology-dependent and topology-independent completion operations. Then, we propose a continuous relaxation schema and further propose a differentiable completion algorithm where the completion operation search is formulated as a bi-level joint optimization problem. To improve the search efficiency, we leverage two optimization techniques: discrete constraints and auxiliary unsupervised graph node clustering. Extensive experimental results on real-world datasets reveal that AutoAC outperforms the SOTA handcrafted heterogeneous GNNs and the existing attribute completion method.
    Subset verification and search algorithms for causal DAGs. (arXiv:2301.03180v1 [cs.LG])
    Learning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of learning the causal relationships of a subset of edges (target edges) in a graph with as few interventions as possible. Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared with the subset verification number. This result is surprising as there exists a logarithmic approximation algorithm for the search problem when we wish to recover the whole causal graph. To obtain our results, we prove several interesting structural properties of interventional causal graphs that we believe have applications beyond the subset verification/search problems studied here.
    Advanced Data Augmentation Approaches: A Comprehensive Survey and Future directions. (arXiv:2301.02830v1 [cs.CV])
    Deep learning (DL) algorithms have shown significant performance in various computer vision tasks. However, limited labelled data leads to network overfitting, where network performance on unseen data is worse than on training data. Consequently, it limits performance improvement. To cope with this problem, various techniques have been proposed, such as dropout, normalization and advanced data augmentation. Among these, data augmentation, which aims to enlarge the dataset size by including sample diversity, has been a hot topic in recent times. In this article, we focus on advanced data augmentation techniques. We provide a background of data augmentation, a novel and comprehensive taxonomy of reviewed data augmentation techniques, and the strengths and weaknesses (wherever possible) of each technique. We also provide comprehensive results of the data augmentation effect on three popular computer vision tasks: image classification, object detection and semantic segmentation. For reproducibility of results, we compiled the available code for all data augmentation techniques. Finally, we discuss the challenges and difficulties, and possible future directions for the research community. We believe this survey provides several benefits: i) readers will understand how data augmentation works to fix overfitting problems; ii) the results will save researchers search time for comparison purposes; iii) code for the mentioned data augmentation techniques is available at https://github.com/kmr2017/Advanced-Data-augmentation-codes; iv) the future work will spark interest in the research community.
    SCENE: Reasoning about Traffic Scenes using Heterogeneous Graph Neural Networks. (arXiv:2301.03512v1 [cs.CV])
    Understanding traffic scenes requires considering heterogeneous information about dynamic agents and the static infrastructure. In this work we propose SCENE, a methodology to encode diverse traffic scenes in heterogeneous graphs and to reason about these graphs using a heterogeneous Graph Neural Network encoder and task-specific decoders. The heterogeneous graphs, whose structures are defined by an ontology, consist of different nodes with type-specific node features and different relations with type-specific edge features. In order to exploit all the information given by these graphs, we propose to use cascaded layers of graph convolution. The result is an encoding of the scene. Task-specific decoders can be applied to predict desired attributes of the scene. Extensive evaluation on two diverse binary node classification tasks shows the main strength of this methodology: despite being generic, it even manages to outperform task-specific baselines. The further application of our methodology to the task of node classification in various knowledge graphs shows its transferability to other domains.
    Efficient Attack Detection in IoT Devices using Feature Engineering-Less Machine Learning. (arXiv:2301.03532v1 [cs.CR])
    Through the generalization of deep learning, the research community has addressed critical challenges in the network security domain, like malware identification and anomaly detection. However, they have yet to discuss deploying them on Internet of Things (IoT) devices for day-to-day operations. IoT devices are often limited in memory and processing power, rendering the compute-intensive deep learning environment unusable. This research proposes a way to overcome this barrier by bypassing feature engineering in the deep learning pipeline and using raw packet data as input. We introduce a feature engineering-less machine learning (ML) process to perform malware detection on IoT devices. Our proposed model, "Feature engineering-less-ML (FEL-ML)," is a lighter-weight detection algorithm that expends no extra computations on "engineered" features. It effectively accelerates the low-powered IoT edge. It is trained on unprocessed byte-streams of packets. Aside from providing better results, it is quicker than traditional feature-based methods. FEL-ML facilitates resource-sensitive network traffic security with the added benefit of eliminating the significant investment by subject matter experts in feature engineering.
    Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization. (arXiv:2108.11832v2 [math.OC] UPDATED)
    We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.
    Grokking modular arithmetic. (arXiv:2301.02679v1 [cs.LG])
    We present a simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as "grokking". Concretely, we present (i) fully-connected two-layer networks that exhibit grokking on various modular arithmetic tasks under vanilla gradient descent with the MSE loss function in the absence of any regularization; (ii) evidence that grokking modular arithmetic corresponds to learning specific feature maps whose structure is determined by the task; (iii) analytic expressions for the weights -- and thus for the feature maps -- that solve a large class of modular arithmetic tasks; and (iv) evidence that these feature maps are also found by vanilla gradient descent as well as AdamW, thereby establishing complete interpretability of the representations learnt by the network.
    Few-shot Node Classification with Extremely Weak Supervision. (arXiv:2301.02708v1 [cs.LG])
    Few-shot node classification aims at classifying nodes with limited labeled nodes as references. Recent few-shot node classification methods typically learn from classes with abundant labeled nodes (i.e., meta-training classes) and then generalize to classes with limited labeled nodes (i.e., meta-test classes). Nevertheless, on real-world graphs, it is usually difficult to obtain abundant labeled nodes for many classes. In practice, each meta-training class can only consist of several labeled nodes, known as the extremely weak supervision problem. In few-shot node classification, with extremely limited labeled nodes for meta-training, the generalization gap between meta-training and meta-test will become larger and thus lead to suboptimal performance. To tackle this issue, we study a novel problem of few-shot node classification with extremely weak supervision and propose a principled framework X-FNC under the prevalent meta-learning framework. Specifically, our goal is to accumulate meta-knowledge across different meta-training tasks with extremely weak supervision and generalize such knowledge to meta-test tasks. To address the challenges resulting from extremely scarce labeled nodes, we propose two essential modules to obtain pseudo-labeled nodes as extra references and effectively learn from extremely limited supervision information. We further conduct extensive experiments on four node classification datasets with extremely weak supervision to validate the superiority of our framework compared to the state-of-the-art baselines.
    Perceptual-Neural-Physical Sound Matching. (arXiv:2301.02886v1 [cs.SD])
    Sound matching algorithms seek to approximate a target waveform by parametric audio synthesis. Deep neural networks have achieved promising results in matching sustained harmonic tones. However, the task is more challenging when targets are nonstationary and inharmonic, e.g., percussion. We attribute this problem to the inadequacy of the loss function. On one hand, mean square error in the parametric domain, known as "P-loss", is simple and fast but fails to accommodate the differing perceptual significance of each parameter. On the other hand, mean square error in the spectrotemporal domain, known as "spectral loss", is perceptually motivated and serves in differentiable digital signal processing (DDSP). Yet, spectral loss has more local minima than P-loss and its gradient may be computationally expensive, hence slow convergence. To resolve this conundrum, we present Perceptual-Neural-Physical loss (PNP). PNP is the optimal quadratic approximation of spectral loss while being as fast as P-loss during training. We instantiate PNP with physical modeling synthesis as decoder and joint time-frequency scattering transform (JTFS) as spectral representation. We demonstrate its potential on matching synthetic drum sounds in comparison with other loss functions.
    On Consistency and Asymptotic Normality of Least Absolute Deviation Estimators for 2-dimensional Sinusoidal Model. (arXiv:2301.03229v1 [math.ST])
    Estimation of the parameters of a 2-dimensional sinusoidal model is a fundamental problem in digital signal processing. In this paper, we propose robust least absolute deviation (LAD) estimators for parameter estimation. The proposed methodology provides a robust alternative to non-robust estimation techniques like the least squares estimators, in situations where outliers are present in the data or in the presence of heavy-tailed noise. We study important asymptotic properties of the LAD estimators and establish their strong consistency and asymptotic normality. We further illustrate the advantage of using LAD estimators over least squares estimators through extensive simulation studies.
    AI Maintenance: A Robustness Perspective. (arXiv:2301.03052v1 [cs.LG])
    With the advancements in machine learning (ML) methods and compute resources, artificial intelligence (AI) empowered systems are becoming a prevailing technology. However, current AI technology such as deep learning is not flawless. The significantly increased model complexity and data scale incur intensified challenges when lacking trustworthiness and transparency, which could create new risks and negative impacts. In this paper, we carve out AI maintenance from the robustness perspective. We start by introducing some highlighted robustness challenges in the AI lifecycle and motivating AI maintenance by making analogies to car maintenance. We then propose an AI model inspection framework to detect and mitigate robustness risks. We also draw inspiration from vehicle autonomy to define the levels of AI robustness automation. Our proposal for AI maintenance facilitates robustness assessment, status tracking, risk scanning, model hardening, and regulation throughout the AI lifecycle, which is an essential milestone toward building sustainable and trustworthy AI ecosystems.
    Why do Nearest Neighbor Language Models Work?. (arXiv:2301.02828v1 [cs.CL])
    Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations through a neural network consuming the immediate previous context. Recently, however, retrieval-augmented LMs have been shown to improve over standard neural LMs by accessing information retrieved from a large datastore, in addition to their standard, parametric, next-word prediction. In this paper, we set out to understand why retrieval-augmented language models, and specifically k-nearest neighbor language models (kNN-LMs), perform better than standard parametric LMs, even when the k-nearest neighbor component retrieves examples from the same training set that the LM was originally trained on. To this end, we perform a careful analysis of the various dimensions over which kNN-LM diverges from standard LMs, and investigate these dimensions one by one. Empirically, we identify three main reasons why kNN-LM performs better than standard LMs: using a different input representation for predicting the next tokens, approximate kNN search, and the importance of softmax temperature for the kNN distribution. Further, we incorporate these insights into the model architecture or the training procedure of the standard parametric LM, improving its results without the need for an explicit retrieval component. The code is available at https://github.com/frankxu2004/knnlm-why.
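A minimal sketch of how a kNN-LM prediction is commonly formed may help here. It assumes the usual formulation in which retrieved neighbors vote for their stored next token with weight exp(-distance / temperature), and the resulting kNN distribution is linearly interpolated with the parametric LM distribution. The function names, the toy neighbors, and the interpolation weight `lam` are illustrative assumptions, not taken from the paper's repository.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Each neighbor contributes exp(-distance / temperature) to its token.
    Subtracting the smallest distance first only improves numerical
    stability; it cancels out after normalization.
    """
    d_min = min(d for d, _ in neighbors)
    weights = defaultdict(float)
    for d, token in neighbors:
        weights[token] += math.exp(-(d - d_min) / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }


# Hypothetical toy inputs: two equally distant neighbors and a two-token LM.
p_knn = knn_distribution([(1.0, "a"), (1.0, "b")], temperature=1.0)
p_mix = interpolate({"a": 0.8, "b": 0.2}, p_knn, lam=0.5)
```

Raising `temperature` flattens the kNN distribution across neighbors and lowering it concentrates mass on the closest ones, which is the knob the abstract refers to as softmax temperature.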
    Isotonic Recalibration under a Low Signal-to-Noise Ratio. (arXiv:2301.02692v1 [stat.ME])
    Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to ensure auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to explainable pricing systems because the resulting isotonically recalibrated regression functions have a low complexity.
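To make the recalibration step concrete, the sketch below fits a non-decreasing step function of the model's predictions to the observed outcomes using the pool-adjacent-violators algorithm. It is an illustration under simplifying assumptions (unit weights, no special handling of tied predictions), and the function names are hypothetical rather than the authors' code.

```python
def pava(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to `values`."""
    blocks = []  # each block is [sum, count]; block means stay non-decreasing
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def isotonic_recalibrate(predictions, outcomes):
    """Replace each prediction by the isotonic fit of outcomes on prediction rank."""
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    fitted_sorted = pava([outcomes[i] for i in order])
    recalibrated = [0.0] * len(predictions)
    for rank, i in enumerate(order):
        recalibrated[i] = fitted_sorted[rank]
    return recalibrated


# Toy data: a ranking model whose raw prices are off.
recal = isotonic_recalibrate([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
n_levels = len(set(recal))  # number of distinct price levels after recalibration
```

On the training sample, the average outcome within each resulting price level equals that level, which is the empirical counterpart of auto-calibration. The number of distinct levels (`n_levels`) is one simple way to read the "low complexity" mentioned in the abstract: noisier outcomes cause more pooling and therefore fewer steps.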
    BQ-NCO: Bisimulation Quotienting for Generalizable Neural Combinatorial Optimization. (arXiv:2301.03313v1 [cs.LG])
    Despite the success of Neural Combinatorial Optimization methods for end-to-end heuristic learning, out-of-distribution generalization remains a challenge. In this paper, we present a novel formulation of combinatorial optimization (CO) problems as Markov Decision Processes (MDPs) that effectively leverages symmetries of the CO problems to improve out-of-distribution robustness. Starting from the standard MDP formulation of constructive heuristics, we introduce a generic transformation based on bisimulation quotienting (BQ) in MDPs. This transformation makes it possible to reduce the state space by accounting for the intrinsic symmetries of the CO problem, and it facilitates solving the MDP. We illustrate our approach on the Traveling Salesman, Capacitated Vehicle Routing and Knapsack Problems. We present a BQ reformulation of these problems and introduce a simple attention-based policy network that we train by imitation of (near) optimal solutions for small instances from a single distribution. We obtain new state-of-the-art generalization results for instances with up to 1000 nodes from synthetic and realistic benchmarks that vary both in size and node distributions.
    Modeling Scattering Coefficients in Antenna Design using Self-Attentive Complex Polynomials with Image-based Representation. (arXiv:2301.02747v1 [cs.LG])
    Finding antenna designs that satisfy frequency requirements and are also optimal with respect to multiple physical criteria is a critical component in designing next generation hardware. However, such a process is non-trivial because the objective function is typically highly nonlinear and sensitive to subtle design changes. Moreover, the objective to be optimized often involves electromagnetic (EM) simulations, which are slow and expensive with commercial simulation software. In this work, we propose a sample-efficient and accurate surrogate model, named CZP (Constant Zeros Poles), to directly estimate the scattering coefficients in the frequency domain of a given 2D planar antenna design, without using a simulator. CZP achieves this by predicting the complex zeros and poles for the frequency response of scattering coefficients, which we have theoretically justified for any linear PDE, including Maxwell's equations. Moreover, instead of using low-dimensional representations, CZP leverages a novel image-based representation for antenna topology inspired by the existing mesh-based EM simulation techniques, and attention-based neural network architectures. We demonstrate experimentally that CZP not only outperforms baselines in terms of test loss, but is also able to find 2D antenna designs verifiable by commercial software with only 40k training samples, when coupled with advanced sequential search techniques like reinforcement learning.
    Principal Component Analysis in Space Forms. (arXiv:2301.02750v1 [stat.ML])
    Principal component analysis (PCA) is a workhorse of modern data science. Practitioners typically perform PCA assuming the data conforms to Euclidean geometry. However, for specific data types, such as hierarchical data, other geometrical spaces may be more appropriate. We study PCA in space forms; that is, those with constant positive (spherical) and negative (hyperbolic) curvatures, in addition to zero-curvature (Euclidean) spaces. At any point on a Riemannian manifold, one can define a Riemannian affine subspace based on a set of tangent vectors and use invertible maps to project tangent vectors to the manifold and vice versa. Finding a low-dimensional Riemannian affine subspace for a set of points in a space form amounts to dimensionality reduction because, as we show, any such affine subspace is isometric to a space form of the same dimension and curvature. To find principal components, we seek a (Riemannian) affine subspace that best represents a set of manifold-valued data points with the minimum average cost of projecting data points onto the affine subspace. We propose specific cost functions that bring about two major benefits: (1) the affine subspace can be estimated by solving an eigenequation -- similar to that of Euclidean PCA, and (2) optimal affine subspaces of different dimensions form a nested set. These properties provide advances over existing methods which are mostly iterative algorithms with slow convergence and weaker theoretical guarantees. Specifically for hyperbolic PCA, the associated eigenequation operates in the Lorentzian space, endowed with an indefinite inner product; we thus establish a connection between Lorentzian and Euclidean eigenequations. We evaluate the proposed space form PCA on data sets simulated in spherical and hyperbolic spaces and show that it outperforms alternative methods in convergence speed or accuracy, often both.
    Visual Story Generation Based on Emotion and Keywords. (arXiv:2301.02777v1 [cs.AI])
    Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development. This work proposes a story generation pipeline to co-create visual stories with the users. The pipeline allows the user to control events and emotions on the generated content. The pipeline includes two parts: narrative and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images to allow objects in these images to be mentioned in future story development.
    Discovery of structure-property relations for molecules via hypothesis-driven active learning over the chemical space. (arXiv:2301.02665v1 [cs.LG])
    Discovery of the molecular candidates for applications in drug targets, biomolecular systems, catalysts, photovoltaics, organic electronics, and batteries, necessitates development of machine learning algorithms capable of rapid exploration of the chemical spaces targeting the desired functionalities. Here we introduce a novel approach for the active learning over the chemical spaces based on hypothesis learning. We construct the hypotheses on the possible relationships between structures and functionalities of interest based on a small subset of data and introduce them as (probabilistic) mean functions for the Gaussian process. This approach combines the elements from the symbolic regression methods such as SISSO and active learning into a single framework. Here, we demonstrate it for the QM9 dataset, but it can be applied more broadly to datasets from both domains of molecular and solid-state materials sciences.
    Multimodal Lyrics-Rhythm Matching. (arXiv:2301.02732v1 [cs.SD])
    Despite the recent increase in research on artificial intelligence for music, prominent correlations between key components of lyrics and rhythm such as keywords, stressed syllables, and strong beats are not frequently studied. This is likely due to challenges such as audio misalignment, inaccuracies in syllabic identification, and most importantly, the need for cross-disciplinary knowledge. To address this lack of research, we propose a novel multimodal lyrics-rhythm matching approach in this paper that specifically matches key components of lyrics and music with each other without any language limitations. We use audio instead of sheet music with readily available metadata, which creates more challenges yet increases the application flexibility of our method. Furthermore, our approach creatively generates several patterns involving various multimodalities, including music strong beats, lyrical syllables, auditory changes in a singer's pronunciation, and especially lyrical keywords, which are utilized for matching key lyrical elements with key rhythmic elements. This advantageous approach not only provides a unique way to study auditory lyrics-rhythm correlations including efficient rhythm-based audio alignment algorithms, but also bridges computational linguistics with music as well as music cognition. Our experimental results reveal a 0.81 probability of matching on average, and around 30% of the songs have a probability of 0.9 or higher of keywords landing on strong beats, including 12% of the songs with a perfect landing. Also, similarity metrics are used to evaluate the correlation between lyrics and rhythm. They show that nearly 50% of the songs have 0.70 similarity or higher. In conclusion, our approach contributes significantly to the lyrics-rhythm relationship by computationally unveiling insightful correlations.
  • Open

    A Characterization of Multilabel Learnability. (arXiv:2301.02729v1 [cs.LG])
    We consider the problem of multilabel classification and investigate learnability in batch and online settings. In both settings, we show that a multilabel function class is learnable if and only if each single-label restriction of the function class is learnable. As extensions, we also study multioutput regression in the batch setting and bandit feedback in the online setting. For the former, we characterize learnability w.r.t. $L_p$ losses. For the latter, we show a similar characterization as in the full-feedback setting.
    Upward lightning at wind turbines: Risk assessment from larger-scale meteorology. (arXiv:2301.03360v1 [stat.ML])
    Upward lightning (UL) has become an increasingly important threat to wind turbines as ever more of them are being installed for renewably producing electricity. The taller the wind turbine, the higher the risk that the type of lightning striking the man-made structure is UL. UL can be much more destructive than downward lightning due to its long-lasting initial continuous current leading to a large charge transfer within the lightning discharge process. Current standards for the risk assessment of lightning at wind turbines mainly take the summer lightning activity into account, which is inferred from lightning location systems (LLS). Ground-truth lightning current measurements reveal that less than 50% of UL might be detected by LLS. This leads to a large underestimation of the proportion of LLS-non-detectable UL at wind turbines, which is the dominant lightning type in the cold season. This study aims to assess the risk of LLS-detectable and LLS-non-detectable UL at wind turbines using direct UL measurements at the Gaisberg Tower (Austria) and Säntis Tower (Switzerland). Direct UL observations are linked to meteorological reanalysis data and joined by random forests, a powerful machine learning technique. The meteorological drivers for the non-/occurrence of LLS-detectable and LLS-non-detectable UL, respectively, are found from the random forest models trained at the towers and have large predictive skill on independent data. In a second step the results from the tower-trained models are extended to a larger study domain (Central and Northern Germany). The tower-trained models for LLS-detectable lightning are independently verified at wind turbine locations in that domain and found to reliably diagnose that type of UL. Risk maps based on case study events show that high diagnosed probabilities in the study domain coincide with actual UL events.
    Generalized Kernel Regularized Least Squares. (arXiv:2209.14355v2 [stat.ML] UPDATED)
    Kernel Regularized Least Squares (KRLS) is a popular method for flexibly estimating models that may have complex relationships between variables. However, its usefulness to many researchers is limited for two reasons. First, existing approaches are inflexible and do not allow KRLS to be combined with theoretically-motivated extensions such as random effects, unregularized fixed effects, or non-Gaussian outcomes. Second, estimation is extremely computationally intensive for even modestly sized datasets. Our paper addresses both concerns by introducing generalized KRLS (gKRLS). We note that KRLS can be re-formulated as a hierarchical model thereby allowing easy inference and modular model construction where KRLS can be used alongside random effects, splines, and unregularized fixed effects. Computationally, we also implement random sketching to dramatically accelerate estimation while incurring a limited penalty in estimation quality. We demonstrate that gKRLS can be fit on datasets with tens of thousands of observations in under one minute. Further, state-of-the-art techniques that require fitting the model over a dozen times (e.g. meta-learners) can be estimated quickly.
    Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation. (arXiv:2301.03125v1 [stat.ML])
    The stochastic proximal point (SPP) methods have gained recent attention for stochastic optimization, showing strong convergence guarantees and superior robustness to the classic stochastic gradient descent (SGD) methods at little to no added computational overhead. In this article, we study a minibatch variant of SPP, namely M-SPP, for solving convex composite risk minimization problems. The core contribution is a set of novel excess risk bounds of M-SPP derived through the lens of algorithmic stability theory. Particularly under smoothness and quadratic growth conditions, we show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence consisting of an $\mathcal{O}\left(\frac{1}{T^2}\right)$ bias decaying term and an $\mathcal{O}\left(\frac{1}{nT}\right)$ variance decaying term. In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches by revealing the impact of noise level of model on convergence rate. In the complementary small-$T$-large-$n$ regime, we provide a two-phase extension of M-SPP to achieve comparable convergence rates. Moreover, we derive a near-tight high probability (over the randomness of data) bound on the parameter estimation error of a sampling-without-replacement variant of M-SPP. Numerical evidence is provided to support our theoretical predictions when instantiated for Lasso and logistic regression models.
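For readers unfamiliar with proximal point updates, here is a small sketch of one minibatch step in the simplest possible setting: a scalar parameter and a plain least-squares loss with no regularizer, where the proximal subproblem has a closed form. This is a deliberately reduced illustration and not the composite setting analysed in the paper; the step size `gamma`, the function name, and the toy data are assumptions.

```python
def mspp_step(x, batch, gamma):
    """One minibatch proximal point step for scalar least squares.

    Solves  argmin_z  (1 / (2n)) * sum_i (a_i * z - b_i)**2
                      + (1 / (2 * gamma)) * (z - x)**2
    exactly. Setting the derivative to zero gives
        z = (x / gamma + mean(a_i * b_i)) / (1 / gamma + mean(a_i**2)).
    """
    n = len(batch)
    mean_ab = sum(a * b for a, b in batch) / n
    mean_aa = sum(a * a for a, _ in batch) / n
    return (x / gamma + mean_ab) / (1.0 / gamma + mean_aa)


# Hypothetical noiseless data generated by b = 2 * a, so the minimizer is 2.
batch = [(1.0, 2.0), (1.0, 2.0)]
x = 0.0
trajectory = []
for _ in range(50):
    x = mspp_step(x, batch, gamma=1.0)
    trajectory.append(x)
```

Unlike a gradient step, this update remains stable for any positive `gamma`, because the new iterate is a weighted average of the current iterate and the minibatch minimizer. That is one way to see the robustness to step-size choice that the abstract attributes to SPP methods.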
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v2 [cs.LG] UPDATED)
    The ability to ensure that a classifier gives reliable confidence scores is essential to ensure informed decision-making. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet calibration is not enough: even a perfectly calibrated classifier with the best possible accuracy can have confidence scores that are far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution-shift settings, which highlights the importance of pre-production validation.
    Improved Training of Physics-Informed Neural Networks with Model Ensembles. (arXiv:2204.05108v2 [cs.LG] UPDATED)
    Learning the solution of partial differential equations (PDEs) with a neural network (known in the literature as a physics-informed neural network, PINN) is an attractive alternative to traditional solvers due to its elegance, greater flexibility and the ease of incorporating observed data. However, training PINNs is notoriously difficult in practice. One problem is the existence of multiple simple (but wrong) solutions which are attractive for PINNs when the solution interval is too large. In this paper, we propose to expand the solution interval gradually to make the PINN converge to the correct solution. To find a good schedule for the solution interval expansion, we train an ensemble of PINNs. The idea is that all ensemble members converge to the same solution in the vicinity of observed data (e.g., initial conditions) while they may be pulled towards different wrong solutions farther away from the observations. Therefore, we use the ensemble agreement as the criterion for including new points for computing the loss derived from PDEs. We show experimentally that the proposed method can improve the accuracy of the found solution.
    Efficient Approximation of Gromov-Wasserstein Distance Using Importance Sparsification. (arXiv:2205.13573v3 [cs.LG] UPDATED)
    As a valid metric of metric-measure spaces, Gromov-Wasserstein (GW) distance has shown the potential for matching problems of structured data like point clouds and graphs. However, its application in practice is limited due to the high computational complexity. To overcome this challenge, we propose a novel importance sparsification method, called \textsc{Spar-GW}, to approximate GW distance efficiently. In particular, instead of considering a dense coupling matrix, our method leverages a simple but effective sampling strategy to construct a sparse coupling matrix and update it with few computations. The proposed \textsc{Spar-GW} method is applicable to the GW distance with arbitrary ground cost, and it reduces the complexity from $O(n^4)$ to $O(n^{2+\delta})$ for an arbitrary small $\delta>0$. Theoretically, the convergence and consistency of the proposed estimation for GW distance are established under mild regularity conditions. In addition, this method can be extended to approximate the variants of GW distance, including the entropic GW distance, the fused GW distance, and the unbalanced GW distance. Experiments show the superiority of our \textsc{Spar-GW} to state-of-the-art methods in both synthetic and real-world tasks.  ( 2 min )
    Making Decisions under Outcome Performativity. (arXiv:2210.01745v2 [cs.LG] UPDATED)
    Decision-makers often act in response to data-driven predictions, with the goal of achieving favorable outcomes. In such settings, predictions don't passively forecast the future; instead, predictions actively shape the distribution of outcomes they are meant to predict. This performative prediction setting raises new challenges for learning "optimal" decision rules. In particular, existing solution concepts do not address the apparent tension between the goals of forecasting outcomes accurately and steering individuals to achieve desirable outcomes. To contend with this concern, we introduce a new optimality concept -- performative omniprediction -- adapted from the supervised (non-performative) learning setting. A performative omnipredictor is a single predictor that simultaneously encodes the optimal decision rule with respect to many possibly-competing objectives. Our main result demonstrates that efficient performative omnipredictors exist, under a natural restriction of performative prediction, which we call outcome performativity. On a technical level, our results follow by carefully generalizing the notion of outcome indistinguishability to the outcome performative setting. From an appropriate notion of Performative OI, we recover many consequences known to hold in the supervised setting, such as omniprediction and universal adaptability.  ( 2 min )
    EMAHA-DB1: A New Upper Limb sEMG Dataset for Classification of Activities of Daily Living. (arXiv:2301.03325v1 [eess.SP])
    In this paper, we present electromyography analysis of human activity - database 1 (EMAHA-DB1), a novel dataset of multi-channel surface electromyography (sEMG) signals to evaluate the activities of daily living (ADL). The dataset is acquired from 25 able-bodied subjects while performing 22 activities categorised according to functional arm activity behavioral system (FAABOS) (3 - full hand gestures, 6 - open/close office drawer, 8 - grasping and holding of small office objects, 2 - flexion and extension of finger movements, 2 - writing and 1 - rest). The sEMG data is measured by a set of five Noraxon Ultium wireless sEMG sensors with Ag/AgCl electrodes placed on a human hand. The dataset is analyzed for hand activity recognition classification performance. The classification is performed using four state-of-the-art machine learning classifiers, including Random Forest (RF), Fine K-Nearest Neighbour (KNN), Ensemble KNN (sKNN) and Support Vector Machine (SVM) with seven combinations of time domain and frequency domain feature sets. The state-of-the-art classification accuracy on five FAABOS categories is 83.21% by using the SVM classifier with the third order polynomial kernel using energy feature and auto regressive feature set ensemble. The classification accuracy on 22 class hand activities is 75.39% by the same SVM classifier with the log moments in frequency domain (LMF) feature, modified LMF, time domain statistical (TDS) feature, spectral band powers (SBP), channel cross correlation and local binary patterns (LBP) set ensemble. The analysis depicts the technical challenges addressed by the dataset. The developed dataset can be used as a benchmark for various classification methods as well as for sEMG signal analysis corresponding to ADL and for the development of prosthetics and other wearable robotics.  ( 2 min )
    A Newton-CG based augmented Lagrangian method for finding a second-order stationary point of nonconvex equality constrained optimization with complexity guarantees. (arXiv:2301.03139v1 [math.OC])
    In this paper we consider finding a second-order stationary point (SOSP) of nonconvex equality constrained optimization when a nearly feasible point is known. In particular, we first propose a new Newton-CG method for finding an approximate SOSP of unconstrained optimization and show that it enjoys a substantially better complexity than the Newton-CG method [56]. We then propose a Newton-CG based augmented Lagrangian (AL) method for finding an approximate SOSP of nonconvex equality constrained optimization, in which the proposed Newton-CG method is used as a subproblem solver. We show that under a generalized linear independence constraint qualification (GLICQ), our AL method enjoys a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-7/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-7/2}\min\{n,\epsilon^{-3/4}\})$ for finding an $(\epsilon,\sqrt{\epsilon})$-SOSP of nonconvex equality constrained optimization with high probability, which are significantly better than the ones achieved by the proximal AL method [60]. Besides, we show that it has a total inner iteration complexity of $\widetilde{\cal O}(\epsilon^{-11/2})$ and an operation complexity of $\widetilde{\cal O}(\epsilon^{-11/2}\min\{n,\epsilon^{-5/4}\})$ when the GLICQ does not hold. To the best of our knowledge, all the complexity results obtained in this paper are new for finding an approximate SOSP of nonconvex equality constrained optimization with high probability. Preliminary numerical results also demonstrate the superiority of our proposed methods over the ones in [56,60].  ( 2 min )
    Asymptotic Bounds for Smoothness Parameter Estimates in Gaussian Process Interpolation. (arXiv:2203.05400v3 [math.ST] UPDATED)
    It is common to model a deterministic response function, such as the output of a computer experiment, as a Gaussian process with a Mat\'ern covariance kernel. The smoothness parameter of a Mat\'ern kernel determines many important properties of the model in the large data limit, including the rate of convergence of the conditional mean to the response function. We prove that the maximum likelihood estimate of the smoothness parameter cannot asymptotically undersmooth the truth when the data are obtained on a fixed bounded subset of $\mathbb{R}^d$. That is, if the data-generating response function has Sobolev smoothness $\nu_0 + d/2$, then the smoothness parameter estimate cannot be asymptotically less than $\nu_0 + d/2$. The lower bound is sharp. Additionally, we show that maximum likelihood estimation finds the "correct" smoothness for a class of compactly supported self-similar functions. We also consider cross-validation and prove an asymptotic lower bound $\nu_0$, which however is unlikely to be sharp. The results are based on approximation theory in Sobolev spaces and some general theorems that restrict the set of values that the parameter estimators can take.  ( 2 min )
    Exploration in Linear Bandits with Rich Action Sets and its Implications for Inference. (arXiv:2207.11597v3 [cs.LG] UPDATED)
    We present a non-asymptotic lower bound on the eigenspectrum of the design matrix generated by any linear bandit algorithm with sub-linear regret when the action set has well-behaved curvature. Specifically, we show that the minimum eigenvalue of the expected design matrix grows as $\Omega(\sqrt{n})$ whenever the expected cumulative regret of the algorithm is $O(\sqrt{n})$, where $n$ is the learning horizon, and the action-space has a constant Hessian around the optimal arm. This shows that such action-spaces force a polynomial lower bound rather than a logarithmic lower bound, as shown by \cite{lattimore2017end}, in discrete (i.e., well-separated) action spaces. Furthermore, while the previous result is shown to hold only in the asymptotic regime (as $n \to \infty$), our result for these "locally rich" action spaces is any-time. Additionally, under a mild technical assumption, we obtain a similar lower bound on the minimum eigenvalue holding with high probability. We apply our result to two practical scenarios -- \emph{model selection} and \emph{clustering} in linear bandits. For model selection, we show that an epoch-based linear bandit algorithm adapts to the true model complexity at a rate exponential in the number of epochs, by virtue of our novel spectral bound. For clustering, we consider a multi-agent framework where we show, by leveraging the spectral result, that no forced exploration is necessary -- the agents can run a linear bandit algorithm and estimate their underlying parameters at once, and hence incur a low regret.  ( 2 min )
    Stochastic Langevin Monte Carlo for (weakly) log-concave posterior distributions. (arXiv:2301.03077v1 [stat.ML])
    In this paper, we investigate a continuous time version of the Stochastic Langevin Monte Carlo method, introduced in [WT11], that incorporates a stochastic sampling step inside the traditional over-damped Langevin diffusion. This method is popular in machine learning for sampling from posterior distributions. We pay specific attention in our work to the computational cost in terms of $n$ (the number of observations that produces the posterior distribution), and $d$ (the dimension of the ambient space where the parameter of interest is living). We derive our analysis in the weakly convex framework, which is parameterized with the help of the Kurdyka-\L ojasiewicz (KL) inequality; this framework handles vanishing-curvature settings and is far less restrictive than the simple strongly convex case. We establish that the final horizon of simulation to obtain an $\varepsilon$ approximation (in terms of entropy) is of the order $( d \log(n)^2 )^{(1+r)^2} [\log^2(\varepsilon^{-1}) + n^2 d^{2(1+r)} \log^{4(1+r)}(n) ]$ with a Poissonian subsampling of parameter $\left(n ( d \log^2(n))^{1+r}\right)^{-1}$, where the parameter $r$ is involved in the KL inequality and varies between $0$ (strongly convex case) and $1$ (limiting Laplace situation).  ( 2 min )
    Exploration in Model-based Reinforcement Learning with Randomized Reward. (arXiv:2301.03142v1 [stat.ML])
    Model-based Reinforcement Learning (MBRL) has been widely adopted due to its sample efficiency. However, existing worst-case regret analysis typically requires optimistic planning, which is not realistic in general. In contrast, motivated by the theory, empirical studies utilize ensembles of models, which achieve state-of-the-art performance on various testing environments. Such deviation between theory and empirical study leads us to ask whether randomized model ensembles guarantee optimism, and hence the optimal worst-case regret. This paper partially answers this question from the perspective of reward randomization, a scarcely explored direction of exploration with MBRL. We show that under the kernelized linear regulator (KNR) model, reward randomization guarantees a partial optimism, which further yields a near-optimal worst-case regret in terms of the number of interactions. We further extend our theory to generalized function approximation and identify conditions for reward randomization to attain provably efficient exploration. Correspondingly, we propose concrete examples of efficient reward randomization. To the best of our knowledge, our analysis establishes the first worst-case regret analysis on randomized MBRL with function approximation.  ( 2 min )
    Online Centralized Non-parametric Change-point Detection via Graph-based Likelihood-ratio Estimation. (arXiv:2301.03011v1 [stat.ML])
    Consider each node of a graph to be generating a data stream that is synchronized and observed at near real-time. At a change-point $\tau$, a change occurs at a subset of nodes $C$, which affects the probability distribution of their associated node streams. In this paper, we propose a novel kernel-based method to both detect $\tau$ and localize $C$, based on the direct estimation of the likelihood-ratio between the post-change and the pre-change distributions of the node streams. Our main working hypothesis is the smoothness of the likelihood-ratio estimates over the graph, i.e., connected nodes are expected to have similar likelihood-ratios. The quality of the proposed method is demonstrated in extensive experiments on synthetic scenarios.  ( 2 min )
    On Consistency and Asymptotic Normality of Least Absolute Deviation Estimators for 2-dimensional Sinusoidal Model. (arXiv:2301.03229v1 [math.ST])
    Estimation of the parameters of a 2-dimensional sinusoidal model is a fundamental problem in digital signal processing. In this paper, we propose robust least absolute deviation (LAD) estimators for parameter estimation. The proposed methodology provides a robust alternative to non-robust estimation techniques like the least squares estimators, in situations where outliers are present in the data or in the presence of heavy tailed noise. We study important asymptotic properties of the LAD estimators and establish their strong consistency and asymptotic normality. We further illustrate the advantage of using LAD estimators over least squares estimators through extensive simulation studies.  ( 2 min )
    The Optimal Input-Independent Baseline for Binary Classification: The Dutch Draw. (arXiv:2301.03318v1 [cs.LG])
    Before any binary classification model is taken into practice, it is important to validate its performance on a proper test set. Without a frame of reference given by a baseline method, it is impossible to determine if a score is `good' or `bad'. The goal of this paper is to examine all baseline methods that are independent of feature values and determine which model is the `best' and why. By identifying which baseline models are optimal, a crucial selection decision in the evaluation process is simplified. We prove that the recently proposed Dutch Draw baseline is the best input-independent classifier (independent of feature values) for all positional-invariant measures (independent of sequence order) assuming that the samples are randomly shuffled. This means that the Dutch Draw baseline is the optimal baseline under these intuitive requirements and should therefore be used in practice.  ( 2 min )
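    To make the idea of an input-independent baseline concrete, here is a small sketch under stated assumptions. It considers a classifier that ignores features and labels `k` of the `M` test samples as positive, chosen uniformly at random, then searches over `k` for the best expected score. Accuracy is used only as an illustrative measure, and the function names are hypothetical; the abstract itself covers a broader family of measures and does not prescribe this code.

```python
def expected_accuracy(M, P, k):
    """Expected accuracy when k of M samples are labelled positive at random.

    M: number of samples, P: number of true positives in the test set.
    The number of true positives among the k drawn samples is hypergeometric,
    so its mean is k * P / M.
    """
    expected_tp = k * P / M
    expected_tn = (M - k) * (M - P) / M
    return (expected_tp + expected_tn) / M

def best_input_independent_baseline(M, P):
    """Return (k, score) maximising the expected accuracy over k = 0..M."""
    scores = {k: expected_accuracy(M, P, k) for k in range(M + 1)}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]

# With 3 positives out of 10, predicting everything negative is optimal.
k_minority, score_minority = best_input_independent_baseline(10, 3)
# With 8 positives out of 10, predicting everything positive is optimal.
k_majority, score_majority = best_input_independent_baseline(10, 8)
```

    For accuracy the expected score is linear in `k`, so the optimum sits at one of the endpoints and reduces to always predicting the majority class; any trained model should be compared against that score.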
    A Sublinear-Time Quantum Algorithm for Approximating Partition Functions. (arXiv:2207.08643v2 [quant-ph] UPDATED)
    We present a novel quantum algorithm for estimating Gibbs partition functions in sublinear time with respect to the logarithm of the size of the state space. This is the first speed-up of this type to be obtained over the seminal nearly-linear time algorithm of \v{S}tefankovi\v{c}, Vempala and Vigoda [JACM, 2009]. Our result also preserves the quadratic speed-up in precision and spectral gap achieved in previous work by exploiting the properties of quantum Markov chains. As an application, we obtain new polynomial improvements over the best-known algorithms for computing the partition function of the Ising model, counting the number of $k$-colorings, matchings or independent sets of a graph, and estimating the volume of a convex body. Our approach relies on developing new variants of the quantum phase and amplitude estimation algorithms that return nearly unbiased estimates with low variance and without destroying their initial quantum state. We extend these subroutines into a nearly unbiased quantum mean estimator that reduces the variance quadratically faster than the classical empirical mean. No such estimator was known to exist prior to our work. These properties, which are of general interest, lead to better convergence guarantees within the paradigm of simulated annealing for computing partition functions.  ( 2 min )
    Subset verification and search algorithms for causal DAGs. (arXiv:2301.03180v1 [cs.LG])
    Learning causal relationships between variables is a fundamental task in causal inference and directed acyclic graphs (DAGs) are a popular choice to represent the causal relationships. As one can recover a causal graph only up to its Markov equivalence class from observations, interventions are often used for the recovery task. Interventions are costly in general and it is important to design algorithms that minimize the number of interventions performed. In this work, we study the problem of learning the causal relationships of a subset of edges (target edges) in a graph with as few interventions as possible. Under the assumptions of faithfulness, causal sufficiency, and ideal interventions, we study this problem in two settings: when the underlying ground truth causal graph is known (subset verification) and when it is unknown (subset search). For the subset verification problem, we provide an efficient algorithm to compute a minimum sized interventional set; we further extend these results to bounded size non-atomic interventions and node-dependent interventional costs. For the subset search problem, in the worst case, we show that no algorithm (even with adaptivity or randomization) can achieve an approximation ratio that is asymptotically better than the vertex cover of the target edges when compared with the subset verification number. This result is surprising as there exists a logarithmic approximation algorithm for the search problem when we wish to recover the whole causal graph. To obtain our results, we prove several interesting structural properties of interventional causal graphs that we believe have applications beyond the subset verification/search problems studied here.  ( 2 min )
    Batch Bayesian Optimization via Particle Gradient Flows. (arXiv:2209.04722v2 [stat.ML] UPDATED)
    Bayesian Optimisation (BO) methods seek to find global optima of objective functions which are only available as a black-box or are expensive to evaluate. Such methods construct a surrogate model for the objective function, quantifying the uncertainty in that surrogate through Bayesian inference. Objective evaluations are sequentially determined by maximising an acquisition function at each step. However, this ancillary optimisation problem can be highly non-trivial to solve, due to the non-convexity of the acquisition function, particularly in the case of batch Bayesian optimisation, where multiple points are selected in every step. In this work we reformulate batch BO as an optimisation problem over the space of probability measures. We construct a new acquisition function based on multipoint expected improvement which is convex over the space of probability measures. Practical schemes for solving this `inner' optimisation problem arise naturally as gradient flows of this objective function. We demonstrate the efficacy of this new method on different benchmark functions and compare with state-of-the-art batch BO methods.
    Accelerated Randomized Block-Coordinate Algorithms for Co-coercive Equations and Applications. (arXiv:2301.03113v1 [math.OC])
    In this paper, we develop an accelerated randomized block-coordinate algorithm to approximate a solution of a co-coercive equation. Such an equation plays a central role in optimization and related fields and covers many mathematical models as special cases, including convex optimization, convex-concave minimax, and variational inequality problems. Our algorithm relies on a recent Nesterov's accelerated interpretation of the Halpern fixed-point iteration in [48]. We establish that the new algorithm achieves $\mathcal{O}(1/k^2)$-convergence rate on $\mathbb{E}[\Vert Gx^k\Vert^2]$ through the last-iterate, where $G$ is the underlying co-coercive operator, $\mathbb{E}[\cdot]$ is the expectation, and $k$ is the iteration counter. This rate is significantly faster than $\mathcal{O}(1/k)$ rates in standard forward or gradient-based methods from the literature. We also prove $o(1/k^2)$ rates on both $\mathbb{E}[\Vert Gx^k\Vert^2]$ and $\mathbb{E}[\Vert x^{k+1} - x^{k}\Vert^2]$. Next, we apply our method to derive two accelerated randomized block coordinate variants of the forward-backward splitting and Douglas-Rachford splitting schemes, respectively for solving a monotone inclusion involving the sum of two operators. As a byproduct, these variants also have faster convergence rates than their non-accelerated counterparts. Finally, we apply our scheme to a finite-sum monotone inclusion that has various applications in machine learning and statistical learning, including federated learning. As a result, we obtain a novel federated learning-type algorithm with fast and provable convergence rates.  ( 2 min )
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v3 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied with a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.  ( 3 min )
    Mesoscopic modeling of hidden spiking neurons. (arXiv:2205.13493v2 [q-bio.NC] UPDATED)
    Can we use spiking neural networks (SNN) as generative models of multi-neuronal recordings, while taking into account that most neurons are unobserved? Modeling the unobserved neurons with large pools of hidden spiking neurons leads to severely underconstrained problems that are hard to tackle with maximum likelihood estimation. In this work, we use coarse-graining and mean-field approximations to derive a bottom-up, neuronally-grounded latent variable model (neuLVM), where the activity of the unobserved neurons is reduced to a low-dimensional mesoscopic description. In contrast to previous latent variable models, neuLVM can be explicitly mapped to a recurrent, multi-population SNN, giving it a transparent biological interpretation. We show, on synthetic spike trains, that a few observed neurons are sufficient for neuLVM to perform efficient model inversion of large SNNs, in the sense that it can recover connectivity parameters, infer single-trial latent population activity, reproduce ongoing metastable dynamics, and generalize when subjected to perturbations mimicking photo-stimulation.  ( 2 min )
    A Classification of $G$-invariant Shallow Neural Networks. (arXiv:2205.09219v5 [cs.LG] UPDATED)
    When trying to fit a deep neural network (DNN) to a $G$-invariant target function with $G$ a group, it only makes sense to constrain the DNN to be $G$-invariant as well. However, there can be many different ways to do this, thus raising the problem of ``$G$-invariant neural architecture design'': What is the optimal $G$-invariant architecture for a given problem? Before we can consider the optimization problem itself, we must understand the search space, the architectures in it, and how they relate to one another. In this paper, we take a first step towards this goal; we prove a theorem that gives a classification of all $G$-invariant single-hidden-layer or ``shallow'' neural network ($G$-SNN) architectures with ReLU activation for any finite orthogonal group $G$, and we prove a second theorem that characterizes the inclusion maps or ``network morphisms'' between the architectures that can be leveraged during neural architecture search (NAS). The proof is based on a correspondence of every $G$-SNN to a signed permutation representation of $G$ acting on the hidden neurons; the classification is equivalently given in terms of the first cohomology classes of $G$, thus admitting a topological interpretation. The $G$-SNN architectures corresponding to nontrivial cohomology classes have, to our knowledge, never been explicitly identified in the literature previously. Using a code implementation, we enumerate the $G$-SNN architectures for some example groups $G$ and visualize their structure. Finally, we prove that architectures corresponding to inequivalent cohomology classes coincide in function space only when their weight matrices are zero, and we discuss the implications of this for NAS.  ( 3 min )
    Exponential Family Model-Based Reinforcement Learning via Score Matching. (arXiv:2112.14195v2 [cs.LG] UPDATED)
    We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters).  ( 2 min )
    Meta-Analysis of Randomized Experiments with Applications to Heavy-Tailed Response Data. (arXiv:2112.07602v5 [stat.ME] UPDATED)
    A central obstacle in the objective assessment of treatment effect (TE) estimators in randomized control trials (RCTs) is the lack of ground truth (or validation set) to test their performance. In this paper, we propose a novel cross-validation-like methodology to address this challenge. The key insight of our procedure is that the noisy (but unbiased) difference-of-means estimate can be used as a ground truth ``label" on a portion of the RCT, to test the performance of an estimator trained on the other portion. We combine this insight with an aggregation scheme, which borrows statistical strength across a large collection of RCTs, to present an end-to-end methodology for judging an estimator's ability to recover the underlying treatment effect as well as produce an optimal treatment "roll out" policy. We evaluate our methodology across 699 RCTs implemented in the Amazon supply chain. In this heavy-tailed setting, our methodology suggests that procedures that aggressively downweight or truncate large values, while introducing bias, lower the variance enough to ensure that the treatment effect is more accurately estimated.  ( 2 min )
    Wasserstein Iterative Networks for Barycenter Estimation. (arXiv:2201.12245v2 [cs.LG] UPDATED)
    Wasserstein barycenters have become popular due to their ability to represent the average of probability measures in a geometrically meaningful way. In this paper, we present an algorithm to approximate the Wasserstein-2 barycenters of continuous measures via a generative model. Previous approaches rely on regularization (entropic/quadratic) which introduces bias or on input convex neural networks which are not expressive enough for large-scale tasks. In contrast, our algorithm does not introduce bias and allows using arbitrary neural networks. In addition, based on the celebrity faces dataset, we construct Ave, celeba! dataset which can be used for quantitative evaluation of barycenter algorithms by using standard metrics of generative models such as FID.  ( 2 min )
    Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods. (arXiv:2205.01303v3 [stat.ML] UPDATED)
    In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).  ( 2 min )
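    For readers unfamiliar with the stochastic approximation recursion discussed above, the following is a minimal sketch of a scalar Robbins-Monro iteration. Everything specific here is an assumption chosen for illustration: the drift $f(\theta) = -(\theta - 2)$ (whose ODE has the single globally attractive equilibrium $\theta = 2$), the Gaussian noise, and the step sizes $a_k = 1/(k+1)$. It does not reproduce the paper's conditions or proofs.

```python
import numpy as np

def stochastic_approximation(noisy_f, theta0, num_steps, rng):
    """Run theta_{k+1} = theta_k + a_k * (f(theta_k) + noise), a_k = 1/(k+1)."""
    theta = float(theta0)
    for k in range(num_steps):
        step = 1.0 / (k + 1)  # steps sum to infinity, squared steps are summable
        theta = theta + step * noisy_f(theta, rng)
    return theta

target = 2.0  # equilibrium of the illustrative ODE d(theta)/dt = -(theta - target)

def noisy_f(theta, rng):
    # Noisy measurement of the drift: true drift plus zero-mean Gaussian noise.
    return -(theta - target) + rng.normal()

rng = np.random.default_rng(0)
theta_hat = stochastic_approximation(noisy_f, theta0=10.0, num_steps=20000, rng=rng)
```

    With this particular step-size choice the iterate is a running average of noisy observations of the target, so the iterations stay bounded and settle near 2 despite the distant starting point.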
    Optimization-based Causal Estimation from Heterogenous Environments. (arXiv:2109.11990v2 [stat.ME] UPDATED)
    This paper presents a new optimization approach to causal estimation. Given data that contains covariates and an outcome, which covariates are causes of the outcome, and what is the strength of the causality? In classical machine learning (ML), the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit a non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but they prevent us from causally interpreting the result. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently-proposed idea of environments, datasets of covariates/response where the causal relationships remain invariant but where the distribution of the covariates changes from environment to environment. Given datasets from multiple environments -- and ones that exhibit sufficient heterogeneity -- CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and existing methods, CoCo provides more accurate estimates of the causal model.  ( 2 min )
    Computationally Efficient Approximations for Matrix-based Renyi's Entropy. (arXiv:2112.13720v4 [stat.ML] UPDATED)
    The recently developed matrix-based Renyi's entropy enables measurement of information in data simply using the eigenspectrum of symmetric positive semidefinite (PSD) matrices in reproducing kernel Hilbert space, without estimation of the underlying data distribution. This intriguing property has led to the measure being widely adopted in multiple statistical inference and learning tasks. However, computing this quantity involves the trace of a PSD matrix $G$ raised to the power $\alpha$ (i.e., $tr(G^\alpha)$), with a normal complexity of nearly $O(n^3)$, which severely hampers its practical usage when the number of samples (i.e., $n$) is large. In this work, we present computationally efficient approximations to this new entropy functional that can reduce its complexity to even significantly less than $O(n^2)$. To this end, we leverage the recent progress on Randomized Numerical Linear Algebra, developing Taylor, Chebyshev and Lanczos approximations to $tr(G^\alpha)$ for arbitrary values of $\alpha$ by converting it into a matrix-vector multiplication problem. We also establish the connection between the matrix-based Renyi's entropy and PSD matrix approximation, which enables exploiting both clustering and block low-rank structure of $G$ to further reduce the computational cost. We theoretically provide approximation accuracy guarantees and illustrate the properties of different approximations. Large-scale experimental evaluations on both synthetic and real-world data corroborate our theoretical findings, showing promising speedup with negligible loss in accuracy.  ( 2 min )
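    To make the "matrix-vector multiplication" idea concrete, here is a minimal sketch of a randomized (Hutchinson-style) trace estimator. It is not the paper's algorithm: it assumes a positive integer $\alpha$, which sidesteps the Taylor, Chebyshev and Lanczos approximations the abstract mentions, and it assumes the usual definition of the matrix-based entropy, $\frac{1}{1-\alpha}\log_2 tr(G^\alpha)$ with $tr(G)=1$. All function names are hypothetical.

```python
import math
import random


def matvec(G, v):
    """Multiply a square matrix G (list of rows) by a vector v."""
    return [sum(g_ij * v_j for g_ij, v_j in zip(row, v)) for row in G]


def hutchinson_trace_power(G, alpha, num_probes=64, seed=0):
    """Estimate tr(G^alpha) for a positive integer alpha using only mat-vecs."""
    rng = random.Random(seed)
    n = len(G)
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe vector: entries are +1 or -1 with equal probability.
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        w = z
        for _ in range(alpha):
            w = matvec(G, w)  # after the loop, w = G^alpha z
        total += sum(z_i * w_i for z_i, w_i in zip(z, w))  # z^T G^alpha z
    return total / num_probes


def renyi_entropy(trace_power, alpha):
    """Matrix-based Renyi entropy (base 2) from tr(G^alpha), assuming tr(G) = 1."""
    return math.log2(trace_power) / (1.0 - alpha)


# Toy PSD matrix with unit trace. It is diagonal, so every probe gives the
# exact trace; for a general G the estimate is only correct in expectation.
G = [[0.5, 0.0, 0.0],
     [0.0, 0.3, 0.0],
     [0.0, 0.0, 0.2]]
estimate = hutchinson_trace_power(G, alpha=2)
entropy = renyi_entropy(estimate, alpha=2)
```

    Each probe costs $\alpha$ matrix-vector products, so no eigendecomposition of $G$ is needed; how many probes are required for a given accuracy is exactly the kind of question the paper's guarantees address.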
    Reinforcement Learning for Joint Optimization of Multiple Rewards. (arXiv:1909.02940v4 [cs.LG] UPDATED)
    Finding optimal policies which maximize long term rewards of Markov Decision Processes requires the use of dynamic programming and backward induction to solve the Bellman optimality equation. However, many real-world problems require optimization of an objective that is non-linear in cumulative rewards for which dynamic programming cannot be applied directly. For example, in a resource allocation problem, one of the objectives is to maximize long-term fairness among the users. We notice that when an agent aims to optimize some function of the sum of rewards, the problem loses its Markov nature. This paper addresses and formalizes the problem of optimizing a non-linear function of the long term average of rewards. We propose model-based and model-free algorithms to learn the policy, where the model-based policy is shown to achieve a regret of $\Tilde{O}\left(LKDS\sqrt{\frac{A}{T}}\right)$ for $K$ objectives combined with a concave $L$-Lipschitz function. Further, using fairness in cellular base-station scheduling and queueing system scheduling as examples, the proposed algorithm is shown to significantly outperform the conventional RL approaches.  ( 2 min )
    PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks. (arXiv:2006.07794v2 [cs.LG] UPDATED)
    Large capacity deep learning models are often prone to a high generalization gap when trained with a limited amount of labeled training data. A recent class of methods to address this problem uses various ways to construct a new training sample by mixing a pair (or more) of training samples. We propose PatchUp, a hidden state block-level regularization technique for Convolutional Neural Networks (CNNs), that is applied on selected contiguous blocks of feature maps from a random pair of samples. Our approach improves the robustness of CNN models against the manifold intrusion problem that may occur in other state-of-the-art mixing approaches. Moreover, since we are mixing the contiguous block of features in the hidden space, which has more dimensions than the input space, we obtain more diverse samples for training towards different dimensions. Our experiments on CIFAR10/100, SVHN, Tiny-ImageNet, and ImageNet using ResNet architectures including PreActResnet18/34, WRN-28-10, ResNet101/152 models show that PatchUp improves upon, or equals, the performance of current state-of-the-art regularizers for CNNs. We also show that PatchUp can provide a better generalization to deformed samples and is more robust against adversarial attacks.  ( 2 min )
    Differentially private inference via noisy optimization. (arXiv:2103.11003v3 [math.ST] UPDATED)
    We propose a general optimization-based framework for computing differentially private M-estimators and a new method for constructing differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent or noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish local and global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a nearly optimal neighborhood of the non-private M-estimators. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to approximate pivotal statistics for constructing confidence regions and conducting hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods in several numerical examples.  ( 2 min )
    Simple Binary Hypothesis Testing under Local Differential Privacy and Communication Constraints. (arXiv:2301.03566v1 [math.ST])
    We study simple binary hypothesis testing under both local differential privacy (LDP) and communication constraints. We qualify our results as either minimax optimal or instance optimal: the former hold for the set of distribution pairs with prescribed Hellinger divergence and total variation distance, whereas the latter hold for specific distribution pairs. For the sample complexity of simple hypothesis testing under pure LDP constraints, we establish instance-optimal bounds for distributions with binary support; minimax-optimal bounds for general distributions; and (approximately) instance-optimal, computationally efficient algorithms for general distributions. When both privacy and communication constraints are present, we develop instance-optimal, computationally efficient algorithms that achieve the minimum possible sample complexity (up to universal constants). Our results on instance-optimal algorithms hinge on identifying the extreme points of the joint range set $\mathcal A$ of two distributions $p$ and $q$, defined as $\mathcal A := \{(\mathbf T p, \mathbf T q) | \mathbf T \in \mathcal C\}$, where $\mathcal C$ is the set of channels characterizing the constraints.  ( 2 min )
    Concentration of measure and generalized product of random vectors with an application to Hanson-Wright-like inequalities. (arXiv:2102.08020v3 [math.PR] UPDATED)
    Starting from concentration of measure hypotheses on $m$ random vectors $Z_1,\ldots, Z_m$, this article provides an expression of the concentration of functionals $\phi(Z_1,\ldots, Z_m)$ where the variations of $\phi$ on each variable depend on the product of the norms (or semi-norms) of the other variables (as if $\phi$ were a product). We illustrate the importance of this result through various generalizations of the Hanson-Wright concentration inequality as well as through a study of the random matrix $XDX^T$ and its resolvent $Q = (I_p - \frac{1}{n}XDX^T)^{-1}$, where $X$ and $D$ are random, which have fundamental interest in statistical machine learning applications.  ( 2 min )
    Spectral properties of sample covariance matrices arising from random matrices with independent non identically distributed columns. (arXiv:2109.02644v3 [math.PR] UPDATED)
    Given a random matrix $X= (x_1,\ldots, x_n)\in \mathcal M_{p,n}$ with independent columns and satisfying concentration of measure hypotheses and a parameter $z$ whose distance to the spectrum of $\frac{1}{n} XX^T$ should not depend on $p,n$, it was previously shown that the functionals $\text{tr}(AR(z))$, for $R(z) = (\frac{1}{n}XX^T- zI_p)^{-1}$ and $A\in \mathcal M_{p}$ deterministic, have a standard deviation of order $O(\|A\|_* / \sqrt n)$. Here, we show that $\|\mathbb E[R(z)] - \tilde R(z)\|_F \leq O(1/\sqrt n)$, where $\tilde R(z)$ is a deterministic matrix depending only on $z$ and on the means and covariances of the column vectors $x_1,\ldots, x_n$ (that do not have to be identically distributed). This estimation is key to providing accurate fluctuation rates of functionals of $X$ of interest (mostly related to its spectral properties) and is proved thanks to the introduction of a semi-metric $d_s$ defined on the set $\mathcal D_n(\mathbb H)$ of diagonal matrices with complex entries and positive imaginary part and satisfying, for all $D,D' \in \mathcal D_n(\mathbb H)$: $d_s(D,D') = \max_{i\in[n]} |D_i - D_i'|/ (\Im(D_i) \Im(D_i'))^{1/2}$. Possibly most importantly, the underlying concentration of measure assumption on the columns of $X$ finds an extremely natural ground for application in modern statistical machine learning algorithms where non-linear Lipschitz mappings and high number of classes form the base ingredients.  ( 2 min )

  • Open

    [D] Form on sharing ML codes
    Hello everyone, I would kindly ask you if you could help me get some insights about people’s preferences when sharing ML codes, with special focus on neural networks. I am here linking a very quick Google Form. Please, feel free to reach out. https://forms.gle/4zg5HLqLaEESuVTz9 submitted by /u/Fc3692 [link] [comments]  ( 57 min )
    [D] Soft Prompt Training Issue
    I am implementing soft prompt tuning (reproducing https://arxiv.org/abs/2104.08691v2) for my research project, but after training the model predicts only "False" on a T/F classification task (BoolQ dataset). I have run the rest of the code with full model fine-tuning and it works, which rules out the dataset and the trainer. Some observations that rule out other possible issues: (1) the soft prompt parameters do change during soft prompt training, so gradient backprop into the soft prompt is working; (2) the training loss goes down normally, just as in full model fine-tuning. Any idea on how to debug the issue? submitted by /u/SEAIndigenous [link] [comments]  ( 57 min )
    [N] Microsoft Considers $10 Billion Investment in ChatGPT Creator --Bloomberg News
    Story here: https://www.bloomberg.com/news/articles/2023-01-10/microsoft-weighs-10-billion-chatgpt-investment-semafor-says?srnd=premium Unpaywalled: https://archive.ph/XOOlg submitted by /u/bikeskata [link] [comments]  ( 61 min )
    [D] Found very similar paper to my submitted paper on Arxiv
    If the mods want to ban this because it falls outside of meaningful discussion, that’s ok. I have a paper in the review process for CVPR atm. A couple of hours ago I stumbled upon an Arxiv paper uploaded 2 days ago that replicates my method almost exactly, save for a few differences in how the inputs are processed and how the problem is defined (super super similar problems). Their paper achieves far better results than mine, is tested on more datasets than mine, and comes from a big well known research group in my field to boot. I guess I feel a bit dejected? The approach was truly novel and nobody had done it before. Even with my limited training, it was showing very promising results. I couldn’t train for longer to improve the model further due to a lack of hardware/budget, and I couldn’t test on more datasets for the same reason. It’ll probably get rejected from CVPR for those very reasons. I’m not complaining about this, it was my decision to submit there and take the chance, but damn. In hindsight I should’ve maybe gone for an easier journal or something and at least been guaranteed to be the first. 😔 Sorry if that was a bit of a rant, I just figure people here can relate a bit. submitted by /u/TightestKnees [link] [comments]  ( 64 min )
    [P] Evaluating several topic modeling implementations. What's the current best practice? BERTopic? OpenAI Ada-002?
    I have a set of ~100 topic categories, and I want to determine which are semantically close to a text input. I've found several implementations, but I know some (LDA) are already obsolete. OpenAI's text-embedding-ada-002 model just came out so I'm wondering if that's the best option now. Other topic modeling implementations: "Multi-Class Text Classification with Doc2Vec & Logistic Regression"; "Build taxonomy-based contextual targeting using AWS Media Intelligence and Hugging Face BERT"; "Topic Modeling with BERTopic". submitted by /u/gravenbirdman [link] [comments]  ( 60 min )
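    Whichever embedding model is chosen, matching a text to a fixed set of topic categories usually comes down to a nearest-neighbour lookup in embedding space. A minimal sketch, assuming the embeddings have already been computed by some model; the 3-dimensional vectors and topic names below are made up for illustration and are not the output of any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(text_vec, topic_vecs, top_k=3):
    """Return the top_k (topic, similarity) pairs, most similar first."""
    scored = [(name, cosine_similarity(text_vec, vec))
              for name, vec in topic_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings; a real model would return hundreds of dimensions.
topic_vecs = {
    "sports": [0.9, 0.1, 0.0],
    "finance": [0.1, 0.9, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}
text_vec = [0.8, 0.3, 0.1]  # hypothetical embedding of the input text
ranking = rank_topics(text_vec, topic_vecs, top_k=2)
```

    With only ~100 categories, the topic embeddings can be computed once and cached, so each new text costs one embedding call plus ~100 dot products.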
    [R] Class-Continuous Conditional Generative Neural Radiance Field
    Paper: https://arxiv.org/abs/2301.00950 Project Page: https://tom919654.github.io/C3G_NeRF/ (Videos included) Code: https://github.com/tom919654/C3G-NeRF Abstract: The 3D-aware image synthesis focuses on conserving spatial consistency besides generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate a generative NeRF and show remarkable achievement, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF (C3G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed C3G-NeRF is evaluated with three image datasets, AFHQ, CelebA, and Cars. As a result, our model shows strong 3D-consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, C3G-NeRF exhibits a Frechet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis with a $128^2$ resolution. Additionally, we provide FIDs of generated 3D-aware images of each class of the datasets as it is possible to synthesize class-conditional images with C3G-NeRF. Results: https://preview.redd.it/9y8b316dx4ba1.png?width=1750&format=png&auto=webp&s=8470793ff3974932058e1bd196ba7a857375f22f https://preview.redd.it/0icxw8blx4ba1.png?width=1595&format=png&auto=webp&s=4a7c7226d7a9c5a2b28492aca8157c7037b4a647 https://reddit.com/link/107z8z1/video/ka6cqf1gx4ba1/player https://reddit.com/link/107z8z1/video/2lm0bamix4ba1/player submitted by /u/JiwookKim [link] [comments]  ( 60 min )
    [D] Sample Average Approximation-Samples to be used
    Consider a machine learning scenario with some pre-available training samples S. Suppose the objective function contains an expectation over some reference distribution P_0 whose parameter (for example, the mean) has been estimated from the training samples S. When performing Sample Average Approximation for that expectation, is it necessary to sample from the distribution of interest P_0, or can we directly use the training samples that we have? Could you please help me understand this? submitted by /u/RecentUnicorn [link] [comments]  ( 57 min )
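    The two options in the question can be compared numerically. The sketch below is only an illustration under stated assumptions: the integrand f, the Gaussian form of P_0, and the synthetic training set S are all hypothetical choices, not part of the original post.

```python
import random
import statistics


def sample_average(f, samples):
    """Sample Average Approximation of E[f(X)]: the mean of f over the samples."""
    return sum(f(x) for x in samples) / len(samples)


def f(x):
    """Hypothetical integrand inside the expectation."""
    return x * x


rng = random.Random(0)
# Hypothetical training set S, drawn here from N(2, 1) purely for illustration.
S = [rng.gauss(2.0, 1.0) for _ in range(20000)]

# Option A: reuse the training samples directly (empirical distribution of S).
saa_from_data = sample_average(f, S)

# Option B: fit P_0 (assumed Gaussian) to S, then draw fresh samples from it.
mu_hat = statistics.fmean(S)
sigma_hat = statistics.stdev(S)
fresh = [rng.gauss(mu_hat, sigma_hat) for _ in range(20000)]
saa_from_p0 = sample_average(f, fresh)
```

    In this toy setup both estimates land close to the true value E[X^2] = 5, because the assumed family for P_0 matches how S was generated. If P_0 were misspecified, option B would approximate the expectation under the fitted model, while option A would approximate it under the empirical distribution of the data, so the two need not agree.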
  • Open

    "Comments on the Origin and Application of Markov Decision Processes", Howard 2002 (optimizing Sears Catalogue mailings ~1959 with value iteration & inventing policy iteration)
    submitted by /u/gwern [link] [comments]  ( 59 min )
    Episode Q0 is decreasing while cumulative reward increases (and doesn't converge to an optimal policy)
    I am using Matlab Simulink/Simscape to train a two-wheeled balancing robot to balance using a DDPG agent. I've tried tuning hyperparameters like learning rate, discount factor, and mini batch size to no avail. I've tweaked my reward function many times, and I feel like it's alright. For some reason my episode Q0 decreases when my reward improves. I believe this indicates that the critic and actor are disagreeing. Right? Does anybody have suggestions? If need be, I can include my script. Training Curve (orange is the episode Q0 and blue is the episode cumulative reward) submitted by /u/FenderBender43 [link] [comments]  ( 24 min )
    Let’s learn how to use Unity ML-Agents and train a bear 🐻 to shoot snowballs (Deep Reinforcement Learning Free Course by Hugging Face 🤗)
    Hey there! I’m happy to announce that we just published the fifth Unit of the Deep Reinforcement Learning Course 🥳 In this Unit, we’ll learn to use the Unity ML-Agents library by training two agents: The first one will learn to shoot snowballs at the spawning target. The second needs to press a button to spawn a pyramid, then navigate to the pyramid, knock it over, and move to the gold brick at the top. To do that, it will need to explore its environment, and we will use a technique called curiosity. Then, after training, you’ll push the trained agents to the Hugging Face Hub, and you’ll be able to visualize them playing directly in your browser without having to use the Unity Editor. Start Learning now 👉 https://huggingface.co/deep-rl-course/unit5/introduction https://preview.redd.it/l1nhb7vz59ba1.png?width=1920&format=png&auto=webp&s=3720dc7a454bdf6e4736a4ec3a4914647a06b564 If you want to start studying Deep Reinforcement Learning, we launched this course, and you’re right on time: 2023 is the perfect year to start. We wrote an introduction unit to help you get started. You can start learning now 👉 https://huggingface.co/deep-rl-course/unit0/introduction If you have questions or feedback I would love to answer them. submitted by /u/cranthir_ [link] [comments]  ( 58 min )
    Gymnasium-Robotics 1.2.0 - which now includes maintained versions of the environments from D4RL and the MaMuJoCo environments - is now live
    submitted by /u/jkterry1 [link] [comments]  ( 57 min )
    Deep RL with Mujoco environments using Docker on Apple Silicon
    Hi everyone, I am relatively new to the Apple ecosystem, their way of doing things, and Docker in general. I want to get started with using MuJoCo environments from within a Docker image. My reasons: first, I do not want to pollute my path and OS with the dependencies of multiple MuJoCo and deep RL libraries. Secondly, I would like to have a reproducible image that can be used cross-platform. I generally debug on my personal machine, and do the actual training on university-provided machines which use Nvidia hardware. Is there a way to accomplish all my goals? If not, can I at least successfully complete goal 1? submitted by /u/adeecc [link] [comments]  ( 65 min )
    Reconstruction loss for VAE model for skill learning
    I was reading this paper: ASPiRe: Adaptive Skill Priors for Reinforcement Learning. When looking at their code to train the VAE and priors for skills, I noticed that for the reconstruction loss, instead of using the classical mean squared error, they use: loss = -Normal(loc=actions_hat, scale=1).logprob(actions) It can be seen in line 179 of this file. I had never seen anyone use this reconstruction loss. Is there a good reason to use this loss? Any empirical support? I'd appreciate any help to address this question. Edit*: I just learnt this is the negative log-likelihood loss. My question still remains. Why is this preferable to MSE? submitted by /u/carlml [link] [comments]  ( 62 min )
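    One way to look at the question: when the scale is fixed at 1, the negative Gaussian log-likelihood is half the squared error plus a constant that does not depend on the prediction. The sketch below checks this numerically in plain Python; it does not use the paper's code, and the function names and toy values are hypothetical.

```python
import math


def gaussian_nll(actions, actions_hat, scale=1.0):
    """Summed negative log-density of actions under Normal(actions_hat, scale)."""
    log_norm = math.log(scale) + 0.5 * math.log(2.0 * math.pi)
    return sum(0.5 * ((a - a_hat) / scale) ** 2 + log_norm
               for a, a_hat in zip(actions, actions_hat))


def sum_squared_error(actions, actions_hat):
    """Plain sum of squared differences (MSE before dividing by the count)."""
    return sum((a - a_hat) ** 2 for a, a_hat in zip(actions, actions_hat))


actions = [0.2, -1.0, 0.7]      # hypothetical target actions
actions_hat = [0.0, -0.5, 1.0]  # hypothetical reconstructions

nll = gaussian_nll(actions, actions_hat)
sse = sum_squared_error(actions, actions_hat)
# Constant term: one 0.5 * log(2 * pi) per element when scale = 1.
constant = len(actions) * 0.5 * math.log(2.0 * math.pi)
```

    Here nll equals 0.5 * sse + constant, so with a unit scale the two losses have the same minimizer and differ only by a rescaling of the gradient. The likelihood form starts to matter once the scale is learned or when the reconstruction term has to be weighted against the KL term of a VAE.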
  • Open

    A small reminder to appreciate the fundamentals!
    submitted by /u/Imagine-your-success [link] [comments]  ( 47 min )
    Any good ones for real video to animated?
    Hi, Looking for a solution to turn a real video into an animated video through AI or any automation process. I've some camera shy people that want to make a video. Animating it from scratch would make the price go 10x so I'm looking for a solution in the AI space. If you know, please drop a comment. submitted by /u/V-Sec [link] [comments]  ( 48 min )
    AI is not alien, it's us
    submitted by /u/pbw [link] [comments]  ( 46 min )
    AI stack for 2023 - any tools missing to work with this year?
    Thought this was a cool graphic - pulled from https://buildspace.so/notes/ai-stack-2023 (free resource) https://preview.redd.it/uxp5tbcmp9ba1.png?width=456&format=png&auto=webp&s=e31e1f81b7250ddafa5359579e00b4f600def00b submitted by /u/bruclinbrocoli [link] [comments]  ( 49 min )
    in a new article on Hackernoon, i write about how copy and paste can be said to be a forerunner for the digital revolution of the AI text generator. kindly read it here: https://hackernoon.com/from-copy-and-paste-to-ai-text-generator-a-revolution-of-the-digital-age
    submitted by /u/Techoyy [link] [comments]  ( 47 min )
    Is law future proof career ?
    As a newly graduated lawyer, I have been experiencing anxiety about the future of my profession. With the rapid advancements in technology, I can't help but wonder if the legal field will become obsolete in the foreseeable future and if my four years of law school were all in vain. My fears increased after the release of GPT-3. Should I think about a career change to software/web development to adapt to the changing reality? submitted by /u/No_Car5573 [link] [comments]  ( 52 min )
    Nerf Technology with Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 48 min )
    German Drugmaker BioNTech To Buy AI Startup InstaDeep In $680M Deal
    submitted by /u/The-Techie [link] [comments]  ( 47 min )
    A good free text generator
    Hey, im looking for an actual free text generator that i can write simple articles of 100-300 words with. Most of these "free" sites offer a certain amount of credits etc. Preferably in multiple languages. Anyone have any ideas? submitted by /u/HutsKoning69 [link] [comments]  ( 48 min )
    What is the best paper on AI that you have read in 2022 and why?
    submitted by /u/tiensss [link] [comments]  ( 48 min )
    Some Ultra-Modern Generative Ai
    submitted by /u/Imagine-your-success [link] [comments]  ( 54 min )
    Bringing Extinct Dinosaurs Back To Life Using AI
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    How You can Apply AI Logics for Scaling Your Business in 2023?
    submitted by /u/yudiz [link] [comments]  ( 48 min )
    Microsoft's AI Tool VALL-E can imitate anyone's voice with just a three-second sample
    submitted by /u/qptbook [link] [comments]  ( 47 min )
    Microsoft Will Likely Invest $10 billion for 49 Percent Stake in OpenAI
    submitted by /u/BackgroundResult [link] [comments]  ( 62 min )
    Scanning text for sensitive/controversial material
    Is there any way to use AI to process text like that? I heard that HR companies have similar processes to go through social media profiles and look at “red flags”, but is there any engine available to do that on any piece of text? Sorry if it seems like an obvious question, I’m not too well versed in the topic and google is no help as usual. submitted by /u/Big_Razzmatazz_9251 [link] [comments]  ( 52 min )
    Microsoft to Own 49% of OpenAI Once $10B Deal Closes
    submitted by /u/lambolifeofficial [link] [comments]  ( 45 min )
    ChatGPT - code and talks
    submitted by /u/adrp23 [link] [comments]  ( 47 min )
    AI-powered "robot" lawyer will be first of its kind to represent defendant in court
    submitted by /u/Itchy0101 [link] [comments]  ( 49 min )
    chatGPT knowingly withholds information, reveals it slowly upon nudging
    submitted by /u/reportaman [link] [comments]  ( 56 min )
    Best Language AI models to run locally?
    I was curious as to if there are any Language AI models (like Chat GPT) that you can use locally on your own machine (so you don't run into as many server issues like you do on Chat GPT). If so, which are the best and where can you find them? submitted by /u/CreativePolymath [link] [comments]  ( 46 min )
    artflow ai + waifu2x. Tatar slim girl with black hair +
    submitted by /u/SubjectAd1535 [link] [comments]  ( 47 min )
    How can you use/get access to Chinchilla AI?
    So I've been hearing about the Chinchilla AI model (text ai, like Chat GPT) and how great it is, but I haven't seen anything regarding the model or how to use it, or even if it's available to the public yet or not. Does anyone have any insight into this? Is there any way to use it currently? I've worked with Stable Diffusion through A1111, would there be a model that works with A1111? Thanks in advance! submitted by /u/CreativePolymath [link] [comments]  ( 47 min )
    Weekly China AI News from Jan.2 to Jan.8: AI Dominates 2023 Top Tech Trends; Behind Douyin's Popular AI Anime Effect; Go Master Accused of Cheating with AI-Like Play Style
    submitted by /u/trcytony [link] [comments]  ( 46 min )
  • Open

    DSC Weekly 10 January 2023 – Recession Analysis
    Announcements Recession Analysis Editor’s Note: This week’s editorial was written by a guest contributor. If you would like to submit an article or editorial, please contact the editors below. One question that many are pondering is whether we are entering in a recession, and when it will end. Especially regarding the tech sector. Are there… The post DSC Weekly 10 January 2023 – Recession Analysis appeared first on Data Science Central.  ( 20 min )
    Solution vs. Product-Implications for Agile Development
    A Solution is Not a Product Solution, Deliverable, Product, Work Product and other terms are often used interchangeably to describe output from development initiatives. However, there are some extremely important conceptual differences between the terms Solution and Product and understanding them should inform and guide your thinking about what you are doing as you go… The post Solution vs. Product-Implications for Agile Development appeared first on Data Science Central.  ( 22 min )
    Responsible AI by design
    Happy new year! One of the big trends in AI this year: AI is maturing as a domain. We are using AI to address complex problems. That means we will need to be more aware of the potential downsides of AI. I believe that a new trend could manifest: responsible AI by design. a) Responsible… The post Responsible AI by design appeared first on Data Science Central.  ( 19 min )
    Flexible Engagement Model to Hire Full-Stack Developers: A 2023 Guide
    The demand for web and app development is rising as there are significant improvements in the field of technology every year. More and more people are establishing careers in the field of development as a result of advancements in technologies and frameworks. As a result, the data from Statista says that by 2024, there will… The post Flexible Engagement Model to Hire Full-Stack Developers: A 2023 Guide appeared first on Data Science Central.  ( 21 min )
  • Open

    is there any software like AI or image editing I could use for generating an image of a paper with text on it (like someone wrote on it)
    Do you guys know these apps like yandex image translator that generate text out of an image ? I want to basically do the opposite but into images of papers (like someone has written on it) submitted by /u/SnooPineapples7791 [link] [comments]  ( 47 min )
    Nerf Technology with Stable Diffusion
    submitted by /u/oridnary_artist [link] [comments]  ( 51 min )
  • Open

    The Greenest Generation: NVIDIA, Intel and Partners Supercharge AI Computing Efficiency
    AI is at the heart of humanity’s most transformative innovations — from developing COVID vaccines at unprecedented speeds and diagnosing cancer to powering autonomous vehicles and understanding climate change. Virtually every industry will benefit from adopting AI, but the technology has become more resource intensive as neural networks have increased in complexity. To avoid placing […]  ( 7 min )
  • Open

    Best practices for load testing Amazon SageMaker real-time inference endpoints
    Amazon SageMaker is a fully managed machine learning (ML) service. With SageMaker, data scientists and developers can quickly and easily build and train ML models, and then directly deploy them into a production-ready hosted environment. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so […]  ( 12 min )
    Get smarter search results with the Amazon Kendra Intelligent Ranking and OpenSearch plugin
    If you’ve had the opportunity to build a search application for unstructured data (i.e., wiki, informational web sites, self-service help pages, internal documentation, etc.) using open source or commercial-off-the-shelf search engines, then you’re probably familiar with the inherent accuracy challenges involved in getting relevant search results. The intended meaning of both query and document can […]  ( 12 min )
  • Open

    Approximating 1/Γ(x)
    A few days ago a comment that a graph looked like a Maxwell-Boltzmann density led to an approximation of 1/Γ(x), possibly a useful approximation. Approximating Γ(x) is a well-known problem, and for large x the solution is to use Stirling’s approximation or a few more terms from the asymptotic series that Stirling’s approximation is a […] Approximating 1/Γ(x) first appeared on John D. Cook.  ( 6 min )
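    For reference, the large-x baseline mentioned in the excerpt can be checked in a few lines. This sketch shows Stirling's approximation of Γ(x), with and without the first correction term of the asymptotic series, compared against the standard library's math.gamma. It is not the 1/Γ(x) approximation from the post itself, and the function names are made up for this example.

```python
import math


def stirling_gamma(x):
    """Leading-order Stirling approximation of Gamma(x) for large x > 0."""
    return math.sqrt(2.0 * math.pi / x) * (x / math.e) ** x


def stirling_gamma_corrected(x):
    """Stirling's approximation with the first series correction, 1/(12x)."""
    return stirling_gamma(x) * (1.0 + 1.0 / (12.0 * x))


def relative_error(approx, exact):
    """Relative error of approx against a positive exact value."""
    return abs(approx - exact) / exact


x = 10.0
exact = math.gamma(x)  # Gamma(10) = 9! = 362880
err_leading = relative_error(stirling_gamma(x), exact)
err_corrected = relative_error(stirling_gamma_corrected(x), exact)
```

    At x = 10 the leading-order formula is already within one percent, and the single correction term shrinks the error by more than two orders of magnitude. Taking the reciprocal of either function gives a corresponding approximation of 1/Γ(x) for large x.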
  • Open

    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v2 [cs.LG] UPDATED)
    Monitoring the Covid19 pandemic constitutes a critical societal stake that has received considerable research effort. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists, which the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the nonsmooth functional into a Bayesian framework, several sampling schemes are tailored to handle the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. The performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts is compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools is illustrated on Covid19 data from several different countries.
    Reliable Time Prediction in the Markov Stochastic Block Model. (arXiv:2004.04402v3 [cs.SI] UPDATED)
    We introduce the Markov Stochastic Block Model (MSBM): a growth model for community based networks where node attributes are assigned through a Markovian dynamic. We rely on HMMs' literature to design prediction methods that are robust to local clustering errors. We focus specifically on the link prediction and collaborative filtering problems and we introduce a new model selection procedure to infer the number of hidden clusters in the network. Our approaches for reliable prediction in MSBMs are not algorithm-dependent in the sense that they can be applied using your favourite clustering tool. In this paper, we use a recent SDP method to infer the hidden communities and we provide theoretical guarantees. In particular, we identify the relevant signal-to-noise ratio (SNR) in our framework and we prove that the misclassification error decays exponentially fast with respect to this SNR.  ( 2 min )
    Signal Enhancement for Magnetic Navigation Challenge Problem. (arXiv:2007.12158v2 [cs.LG] UPDATED)
    Harnessing the magnetic field of the Earth for navigation has shown promise as a viable alternative to other navigation systems. A magnetic navigation system collects its own magnetic field data using a magnetometer and uses magnetic anomaly maps to determine the current location. The greatest challenge with magnetic navigation arises when the magnetic field measurements from the magnetometer encompass the magnetic field from not just the Earth, but also from the vehicle on which it is mounted. It is difficult to separate the Earth magnetic anomaly field, which is crucial for navigation, from the total magnetic field reading from the sensor. The purpose of this challenge problem is to decouple the Earth and aircraft magnetic signals in order to derive a clean signal from which to perform magnetic navigation. Baseline testing on the dataset has shown that the Earth magnetic field can be extracted from the total magnetic field using machine learning (ML). The challenge is to remove the aircraft magnetic field from the total magnetic field using a trained model. This challenge offers an opportunity to construct an effective model for removing the aircraft magnetic field from the dataset by using a scientific machine learning (SciML) approach comprised of an ML algorithm integrated with the physics of magnetic navigation.  ( 2 min )
    Ranking Inferences Based on the Top Choice of Multiway Comparisons. (arXiv:2211.11957v3 [stat.ME] UPDATED)
    This paper considers ranking inference of $n$ items based on the observed data on the top choice among $M$ randomly selected items at each trial. This is a useful modification of the Plackett-Luce model for $M$-way ranking with only the top choice observed and is an extension of the celebrated Bradley-Terry-Luce model that corresponds to $M=2$. Under a uniform sampling scheme in which any $M$ distinguished items are selected for comparisons with probability $p$ and the selected $M$ items are compared $L$ times with multinomial outcomes, we establish the statistical rates of convergence for underlying $n$ preference scores using both $\ell_2$-norm and $\ell_\infty$-norm, with the minimum sampling complexity. In addition, we establish the asymptotic normality of the maximum likelihood estimator that allows us to construct confidence intervals for the underlying scores. Furthermore, we propose a novel inference framework for ranking items through a sophisticated maximum pairwise difference statistic whose distribution is estimated via a valid Gaussian multiplier bootstrap. The estimated distribution is then used to construct simultaneous confidence intervals for the differences in the preference scores and the ranks of individual items. They also enable us to address various inference questions on the ranks of these items. Extensive simulation studies lend further support to our theoretical results. A real data application illustrates the usefulness of the proposed methods convincingly.  ( 2 min )
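    The top-choice observation model described above can be illustrated with a minimal sketch. It assumes the standard Plackett-Luce form, in which item $i$ is the top choice among a compared set $S$ with probability $e^{\theta_i} / \sum_{j \in S} e^{\theta_j}$; the function name and the example scores are hypothetical and not taken from the paper.

```python
import math


def top_choice_probs(scores, subset):
    """Probability that each item in `subset` is observed as the top choice.

    scores: dict mapping item -> latent preference score theta_i (hypothetical values).
    subset: the M items compared in a single trial.
    """
    # Shift by the largest score for numerical stability; the ratio is unchanged.
    shift = max(scores[i] for i in subset)
    weights = {i: math.exp(scores[i] - shift) for i in subset}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


scores = {"a": 1.0, "b": 0.0, "c": -0.5}  # illustrative scores only
three_way = top_choice_probs(scores, ["a", "b", "c"])  # M = 3
pairwise = top_choice_probs(scores, ["a", "b"])        # M = 2, the Bradley-Terry-Luce case
```

    With $M=2$ the probability reduces to a logistic function of the score difference, which is the Bradley-Terry-Luce special case mentioned in the abstract.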
    Evaluating counterfactual explanations using Pearl's counterfactual method. (arXiv:2301.02499v1 [stat.ML])
    Counterfactual explanations (CEs) are methods for generating an alternative scenario that produces a different desirable outcome. For example, if a student is predicted to fail a course, then counterfactual explanations can provide the student with alternate ways so that they would be predicted to pass. The applications are many. However, CEs are currently generated from machine learning models that do not necessarily take into account the true causal structure in the data. By doing this, bias can be introduced into the CE quantities. I propose in this study to test the CEs using Judea Pearl's method of computing counterfactuals, which has thus far, surprisingly, not been seen in the counterfactual explanation (CE) literature. I furthermore evaluate these CEs on three different causal structures to show how the true underlying causal structure affects the CEs that are generated. This study presents a method of evaluating CEs using Pearl's method and shows, albeit with a limited sample size, that thirty percent of the CEs conflicted with those computed by Pearl's method. This shows that we cannot simply trust CEs and it is vital for us to know the true causal structure before we blindly compute counterfactuals using the original machine learning model.  ( 2 min )
    Learning from a Biased Sample. (arXiv:2209.01754v2 [stat.ME] UPDATED)
    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it in. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in simulations and a case study on ICU length of stay prediction.  ( 2 min )
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v5 [cs.LG] UPDATED)
    Different from existing Differential Privacy (DP) accountants, we introduce pro-active DP. Existing DP accountants keep track of how privacy budget has been spent while pro-active DP is a scheme that allows one to a priori select parameters of DP-SGD based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$) in such a way as to optimize the anticipated utility (test accuracy). To implement this idea, we show how to convert the classical DP moment accountant to a pro-active DP by exploiting the fact that it has a simple closed form for computing spent privacy budget for a given interaction round. The DP moment accountant is introduced in context of DP-SGD and has the following property which is the key ingredient to build pro-active DP. In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation and normalizing with the clipping constant. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes "independent" of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our $\sigma$ only depends on $N$ rather than $T$. We show how this differential privacy characterization allows us to convert DP moment accountant to a pro-active DP.  ( 3 min )
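    As a worked example of the noise level quoted above, the following sketch evaluates $\sigma = \sqrt{2(\epsilon + \ln(1/\delta))/\epsilon}$ for a given budget. The function name is hypothetical, and the sketch does not check the regime condition $\epsilon=\Omega(T/N^2)$ under which the abstract states the formula holds; that is left to the caller.

```python
import math


def proactive_sigma(epsilon, delta):
    """Noise standard deviation from the closed form quoted in the abstract.

    Assumes the caller is already in the regime epsilon = Omega(T / N^2);
    this helper (a hypothetical name) only evaluates the expression.
    """
    if epsilon <= 0.0 or not (0.0 < delta < 1.0):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    return math.sqrt(2.0 * (epsilon + math.log(1.0 / delta)) / epsilon)


# Illustrative budget only: the number of rounds T does not appear as an argument.
sigma = proactive_sigma(epsilon=1.0, delta=1e-5)
```

    Note that the expression takes only $\epsilon$ and $\delta$ as inputs, which is the sense in which the abstract calls $\sigma$ "independent" of $T$.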
    Valid P-Value for Deep Learning-Driven Salient Region. (arXiv:2301.02437v1 [stat.ML])
    Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a salient region in the form of p-values. Our idea is to consider a salient region as a selected hypothesis by the trained deep learning model and employ the selective inference framework. The proposed method can provably control the probability of false positive detections of salient regions. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets. Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.  ( 2 min )
    DANLIP: Deep Autoregressive Networks for Locally Interpretable Probabilistic Forecasting. (arXiv:2301.02332v1 [cs.LG])
    Despite the high performance of neural network-based time series forecasting methods, the inherent challenge in explaining their predictions has limited their applicability in certain application areas. Due to the difficulty in identifying causal relationships between the input and output of such black-box methods, they rarely have been adopted in domains such as legal and medical fields in which the reliability and interpretability of the results can be essential. In this paper, we propose DANLIP, a novel deep learning-based probabilistic time series forecasting architecture that is intrinsically interpretable. We conduct experiments with multiple datasets and performance metrics and empirically show that our model is not only interpretable but also provides comparable performance to state-of-the-art probabilistic time series forecasting methods. Furthermore, we demonstrate that interpreting the parameters of the stochastic processes of interest can provide useful insights into several application areas.  ( 2 min )
    Multi-treatment Effect Estimation from Biomedical Data. (arXiv:2112.07574v3 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
    Seeking the Truth Beyond the Data. An Unsupervised Machine Learning Approach. (arXiv:2207.06949v3 [stat.ML] UPDATED)
    Clustering is an unsupervised machine learning methodology in which unlabeled elements/objects are grouped together with the aim of constructing well-established clusters whose elements are classified according to their similarity. The goal of this process is to provide a useful aid to the researcher that will help her/him to identify patterns among the data. When dealing with large databases, such patterns may not be easily detectable without the contribution of a clustering algorithm. This article provides an in-depth description of the most widely used clustering methodologies, accompanied by useful presentations concerning suitable parameter selection and initializations. In addition, this article not only reviews the major elements of the examined clustering techniques but also compares these algorithms' clustering efficiency on 3 datasets, revealing their weaknesses and capabilities in terms of accuracy and complexity when confronted with discrete and continuous observations. The produced results help us extract valuable conclusions about the appropriateness of the examined clustering techniques in accordance with the dataset's size.
    Low-rank Approximation of Linear Maps. (arXiv:1812.09042v2 [stat.ML] UPDATED)
    This work provides closed-form solutions and minimum achievable errors for a large class of low-rank approximation problems in Hilbert spaces. The proposed theorem generalizes to the case of bounded linear operators the previous results obtained in the finite dimensional case for the Frobenius norm. The theorem provides the basis for the design of tractable algorithms for kernel or continuous DMD.  ( 2 min )
    A Robust Data-driven Process Modeling Applied to Time-series Stochastic Power Flow. (arXiv:2301.02651v1 [eess.SY])
    In this paper, we propose a robust data-driven process model whose hyperparameters are robustly estimated using the Schweppe-type generalized maximum likelihood estimator. The proposed model is trained on recorded time-series data of voltage phasors and power injections to perform a time-series stochastic power flow calculation. Power system data are often corrupted with outliers caused by large errors, fault conditions, power outages, and extreme weather, to name a few. The proposed model downweights vertical outliers and bad leverage points in the measurements of the training dataset. The weights used to bound the influence of the outliers are calculated using projection statistics, which are a robust version of Mahalanobis distances of the time series data points. The proposed method is demonstrated on the IEEE 33-Bus power distribution system and a real-world unbalanced 240-bus power distribution system heavily integrated with renewable energy sources. Our simulation results show that the proposed robust model can handle up to 25% of outliers in the training data set.
    Reversibility of elliptical slice sampling revisited. (arXiv:2301.02426v1 [math.ST])
    We discuss the well-definedness of elliptical slice sampling, a Markov chain approach for approximate sampling of posterior distributions introduced by Murray, Adams and MacKay 2010. We point to a regularity requirement and provide an alternative proof of the reversibility property. In particular, this guarantees the correctness of the slice sampling scheme also on infinite-dimensional separable Hilbert spaces.  ( 2 min )
    Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction. (arXiv:2103.08280v5 [math.OC] UPDATED)
    In this paper, we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider Proximal Incremental First-order (PIFO) algorithms which have access to the gradient and proximal oracles for each component function. To incorporate loopless methods, we also allow PIFO algorithms to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We can also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.
    Convergence rates of the stochastic alternating algorithm for bi-objective optimization. (arXiv:2203.10605v2 [math.OC] UPDATED)
    Stochastic alternating algorithms for bi-objective optimization are considered when optimizing two conflicting functions for which optimization steps have to be applied separately for each function. Such algorithms consist of applying a certain number of steps of gradient or subgradient descent on each single objective at each iteration. In this paper, we show that stochastic alternating algorithms achieve a sublinear convergence rate of $\mathcal{O}(1/T)$, under strong convexity, for the determination of a minimizer of a weighted-sum of the two functions, parameterized by the number of steps applied on each of them. An extension to the convex case is presented for which the rate weakens to $\mathcal{O}(1/\sqrt{T})$. These rates are valid also in the non-smooth case. Importantly, by varying the proportion of steps applied to each function, one can determine an approximation to the Pareto front.
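    The alternating scheme described above can be sketched on a toy problem. The example assumes two strongly convex objectives $f_1(x) = x^2/2$ and $f_2(x) = (x-1)^2/2$, noisy gradients with seeded Gaussian noise, and a step size of $1/(t+1)$; all of these choices, and the function name, are illustrative assumptions rather than the paper's setup. Under them, applying $n_1$ steps on $f_1$ and $n_2$ steps on $f_2$ per iteration drives the iterate towards the minimizer of the weighted sum with weights $n_1/(n_1+n_2)$ and $n_2/(n_1+n_2)$, that is, towards $n_2/(n_1+n_2)$.

```python
import random


def alternating_sgd(n1, n2, iterations=5000, noise_std=0.1, seed=0):
    """Alternate n1 noisy gradient steps on f1 with n2 noisy gradient steps on f2.

    Toy objectives (assumptions for illustration):
      f1(x) = 0.5 * x**2        with gradient x
      f2(x) = 0.5 * (x - 1)**2  with gradient x - 1
    """
    rng = random.Random(seed)
    x = 5.0  # arbitrary starting point
    for t in range(1, iterations + 1):
        step = 1.0 / (t + 1)  # decaying step size, shared within one iteration
        for _ in range(n1):
            grad = x + rng.gauss(0.0, noise_std)          # stochastic gradient of f1
            x -= step * grad
        for _ in range(n2):
            grad = (x - 1.0) + rng.gauss(0.0, noise_std)  # stochastic gradient of f2
            x -= step * grad
    return x


balanced = alternating_sgd(n1=1, n2=1)  # expected to approach 1/2
skewed = alternating_sgd(n1=3, n2=1)    # expected to approach 1/4
```

    Sweeping the ratio of `n1` to `n2` traces out different weighted-sum minimizers, which in this one-dimensional toy corresponds to moving along the Pareto set between 0 and 1.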
    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev Spaces. (arXiv:2211.14400v2 [stat.ML] UPDATED)
    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev space $W^s(L_q(\Omega))$ with error measured in $L_p(\Omega)$. This problem is important when studying the application of neural networks in scientific computing and has previously been completely solved only in the case $p=q=\infty$. Our contribution is to provide a complete solution for all $1\leq p,q\leq \infty$ and $s > 0$, including asymptotically matching upper and lower bounds. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
  • Open

    Jensen-Shannon Divergence Based Loss Functions for Bayesian Neural Networks. (arXiv:2209.11366v2 [cs.LG] UPDATED)
    The Kullback-Leibler (KL) divergence is widely used for the variational inference of Bayesian Neural Networks (BNNs) to approximate the posterior distribution of weights. However, the KL divergence is unbounded and asymmetric, which may lead to instabilities during optimization or may yield poor generalizations. To overcome these limitations, we examine the Jensen-Shannon (JS) divergence that is more general, bounded, and symmetric. Towards this, we propose two novel loss functions for BNNs: 1) a geometric JS divergence (JS-G) based loss function that is symmetric but unbounded with closed-form expression for Gaussian priors and 2) a generalized JS divergence (JS-A) based loss function that is symmetric and bounded. We show that the conventional KL divergence-based loss function is a special case of the loss functions presented in this work. To evaluate the divergence part of the proposed JS-G-based loss function, we use an exact closed-form expression for Gaussian priors. For any other priors of JS-G and for the JS-A-based loss function we use Monte Carlo approximation. We provide algorithms to optimize the loss function using both these methods. The proposed loss functions offer additional parameters that can be tuned to control the regularisation. We explain the reason why the proposed loss functions should perform better than the state-of-the-art. Further, we derive the conditions under which the proposed JS-G-loss function regularises better than the KL divergence-based loss function for Gaussian priors and posteriors. The proposed JS divergence-based Bayesian convolutional neural networks (BCNN) perform better than the state-of-the-art BCNN, which is shown for the classification of the CIFAR data set having various degrees of noise and a biased histopathology data set.
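    The contrast between the unbounded, asymmetric KL divergence and a bounded, symmetric JS divergence can be seen in a small discrete example. The sketch below uses the standard JS divergence with an arithmetic-mean mixture; it is not the JS-G or JS-A variant proposed in the paper, and the function names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue          # terms with p_i = 0 contribute nothing
        if qi == 0.0:
            return math.inf   # p puts mass where q has none: divergence is infinite
        total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Standard Jensen-Shannon divergence, bounded above by ln 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = [1.0, 0.0]
q = [0.0, 1.0]
kl_pq = kl(p, q)  # infinite for disjoint supports
js_pq = js(p, q)  # attains the upper bound ln 2

a = [0.7, 0.3]
b = [0.2, 0.8]
asymmetry = kl(a, b) - kl(b, a)  # nonzero: KL depends on argument order
```

    For distributions with disjoint supports the KL divergence is infinite while the JS divergence stays at its maximum of $\ln 2$, which is the boundedness property the abstract appeals to.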
    Proportional Multicalibration. (arXiv:2209.14613v2 [cs.LG] UPDATED)
    Multicalibration is a desirable fairness criterion that constrains calibration error among flexibly-defined groups in the data while maintaining overall calibration. However, when outcome probabilities are correlated with group membership, multicalibrated models can exhibit a higher percent calibration error among groups with lower base rates than groups with higher base rates. As a result, it remains possible for a decision-maker to learn to trust or distrust model predictions for specific groups. To alleviate this, we propose proportional multicalibration, a criterion that constrains the percent calibration error among groups and within prediction bins. We prove that satisfying proportional multicalibration bounds a model's multicalibration as well as its differential calibration, a stronger fairness criterion inspired by the fairness notion of sufficiency. We provide an efficient algorithm for post-processing risk prediction models for proportional multicalibration and evaluate it empirically. We conduct simulation studies and investigate a real-world application of PMC-postprocessing to prediction of emergency department patient admissions. We observe that proportional multicalibration is a promising criterion for controlling simultaneous measures of calibration fairness of a model over intersectional groups with virtually no cost in terms of classification performance.
    DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning. (arXiv:2111.12062v2 [cs.LG] UPDATED)
    Self-supervised learning algorithms, including BERT and SimCLR, have enabled significant strides in fields like natural language processing, computer vision, and speech processing. However, these algorithms are domain-specific, meaning that new self-supervised learning algorithms must be developed for each new setting, including myriad healthcare, scientific, and multimodal domains. To catalyze progress toward domain-agnostic methods, we introduce DABS: a Domain-Agnostic Benchmark for Self-supervised learning. To perform well on DABS, an algorithm is evaluated on seven diverse domains: natural images, multichannel sensor data, English text, speech recordings, multilingual text, chest x-rays, and images with text descriptions. Each domain contains an unlabeled dataset for pretraining; the model is then scored based on its downstream performance on a set of labeled tasks in the domain. We also present e-Mix and ShED: two baseline domain-agnostic algorithms; their relatively modest performance demonstrates that significant progress is needed before self-supervised learning is an out-of-the-box solution for arbitrary domains. Code for benchmark datasets and baseline algorithms is available at https://github.com/alextamkin/dabs.
    p-Adic Statistical Field Theory and Deep Belief Networks. (arXiv:2207.13877v4 [math-ph] UPDATED)
    In this work we initiate the study of the correspondence between p-adic statistical field theories (SFTs) and neural networks (NNs). In general, quantum field theories over a p-adic spacetime can be formulated in a rigorous way. Nowadays these theories are considered just mathematical toy models for understanding the problems of the true theories. In this work we show these theories are deeply connected with the deep belief networks (DBNs). Hinton et al. constructed DBNs by stacking several restricted Boltzmann machines (RBMs). The purpose of this construction is to obtain a network with a hierarchical structure (a deep learning architecture). An RBM corresponds to a certain spin glass; we argue that a DBN should correspond to an ultrametric spin glass. A model of such a system can be easily constructed by using p-adic numbers. In our approach, a p-adic SFT corresponds to a p-adic continuous DBN, and a discretization of this theory corresponds to a p-adic discrete DBN. We show that these last machines are universal approximators. In the p-adic framework, the correspondence between SFTs and NNs is not fully developed. We point out several open problems.
    Inversion of sea surface currents from satellite-derived SST-SSH synergies with 4DVarNets. (arXiv:2211.13059v2 [physics.ao-ph] UPDATED)
    Satellite altimetry is a unique way for direct observations of sea surface dynamics. This is however limited to the surface-constrained geostrophic component of sea surface velocities. Ageostrophic dynamics are however expected to be significant for horizontal scales below 100 km and time scale below 10 days. The assimilation of ocean general circulation models likely reveals only a fraction of this ageostrophic component. Here, we explore a learning-based scheme to better exploit the synergies between the observed sea surface tracers, especially sea surface height (SSH) and sea surface temperature (SST), to better inform sea surface currents. More specifically, we develop a 4DVarNet scheme which exploits a variational data assimilation formulation with trainable observations and a priori terms. An Observing System Simulation Experiment (OSSE) in a region of the Gulf Stream suggests that SST-SSH synergies could reveal sea surface velocities for time scales of 2.5-3.0 days and horizontal scales of 0.5$^\circ$-0.7$^\circ$, including a significant fraction of the ageostrophic dynamics ($\approx$ 47%). The analysis of the contribution of different observation data, namely nadir along-track altimetry, wide-swath SWOT altimetry and SST data, emphasizes the role of SST features for the reconstruction at horizontal spatial scales ranging from 1/20$^\circ$ to 1/4$^\circ$.
    TarViS: A Unified Approach for Target-based Video Segmentation. (arXiv:2301.02657v1 [cs.CV])
    The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two.
    When Spectral Modeling Meets Convolutional Networks: A Method for Discovering Reionization-era Lensed Quasars in Multi-band Imaging Data. (arXiv:2211.14543v2 [astro-ph.GA] UPDATED)
    Over the last two decades, around 300 quasars have been discovered at $z\gtrsim6$, yet only one has been identified as being strongly gravitationally lensed. We explore a new approach -- enlarging the permitted spectral parameter space, while introducing a new spatial geometry veto criterion -- which is implemented via image-based deep learning. We first apply this approach to a systematic search for reionization-era lensed quasars, using data from the Dark Energy Survey, the Visible and Infrared Survey Telescope for Astronomy Hemisphere Survey, and the Wide-field Infrared Survey Explorer. Our search method consists of two main parts: (i) the preselection of the candidates based on their spectral energy distributions (SEDs) using catalog-level photometry and (ii) the calculation of the relative probabilities of the candidates being a lens or some contaminant, utilizing a convolutional neural network (CNN) classification. The training data sets are constructed by painting deflected point-source lights over actual galaxy images, to generate realistic galaxy-quasar lens models, optimized to find systems with small image separations, i.e., Einstein radii of $\theta_\mathrm{E} \leq 1$ arcsec. Visual inspection is then performed for sources with CNN scores of $P_\mathrm{lens} > 0.1$, which leads us to obtain 36 newly selected lens candidates, which are awaiting spectroscopic confirmation. These findings show that automated SED modeling and deep learning pipelines, supported by modest human input, are a promising route for detecting strong lenses from large catalogs that can overcome the veto limitations of primarily dropout-based SED selection approaches.
    Utilising physics-guided deep learning to overcome data scarcity. (arXiv:2211.15664v2 [cs.LG] UPDATED)
    Deep learning (DL) relies heavily on data, and the quality of data influences its performance significantly. However, obtaining high-quality, well-annotated datasets can be challenging or even impossible in many real-world applications, such as structural risk estimation and medical diagnosis. This presents a significant barrier to the practical implementation of DL in these fields. Physics-guided deep learning (PGDL) is a novel type of DL that can integrate physics laws to train neural networks. This can be applied to any systems that are controlled or governed by physics laws, such as mechanics, finance and medical applications. It has been demonstrated that, with the additional information provided by physics laws, PGDL achieves great accuracy and generalisation in the presence of data scarcity. This review provides a detailed examination of PGDL and offers a structured overview of its use in addressing data scarcity across various fields, including physics, engineering and medical applications. Moreover, the review identifies the current limitations and opportunities for PGDL in relation to data scarcity and offers a thorough discussion on the future prospects of PGDL.
    Expanding boundaries of Gap Safe screening. (arXiv:2102.10846v2 [cs.LG] UPDATED)
    Sparse optimization problems are ubiquitous in many fields such as statistics, signal/image processing and machine learning. This has led to the birth of many iterative algorithms to solve them. A powerful strategy to boost the performance of these algorithms is known as safe screening: it allows the early identification of zero coordinates in the solution, which can then be eliminated to reduce the problem's size and accelerate convergence. In this work, we extend the existing Gap Safe screening framework by relaxing the global strong-concavity assumption on the dual cost function. Instead, we exploit local regularity properties, that is, strong concavity on well-chosen subsets of the domain. The non-negativity constraint is also integrated to the existing framework. Besides making safe screening possible to a broader class of functions that includes beta-divergences (e.g., the Kullback-Leibler divergence), the proposed approach also improves upon the existing Gap Safe screening rules on previously applicable cases (e.g., logistic regression). The proposed general framework is exemplified by some notable particular cases: logistic function, beta = 1.5 and Kullback-Leibler divergences. Finally, we showcase the effectiveness of the proposed screening rules with different solvers (coordinate descent, multiplicative-update and proximal gradient algorithms) and different data sets (binary classification, hyperspectral and count data).
    Two Wrongs Don't Make a Right: Combating Confirmation Bias in Learning with Label Noise. (arXiv:2112.02960v3 [cs.LG] UPDATED)
    Noisy labels damage the performance of deep networks. For robust learning, a prominent two-stage pipeline alternates between eliminating possible incorrect labels and semi-supervised training. However, discarding part of noisy labels could result in a loss of information, especially when the corruption has a dependency on data, e.g., class-dependent or instance-dependent. Moreover, from the training dynamics of a representative two-stage method DivideMix, we identify the domination of confirmation bias: pseudo-labels fail to correct a considerable amount of noisy labels, and consequently, the errors accumulate. To sufficiently exploit information from noisy labels and mitigate wrong corrections, we propose Robust Label Refurbishment (Robust LR), a new hybrid method that integrates pseudo-labeling and confidence estimation techniques to refurbish noisy labels. We show that our method successfully alleviates the damage of both label noise and confirmation bias. As a result, it achieves state-of-the-art performance across datasets and noise types, namely CIFAR under different levels of synthetic noise and Mini-WebVision and ANIMAL-10N with real-world noise.
    Fast and Low-Memory Deep Neural Networks Using Binary Matrix Factorization. (arXiv:2210.13468v2 [cs.LG] UPDATED)
    Despite the outstanding performance of deep neural networks in different applications, they are still computationally expensive and require a large amount of memory. This motivates more research on reducing the resources required for implementing such networks. An efficient approach for this purpose is matrix factorization, which has been shown to be effective on different networks. In this paper, we utilize binary matrix factorization and show its great efficiency in reducing the resources required by deep neural networks. In effect, this technique can lead to the practical implementation of such networks.
    First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation. (arXiv:2212.03251v2 [cs.LG] UPDATED)
    Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present a clear ablation study of post-exploration in a general intrinsically motivated goal exploration process (IMGEP) framework, which the Go-Explore paper did not show. We study the isolated potential of post-exploration by turning it on and off within the same algorithm under both tabular and deep RL settings on both discrete navigation and continuous control tasks. Experiments on a range of MiniGrid and Mujoco environments show that post-exploration indeed helps IMGEP agents reach more diverse states and boosts their performance. In short, our work suggests that RL researchers should consider using post-exploration in IMGEP when possible since it is effective, method-agnostic and easy to implement.
    Centralized Cooperative Exploration Policy for Continuous Control Tasks. (arXiv:2301.02375v1 [cs.LG])
    The deep reinforcement learning (DRL) algorithm works brilliantly on solving various complex control tasks. This phenomenal success can be partly attributed to DRL encouraging intelligent agents to sufficiently explore the environment and collect diverse experiences during the agent training process. Therefore, exploration plays a significant role in accessing an optimal policy for DRL. Despite recent works making great progress in continuous control tasks, exploration in these tasks has remained insufficiently investigated. To explicitly encourage exploration in continuous control tasks, we propose CCEP (Centralized Cooperative Exploration Policy), which utilizes underestimation and overestimation of value functions to maintain the capacity of exploration. CCEP first keeps two value functions initialized with different parameters, and generates diverse policies with multiple exploration styles from a pair of value functions. In addition, a centralized policy framework ensures that CCEP achieves message delivery between multiple policies, furthermore contributing to exploring the environment cooperatively. Extensive experimental results demonstrate that CCEP achieves higher exploration capacity. Empirical analysis shows diverse exploration styles in the learned policies by CCEP, reaping benefits in more exploration regions. And this exploration capacity of CCEP ensures it outperforms the current state-of-the-art methods across multiple continuous control tasks shown in experiments.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v4 [cs.LG] UPDATED)
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating information from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two different kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-WL and can distinguish almost all regular graphs. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs and its expressive power is bounded by 3-WL. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We show that KP-GNN can distinguish many distance regular graphs which could not be distinguished by previous distance encoding or 3-WL methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
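    The idea of aggregating from K-hop neighbors can be illustrated with a small sketch. It assumes that the "k-hop neighbors" of a node are the nodes at shortest-path distance exactly k, which is only one possible choice of kernel, and it uses a plain sum as the aggregator. The function names `k_hop_neighbors` and `k_hop_sum_aggregate` are hypothetical and are not taken from the KP-GNN code.

```python
from collections import deque


def k_hop_neighbors(adj, source, K):
    """Return {k: set of nodes at shortest-path distance exactly k} for k = 1..K.

    `adj` maps each node to an iterable of its neighbors (undirected graph).
    The shortest-path definition of "k-hop" is an assumption of this sketch.
    """
    dist = {source: 0}
    hops = {k: set() for k in range(1, K + 1)}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == K:
            continue  # do not expand beyond K hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                hops[dist[v]].add(v)
                queue.append(v)
    return hops


def k_hop_sum_aggregate(adj, features, K):
    """For every node, sum the scalar features of its neighbors hop by hop.

    The result for a node is a list [sum over 1-hop, ..., sum over K-hop].
    """
    out = {}
    for node in adj:
        hops = k_hop_neighbors(adj, node, K)
        out[node] = [sum(features[v] for v in hops[k]) for k in range(1, K + 1)]
    return out
```

    With K = 1 this reduces to ordinary 1-hop message passing; larger K gives each node one aggregated value per hop in a single step.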
    Multi-Agent Reinforcement Learning for Fast-Timescale Demand Response of Residential Loads. (arXiv:2301.02593v1 [cs.MA])
    To integrate high amounts of renewable energy resources, electrical power grids must be able to cope with high amplitude, fast timescale variations in power generation. Frequency regulation through demand response has the potential to coordinate temporally flexible loads, such as air conditioners, to counteract these variations. Existing approaches for discrete control with dynamic constraints struggle to provide satisfactory performance for fast timescale action selection with hundreds of agents. We propose a decentralized agent trained with multi-agent proximal policy optimization with localized communication. We explore two communication frameworks: hand-engineered, or learned through targeted multi-agent communication. The resulting policies perform well and robustly for frequency regulation, and scale seamlessly to arbitrary numbers of houses for constant processing times.
    Signal Enhancement for Magnetic Navigation Challenge Problem. (arXiv:2007.12158v2 [cs.LG] UPDATED)
    Harnessing the magnetic field of the Earth for navigation has shown promise as a viable alternative to other navigation systems. A magnetic navigation system collects its own magnetic field data using a magnetometer and uses magnetic anomaly maps to determine the current location. The greatest challenge with magnetic navigation arises when the magnetic field measurements from the magnetometer encompass the magnetic field from not just the Earth, but also from the vehicle on which it is mounted. It is difficult to separate the Earth magnetic anomaly field, which is crucial for navigation, from the total magnetic field reading from the sensor. The purpose of this challenge problem is to decouple the Earth and aircraft magnetic signals in order to derive a clean signal from which to perform magnetic navigation. Baseline testing on the dataset has shown that the Earth magnetic field can be extracted from the total magnetic field using machine learning (ML). The challenge is to remove the aircraft magnetic field from the total magnetic field using a trained model. This challenge offers an opportunity to construct an effective model for removing the aircraft magnetic field from the dataset by using a scientific machine learning (SciML) approach comprised of an ML algorithm integrated with the physics of magnetic navigation.
    Understanding Urban Water Consumption using Remotely Sensed Data. (arXiv:2205.02932v2 [cs.CV] UPDATED)
    Urban metabolism is an active field of research that deals with the estimation of emissions and resource consumption from urban regions. The analysis could be carried out through a manual survey or by the implementation of elegant machine learning algorithms. In this exploratory work, we estimate the water consumption by the buildings in the region captured by satellite imagery. To this end, we break our analysis into three parts: i) identification of building pixels, given a satellite image, followed by ii) identification of the building type (residential/non-residential) from the building pixels, and finally iii) using the building pixels along with their type to estimate the water consumption using the average per unit area consumption for different building types as obtained from municipal surveys.
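    Step iii) of the pipeline above amounts to simple arithmetic, sketched below under stated assumptions: each building pixel covers a known ground area, and a per-unit-area consumption rate is available for each building type. The function name, the dictionary keys, and all numbers in the usage note are hypothetical placeholders, not values from the paper or from any municipal survey.

```python
def estimate_water_consumption(building_pixels, pixel_area_m2, rate_per_m2):
    """Estimate total water consumption from classified building pixels.

    building_pixels: {building_type: number of pixels classified as that type}
    pixel_area_m2:   ground area covered by one pixel, in square metres
    rate_per_m2:     {building_type: average consumption per square metre}
    """
    total = 0.0
    for building_type, n_pixels in building_pixels.items():
        area = n_pixels * pixel_area_m2  # built-up area for this type
        total += area * rate_per_m2[building_type]
    return total
```

    For example, 100 residential and 50 non-residential pixels at 0.25 square metres per pixel, with illustrative rates of 2.0 and 4.0 units per square metre, give 50 + 50 = 100 units in total.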
    Learning Invariant Rules from Data for Interpretable Anomaly Detection. (arXiv:2211.13577v2 [cs.LG] UPDATED)
    In the research area of anomaly detection, novel and promising methods are frequently developed. However, most existing studies focus exclusively on the detection task and ignore the interpretability of the underlying models as well as their detection results. Yet anomaly interpretation, which aims to explain why specific data instances are identified as anomalies, is an equally important task in many real-world applications. In this work, we propose a novel framework which synergizes several machine learning and data mining techniques to automatically learn invariant rules that are consistently satisfied in the training data. The learned invariant rules can provide explicit explanation of anomaly detection results and thus are extremely useful for subsequent decision-making regarding reported anomalies. Furthermore, our empirical evaluation shows that the proposed method can also achieve comparable or even better performance in terms of AUC and partial AUC on public benchmark datasets across various application domains compared with state-of-the-art anomaly detection models.
    Silent Killer: Optimizing Backdoor Trigger Yields a Stealthy and Powerful Data Poisoning Attack. (arXiv:2301.02615v1 [cs.CR])
    We propose a stealthy and powerful backdoor attack on neural networks based on data poisoning (DP). In contrast to previous attacks, both the poison and the trigger in our method are stealthy. We are able to change the model's classification of samples from a source class to a target class chosen by the attacker. We do so by using a small number of poisoned training samples with nearly imperceptible perturbations, without changing their labels. At inference time, we use a stealthy perturbation added to the attacked samples as a trigger. This perturbation is crafted as a universal adversarial perturbation (UAP), and the poison is crafted using gradient alignment coupled to this trigger. Our method is highly efficient in crafting time compared to previous methods and requires only a trained surrogate model without additional retraining. Our attack achieves state-of-the-art results in terms of attack success rate while maintaining high accuracy on clean samples.
    Superficial White Matter Analysis: An Efficient Point-cloud-based Deep Learning Framework with Supervised Contrastive Learning for Consistent Tractography Parcellation across Populations and dMRI Acquisitions. (arXiv:2207.08975v2 [eess.IV] UPDATED)
    Diffusion MRI tractography is an advanced imaging technique that enables in vivo mapping of the brain's white matter connections. White matter parcellation classifies tractography streamlines into clusters or anatomically meaningful tracts. It enables quantification and visualization of whole-brain tractography. Currently, most parcellation methods focus on the deep white matter (DWM), whereas fewer methods address the superficial white matter (SWM) due to its complexity. We propose a novel two-stage deep-learning-based framework, Superficial White Matter Analysis (SupWMA), that performs an efficient and consistent parcellation of 198 SWM clusters from whole-brain tractography. A point-cloud-based network is adapted to our SWM parcellation task, and supervised contrastive learning enables more discriminative representations between plausible streamlines and outliers for SWM. We train our model on a large-scale tractography dataset including streamline samples from labeled long- and medium-range (over 40mm) SWM clusters and anatomically implausible streamline samples, and we perform testing on six independently acquired datasets of different ages and health conditions (including neonates and patients with space-occupying brain tumors). Compared to several state-of-the-art methods, SupWMA obtains highly consistent and accurate SWM parcellation results on all datasets, showing good generalization across the lifespan in health and disease. In addition, the computational speed of SupWMA is much faster than other methods.
    Triple-stream Deep Metric Learning of Great Ape Behavioural Actions. (arXiv:2301.02642v1 [cs.CV])
    We propose the first metric learning system for the recognition of great ape behavioural actions. Our proposed triple stream embedding architecture works on camera trap videos taken directly in the wild and demonstrates that the utilisation of an explicit DensePose-C chimpanzee body part segmentation stream effectively complements traditional RGB appearance and optical flow streams. We evaluate system variants with different feature fusion techniques and long-tail recognition approaches. Results and ablations show performance improvements of ~12% in top-1 accuracy over previous results achieved on the PanAf-500 dataset containing 180,000 manually annotated frames across nine behavioural actions. Furthermore, we provide a qualitative analysis of our findings and augment the metric learning system with long-tail recognition techniques showing that average per class accuracy -- critical in the domain -- can be improved by ~23% compared to the literature on that dataset. Finally, since our embedding spaces are constructed as metric, we provide first data-driven visualisations of the great ape behavioural action spaces revealing emerging geometry and topology. We hope that the work sparks further interest in this vital application area of computer vision for the benefit of endangered great apes.
    Learning from a Biased Sample. (arXiv:2209.01754v2 [stat.ME] UPDATED)
    The empirical risk minimization approach to data-driven decision making assumes that we can learn a decision rule from training data drawn under the same conditions as the ones we want to deploy it in. However, in a number of settings, we may be concerned that our training sample is biased, and that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in simulations and a case study on ICU length of stay prediction.
    Does compressing activations help model parallel training?. (arXiv:2301.02654v1 [cs.LG])
    Large-scale Transformer models are known for their exceptional performance in a range of tasks, but training them can be difficult due to the requirement for communication-intensive model parallelism. One way to improve training speed is to compress the message size in communication. Previous approaches have primarily focused on compressing gradients in a data parallelism setting, but compression in a model-parallel setting is an understudied area. We have discovered that model parallelism has fundamentally different characteristics than data parallelism. In this work, we present the first empirical study on the effectiveness of compression methods for model parallelism. We implement and evaluate three common classes of compression algorithms - pruning-based, learning-based, and quantization-based - using a popular Transformer training framework. We evaluate these methods across more than 160 settings and 8 popular datasets, taking into account different hyperparameters, hardware, and both fine-tuning and pre-training stages. We also provide analysis when the model is scaled up. Finally, we provide insights for future development of model parallelism compression algorithms.
    SEQUENT: Towards Traceable Quantum Machine Learning using Sequential Quantum Enhanced Training. (arXiv:2301.02601v1 [quant-ph])
    Applying new computing paradigms like quantum computing to the field of machine learning has recently gained attention. However, as high-dimensional real-world applications are not yet feasible to be solved using purely quantum hardware, hybrid methods using both classical and quantum machine learning paradigms have been proposed. For instance, transfer learning methods have been shown to be successfully applicable to hybrid image classification tasks. Nevertheless, beneficial circuit architectures still need to be explored. Therefore, tracing the impact of the chosen circuit architecture and parameterization is crucial for the development of beneficially applicable hybrid methods. However, current methods include processes where both parts are trained concurrently, therefore not allowing for a strict separability of classical and quantum impact. Thus, those architectures might produce models that yield a superior prediction accuracy whilst employing the least possible quantum impact. To tackle this issue, we propose Sequential Quantum Enhanced Training (SEQUENT), an improved architecture and training process for the traceable application of quantum computing methods to hybrid machine learning. Furthermore, we provide formal evidence for the disadvantage of current methods and preliminary experimental results as a proof-of-concept for the applicability of SEQUENT.
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v5 [cs.LG] UPDATED)
    Different from existing Differential Privacy (DP) accountants, we introduce pro-active DP. Existing DP accountants keep track of how privacy budget has been spent, while pro-active DP is a scheme that allows one to select the parameters of DP-SGD a priori, based on a fixed privacy budget (in terms of $\epsilon$ and $\delta$), in such a way as to optimize the anticipated utility (test accuracy) the most. To implement this idea, we show how to convert the classical DP moment accountant to a pro-active DP by exploiting the fact that it has a simple closed form for computing spent privacy budget for a given interaction round. The DP moment accountant is introduced in the context of DP-SGD and has the following property, which is the key ingredient to build pro-active DP. In DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise with standard deviation $\sigma$ is added to local SGD updates after performing a clipping operation and normalizing with the clipping constant. We show that for attaining $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ for $\epsilon=\Omega(T/N^2)$, where $T$ is the total number of rounds and $N$ is equal to the size of the local data set. In many existing machine learning problems, $N$ is always large and $T=O(N)$. Hence, $\sigma$ becomes "independent" of any $T=O(N)$ choice with $\epsilon=\Omega(1/N)$. This means that our $\sigma$ only depends on $N$ rather than $T$. We show how this differential privacy characterization allows us to convert DP moment accountant to a pro-active DP.
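    The closed-form noise level stated in the abstract can be evaluated directly. The sketch below only computes $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$; the function name `proactive_sigma` is hypothetical, and the code does not check the regime condition $\epsilon=\Omega(T/N^2)$ under which the abstract states the formula applies, since the hidden constant is not given here.

```python
import math


def proactive_sigma(epsilon: float, delta: float) -> float:
    """Evaluate sigma = sqrt(2 * (epsilon + ln(1/delta)) / epsilon).

    Only the formula quoted in the abstract is computed. Whether a given
    (epsilon, T, N) lies in the regime where the formula is valid is NOT
    checked by this sketch.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return math.sqrt(2.0 * (epsilon + math.log(1.0 / delta)) / epsilon)
```

    Note that neither $T$ nor $N$ appears as an argument: within the stated regime, the noise level is fixed by the privacy budget alone.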
    Provable Reset-free Reinforcement Learning by No-Regret Reduction. (arXiv:2301.02389v1 [cs.LG])
    Real-world reinforcement learning (RL) is often severely limited since typical RL algorithms heavily rely on the reset mechanism to sample proper initial states. In practice, the reset mechanism is expensive to implement due to the need for human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction to systematically design reset-free RL algorithms. Our reduction turns reset-free RL into a two-player game. We show that achieving sublinear regret in this two player game would imply learning a policy that has both sublinear performance regret and sublinear total number of resets in the original RL problem. This means that the agent eventually learns to perform optimally and avoid resets. By this reduction, we design an instantiation for linear Markov decision processes, which is the first provably correct reset-free RL algorithm to our knowledge.
    Text-Based Automatic Personality Prediction Using KGrAt-Net; A Knowledge Graph Attention Network Classifier. (arXiv:2205.13780v2 [cs.CL] UPDATED)
    Nowadays, a tremendous amount of human communications occur on Internet-based communication infrastructures, like social networks, email, forums, organizational communication platforms, etc. Indeed, the automatic prediction or assessment of individuals' personalities through their written or exchanged text would be advantageous to ameliorate their relationships. To this end, this paper aims to propose KGrAt-Net, which is a Knowledge Graph Attention Network text classifier. For the first time, it applies the knowledge graph attention network to perform Automatic Personality Prediction (APP), according to the Big Five personality traits. After performing some preprocessing activities, it first tries to acquire a knowing-full representation of the knowledge behind the concepts in the input text by building its equivalent knowledge graph. A knowledge graph collects interlinked descriptions of concepts, entities, and relationships in a machine-readable form. Practically, it provides a machine-readable cognitive understanding of concepts and semantic relationships among them. Then, applying the attention mechanism, it attempts to pay attention to the most relevant parts of the graph to predict the personality traits of the input text. We used 2,467 essays from the Essays Dataset. The results demonstrated that KGrAt-Net considerably improved personality prediction accuracies (up to 70.26% on average). Furthermore, KGrAt-Net also uses knowledge graph embedding to enrich the classification, which makes it even more accurate (on average, 72.41%) in APP.
    Approximate Real Symmetric Tensor Rank. (arXiv:2207.12529v3 [math.NA] UPDATED)
    We investigate the effect of an $\varepsilon$-room of perturbation tolerance on symmetric tensor decomposition. To be more precise, suppose a real symmetric $d$-tensor $f$, a norm $||.||$ on the space of symmetric $d$-tensors, and $\varepsilon >0$ are given. What is the smallest symmetric tensor rank in the $\varepsilon$-neighborhood of $f$? In other words, what is the symmetric tensor rank of $f$ after a clever $\varepsilon$-perturbation? We prove two theorems and develop three corresponding algorithms that give constructive upper bounds for this question. With expository goals in mind, we present probabilistic and convex geometric ideas behind our results, reproduce some known results, and point out open problems.
    "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy. (arXiv:2301.02555v1 [cs.RO])
    Systems for language-guided human-robot interaction must satisfy two key desiderata for broad adoption: adaptivity and learning efficiency. Unfortunately, existing instruction-following agents cannot adapt, lacking the ability to incorporate online natural language supervision, and even if they could, require hundreds of demonstrations to learn even simple policies. In this work, we address these problems by presenting Language-Informed Latent Actions with Corrections (LILAC), a framework for incorporating and adapting to natural language corrections - "to the right," or "no, towards the book" - online, during execution. We explore rich manipulation domains within a shared autonomy paradigm. Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot: language is an input to a learned model that produces a meaningful, low-dimensional control space that the human can use to guide the robot. Each real-time correction refines the human's control space, enabling precise, extended behaviors - with the added benefit of requiring only a handful of demonstrations to learn. We evaluate our approach via a user study where users work with a Franka Emika Panda manipulator to complete complex manipulation tasks. Compared to existing learned baselines covering both open-loop instruction following and single-turn shared autonomy, we show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users because of its reliability, precision, and ease of use.
    Multi-treatment Effect Estimation from Biomedical Data. (arXiv:2112.07574v3 [cs.LG] UPDATED)
    This work proposes the M3E2, a multi-task learning neural network model to estimate the effect of multiple treatments. In contrast to existing methods, M3E2 can handle multiple treatment effects applied simultaneously to the same unit, continuous and binary treatments, and many covariates. We compared M3E2 with three baselines in three synthetic benchmark datasets: two with multiple treatments and one with one treatment. Our analysis showed that our method has superior performance, making more assertive estimations of the multiple treatment effects.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v4 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain competitive results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
    Low-rank Approximation of Linear Maps. (arXiv:1812.09042v2 [stat.ML] UPDATED)
    This work provides closed-form solutions and minimum achievable errors for a large class of low-rank approximation problems in Hilbert spaces. The proposed theorem generalizes to the case of bounded linear operators the previous results obtained in the finite dimensional case for the Frobenius norm. The theorem provides the basis for the design of tractable algorithms for kernel or continuous DMD.
    Interpretable Disease Prediction based on Reinforcement Path Reasoning over Knowledge Graphs. (arXiv:2010.08300v2 [cs.LG] UPDATED)
    Objective: To combine medical knowledge and medical data to interpretably predict the risk of disease. Methods: We formulated the disease prediction task as a random walk along a knowledge graph (KG). Specifically, we build a KG to record relationships between diseases and risk factors according to validated medical knowledge. Then, a mathematical object walks along the KG. It starts walking at a patient entity, which connects to the KG based on the patient's current diseases or risk factors, and stops at a disease entity, which represents the predicted disease. The trajectory generated by the object represents an interpretable disease progression path of the given patient. The dynamics of the object are controlled by a policy-based reinforcement learning (RL) module, which is trained on electronic health records (EHRs). Experiments: We utilized two real-world EHR datasets to evaluate the performance of our model. In the disease prediction task, our model achieves 0.743 and 0.639 in terms of macro area under the curve (AUC) in predicting 53 circulation system diseases in the two datasets, respectively. This performance is comparable to the commonly used machine learning (ML) models in medical research. In qualitative analysis, our clinical collaborator reviewed the disease progression paths generated by our model and advocated their interpretability and reliability. Conclusion: Experimental results validate the proposed model in interpretably evaluating and optimizing disease prediction. Significance: Our work contributes to leveraging the potential of medical knowledge and medical data jointly for interpretable prediction tasks.
    Multifidelity Modeling for Physics-Informed Neural Networks (PINNs). (arXiv:2106.13361v2 [physics.comp-ph] UPDATED)
    Multifidelity simulation methodologies are often used in an attempt to judiciously combine low-fidelity and high-fidelity simulation results in an accuracy-increasing, cost-saving way. Candidates for this approach are simulation methodologies for which there are fidelity differences connected with significant computational cost differences. Physics-informed Neural Networks (PINNs) are candidates for these types of approaches due to the significant difference in training times required when different fidelities (expressed in terms of architecture width and depth as well as optimization criteria) are employed. In this paper, we propose a particular multifidelity approach applied to PINNs that exploits low-rank structure. We demonstrate that width, depth, and optimization criteria can be used as parameters related to model fidelity, and show numerical justification of cost differences in training due to fidelity parameter choices. We test our multifidelity scheme on various canonical forward PDE models that have been presented in the emerging PINNs literature.
    Incremental Without Replacement Sampling in Nonconvex Optimization. (arXiv:2007.07557v4 [cs.LG] UPDATED)
    Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hand, modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analyses are much scarcer. We provide convergence guarantees for the latter variant by analysing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing or adaptive step sizes. In the smooth setting we obtain explicit complexity estimates in terms of epoch counter. In the nonsmooth setting we prove that the sequence is attracted by solutions of optimality conditions of the problem.
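    As a rough illustration of the distinction drawn in the abstract above (and not the paper's scheme), the following sketch runs an incremental gradient method that draws a fresh permutation each epoch, i.e. samples without replacement. The function name, the toy objective, and the constant step size are all assumptions made for this example.

```python
import random


def incremental_gradient(grads, x0, step, epochs, seed=0):
    """Incremental gradient descent that visits every component once per epoch.

    `grads` is a list of per-sample gradient functions (hypothetical inputs).
    """
    rng = random.Random(seed)
    x = x0
    order = list(range(len(grads)))
    for _ in range(epochs):
        rng.shuffle(order)  # without replacement: a fresh permutation per epoch
        for i in order:
            x -= step * grads[i](x)
    return x


# Toy problem: minimize the average of (x - a_i)^2 / 2, whose minimizer is mean(a).
targets = [1.0, 2.0, 6.0]
grads = [lambda x, a=a: x - a for a in targets]
x_final = incremental_gradient(grads, x0=0.0, step=0.01, epochs=2000)
```

    With a constant step size the iterate settles in a small neighbourhood of the minimizer (here the mean of the targets, 3.0) rather than converging exactly, which is one reason decreasing or adaptive step sizes are also of interest.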
    Neuro-DynaStress: Predicting Dynamic Stress Distributions in Structural Components. (arXiv:2301.02580v1 [physics.geo-ph])
    Structural components are typically exposed to dynamic loading, such as earthquakes, wind, and explosions. Structural engineers should be able to conduct real-time analysis in the aftermath or during extreme disaster events requiring immediate corrections to avoid fatal failures. As a result, it is crucial to predict dynamic stress distributions during highly disruptive events in real-time. Currently available high-fidelity methods, such as Finite Element Models (FEMs), suffer from their inherent high complexity and are computationally prohibitive. Therefore, to reduce computational cost while preserving accuracy, a deep learning model, Neuro-DynaStress, is proposed to predict the entire sequence of stress distribution based on finite element simulations using a partial differential equation (PDE) solver. The model was designed and trained to use the geometry, boundary conditions and sequence of loads as input and predict the sequences of high-resolution stress contours. The performance of the proposed framework is compared to finite element simulations using a PDE solver.
    Architect, Regularize and Replay (ARR): a Flexible Hybrid Approach for Continual Learning. (arXiv:2301.02464v1 [cs.LG])
    In recent years we have witnessed a renewed interest in machine learning methodologies, especially for deep representation learning, that could overcome basic i.i.d. assumptions and tackle non-stationary environments subject to various distributional shifts or sample selection biases. Within this context, several computational approaches based on architectural priors, regularizers and replay policies have been proposed with different degrees of success depending on the specific scenario in which they were developed and assessed. However, designing comprehensive hybrid solutions that can flexibly and generally be applied with tunable efficiency-effectiveness trade-offs still seems a distant goal. In this paper, we propose "Architect, Regularize and Replay" (ARR), a hybrid generalization of the renowned AR1 algorithm and its variants, that can achieve state-of-the-art results in classic scenarios (e.g. class-incremental learning) but also generalize to arbitrary data streams generated from real-world datasets such as CIFAR-100, CORe50 and ImageNet-1000.
    Quantum reinforcement learning in continuous action space. (arXiv:2012.10711v3 [quant-ph] UPDATED)
    Quantum reinforcement learning (QRL) is one promising algorithm proposed for near-term quantum devices. Early QRL proposals are effective at solving problems in discrete action space, but often suffer from the curse of dimensionality in the continuous domain due to discretization. To address this problem, we propose a quantum Deep Deterministic Policy Gradient algorithm that is efficient at solving both classical and quantum sequential decision problems in the continuous domain. As an application, our method can solve the quantum state-generation problem in a single shot: it only requires a one-shot optimization to generate a model that outputs the desired control sequence for arbitrary target state. In comparison, the standard quantum control method requires optimizing for each target state. Moreover, our method can also be used to physically reconstruct an unknown quantum state.
    Covid19 Reproduction Number: Credibility Intervals by Blockwise Proximal Monte Carlo Samplers. (arXiv:2203.09142v2 [cs.LG] UPDATED)
    Monitoring the Covid19 pandemic constitutes a critical societal stake that received considerable research efforts. The intensity of the pandemic on a given territory is efficiently measured by the reproduction number, quantifying the rate of growth of daily new infections. Recently, estimates for the time evolution of the reproduction number were produced using an inverse problem formulation with a nonsmooth functional minimization. While it was designed to be robust to the limited quality of the Covid19 data (outliers, missing counts), the procedure lacks the ability to output credibility interval based estimates. This remains a severe limitation for practical use in actual pandemic monitoring by epidemiologists that the present work aims to overcome by use of Monte Carlo sampling. After interpretation of the nonsmooth functional into a Bayesian framework, several sampling schemes are tailored to adjust to the nonsmooth nature of the resulting posterior distribution. The originality of the devised algorithms stems from combining a Langevin Monte Carlo sampling scheme with Proximal operators. The performance of the new algorithms in producing relevant credibility intervals for the reproduction number estimates and denoised counts is compared. Assessment is conducted on real daily new infection counts made available by the Johns Hopkins University. The interest of the devised monitoring tools is illustrated on Covid19 data from several different countries.
    IMKGA-SM: Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling. (arXiv:2301.02445v1 [cs.AI])
    Multimodal knowledge graph link prediction aims to improve the accuracy and efficiency of link prediction tasks for multimodal data. However, for complex multimodal information and sparse training data, it is usually difficult to achieve interpretability and high accuracy simultaneously for most methods. To address this difficulty, a new model is developed in this paper, namely Interpretable Multimodal Knowledge Graph Answer Prediction via Sequence Modeling (IMKGA-SM). First, a multi-modal fine-grained fusion method is proposed, and Vgg16 and Optical Character Recognition (OCR) techniques are adopted to effectively extract text information from images and images. Then, the knowledge graph link prediction task is modelled as an offline reinforcement learning Markov decision model, which is then abstracted into a unified sequence framework. An interactive perception-based reward expectation mechanism and a special causal masking mechanism are designed, which "converts" the query into an inference path. Then, an autoregressive dynamic gradient adjustment mechanism is proposed to alleviate the insufficient problem of multimodal optimization. Finally, two datasets are adopted for experiments, and the popular SOTA baselines are used for comparison. The results show that the developed IMKGA-SM achieves much better performance than SOTA baselines on multimodal link prediction datasets of different sizes.
    Topics as Entity Clusters: Entity-based Topics from Language Models and Graph Neural Networks. (arXiv:2301.02458v1 [cs.CL])
    Topic models aim to reveal the latent structure behind a corpus, typically conducted over a bag-of-words representation of documents. In the context of topic modeling, most vocabulary is either irrelevant for uncovering underlying topics or contains strong relationships with relevant concepts, impacting the interpretability of these topics. Furthermore, their limited expressiveness and dependency on language demand considerable computation resources. Hence, we propose a novel approach for cluster-based topic modeling that employs conceptual entities. Entities are language-agnostic representations of real-world concepts rich in relational information. To this end, we extract vector representations of entities from (i) an encyclopedic corpus using a language model; and (ii) a knowledge base using a graph neural network. We demonstrate that our approach consistently outperforms other state-of-the-art topic models across coherency metrics and find that the explicit knowledge encoded in the graph-based embeddings provides more coherent topics than the implicit knowledge encoded with the contextualized embeddings of language models.
    Deep leakage from gradients. (arXiv:2301.02621v1 [cs.CR])
    With the development of artificial intelligence technology, the Federated Learning (FL) model has been widely used in many industries for its high efficiency and confidentiality. Some researchers have explored its confidentiality and designed some algorithms to attack training data sets, but these algorithms all have their own limitations. Therefore, most people still believe that local machine learning gradient information is safe and reliable. In this paper, an algorithm based on gradient features is designed to attack the federated learning model in order to attract more attention to the security of federated learning systems. In a federated learning system, the gradient contains little information compared with the original training data set, but this project intends to restore the original training image data through gradient information. Convolutional Neural Networks (CNNs) have excellent performance in image processing. Therefore, the federated learning model of this project is equipped with a Convolutional Neural Network structure, and the model is trained using image data sets. The algorithm calculates the virtual gradient by generating virtual image labels. Then the virtual gradient is matched with the real gradient to restore the original image. This attack algorithm is written in Python, uses the cat and dog classification Kaggle data sets, and gradually extends from the fully connected layer to the convolution layer, thus improving its universality. At present, the average squared error between the data recovered by this algorithm and the original image information is approximately 5, and the vast majority of images can be completely restored according to the gradient information given, indicating that the gradient of a federated learning system is not absolutely safe and reliable.
    BELLATREX: Building Explanations through a LocaLly AccuraTe Rule EXtractor. (arXiv:2203.15511v2 [cs.LG] UPDATED)
    Tree-ensemble algorithms, such as random forest, are effective machine learning methods popular for their flexibility, high performance, and robustness to overfitting. However, since multiple learners are combined, they are not as interpretable as a single decision tree. In this work we propose a novel method that is Building Explanations through a LocalLy AccuraTe Rule EXtractor (Bellatrex), and is able to explain the forest prediction for a given test instance with only a few diverse rules. Starting from the decision trees generated by a random forest, our method 1) pre-selects a subset of the rules used to make the prediction, 2) creates a vector representation of such rules, 3) projects them to a low-dimensional space, 4) clusters such representations to pick a rule from each cluster to explain the instance prediction. We test the effectiveness of Bellatrex on 89 real-world datasets and we demonstrate the validity of our method for binary classification, regression, multi-label classification and time-to-event tasks. To the best of our knowledge, it is the first time that an interpretability toolbox can handle all these tasks within the same framework. We also show that our extracted surrogate model can approximate the performance of the corresponding ensemble model in all considered tasks, while selecting only few trees from the whole forest. We also show that our proposed approach substantially outperforms other explainable methods in terms of predictive performance.
    Task Aware Feature Extraction Framework for Sequential Dependence Multi-Task Learning. (arXiv:2301.02494v1 [cs.LG])
    Multi-task learning (MTL), which aims to simultaneously solve multiple tasks with a single model, has been successfully implemented in many real-world applications. The general idea of multi-task learning is to design global parameter sharing mechanisms and task-specific feature extractors to improve the performance of all tasks. However, sequential dependence between tasks is rarely studied but frequently encountered in e-commerce online recommendation, e.g. impression, click and conversion on a displayed product. There is little theoretical work on this problem, and the biased optimization objective adopted in most MTL methods deteriorates online performance. Besides, challenges still remain in balancing the trade-off between various tasks and effectively learning common and specific representations. In this paper, we first analyze sequential dependence MTL from a rigorous mathematical perspective and design a dependence task learning loss to provide an unbiased optimization objective. We then propose a Task Aware Feature Extraction (TAFE) framework for sequential dependence MTL, which makes it possible to selectively reconstruct implicit shared representations from a sample-wise view and extract explicit task-specific information in a more efficient way. Extensive experiments on offline datasets and online A/B implementation demonstrate the effectiveness of our proposed TAFE.
    Machine Fault Classification using Hamiltonian Neural Networks. (arXiv:2301.02243v1 [cs.LG])
    A new approach is introduced to classify faults in rotating machinery based on the total energy signature estimated from sensor measurements. The overall goal is to go beyond using black-box models and incorporate additional physical constraints that govern the behavior of mechanical systems. Observational data is used to train Hamiltonian neural networks that describe the conserved energy of the system for normal and various abnormal regimes. The estimated total energy function, in the form of the weights of the Hamiltonian neural network, serves as the new feature vector to discriminate between the faults using off-the-shelf classification models. The experimental results are obtained using the MaFaulDa database, where the proposed model yields a promising area under the curve (AUC) of 0.78 for the binary classification (normal vs abnormal) and 0.84 for the multi-class problem (normal, and 5 different abnormal regimes).
    Learning Personalized Brain Functional Connectivity of MDD Patients from Multiple Sites via Federated Bayesian Networks. (arXiv:2301.02423v1 [cs.LG])
    Identifying functional connectivity biomarkers of major depressive disorder (MDD) patients is essential to advance understanding of the disorder mechanisms and early intervention. However, due to the small sample size and the high dimension of available neuroimaging data, the performance of existing methods is often limited. Multi-site data could enhance the statistical power and sample size, while they are often subject to inter-site heterogeneity and data-sharing policies. In this paper, we propose a federated joint estimator, NOTEARS-PFL, for simultaneous learning of multiple Bayesian networks (BNs) with continuous optimization, to identify disease-induced alterations in MDD patients. We incorporate information shared between sites and site-specific information into the proposed federated learning framework to learn personalized BN structures by introducing the group fused lasso penalty. We develop the alternating direction method of multipliers, where in the local update step, the neuroimaging data is processed at each local site. Then the learned network structures are transmitted to the center for the global update. In particular, we derive a closed-form expression for the local update step and use the iterative proximal projection method to deal with the group fused lasso penalty in the global update step. We evaluate the performance of the proposed method on both synthetic and real-world multi-site rs-fMRI datasets. The results suggest that the proposed NOTEARS-PFL yields superior effectiveness and accuracy than the comparable methods.
    Deep learning for full-field ultrasonic characterization. (arXiv:2301.02378v1 [math.NA])
    This study takes advantage of recent advances in machine learning to establish a physics-based data analytic platform for distributed reconstruction of mechanical properties in layered components from full waveform data. In this vein, two logics, namely the direct inversion and physics-informed neural networks (PINNs), are explored. The direct inversion entails three steps: (i) spectral denoising and differentiation of the full-field data, (ii) building appropriate neural maps to approximate the profile of unknown physical and regularization parameters on their respective domains, and (iii) simultaneous training of the neural networks by minimizing the Tikhonov-regularized PDE loss using data from (i). PINNs furnish efficient surrogate models of complex systems with predictive capabilities via multitask learning where the field variables are modeled by neural maps endowed with (scalar or distributed) auxiliary parameters such as physical unknowns and loss function weights. PINNs are then trained by minimizing a measure of data misfit subject to the underlying physical laws as constraints. In this study, to facilitate learning from ultrasonic data, the PINNs loss adopts (a) wavenumber-dependent Sobolev norms to compute the data misfit, and (b) non-adaptive weights in a specific scaling framework to naturally balance the loss objectives by leveraging the form of PDEs germane to elastic-wave propagation. Both paradigms are examined via synthetic and laboratory test data. In the latter case, the reconstructions are performed at multiple frequencies and the results are verified by a set of complementary experiments highlighting the importance of verification and validation in data-driven modeling.
    Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data. (arXiv:2211.06759v2 [cs.LG] UPDATED)
    In molecular and biological sciences, experiments are expensive, time-consuming, and often subject to ethical constraints. Consequently, one often faces the challenging task of predicting desirable properties from small data sets or scarcely-labeled data sets. Although transfer learning can be advantageous, it requires the existence of a related large data set. This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder, in order to deal with scarcely-labeled data sets. In addition, a consensus technique is detailed. The proposed models are validated using five benchmark data sets. We also provide a thorough comparison to other competing methods, such as support vector machines, random forests, and gradient boosting decision trees, which are known for their good performance on small data sets. The performances of various methods are analyzed using residue-similarity (R-S) scores and R-S indices. Extensive computational experiments and theoretical analysis show that the new models perform very well even when as little as 1% of the data set is used as labeled data.
    Start Small: Training Controllable Game Level Generators without Training Data by Learning at Multiple Sizes. (arXiv:2209.15052v2 [cs.LG] UPDATED)
    A level generator is a tool that generates game levels from noise. Training a generator without a dataset suffers from feedback sparsity, since it is unlikely to generate a playable level via random exploration. A common solution is shaped rewards, which guides the generator to achieve subgoals towards level playability, but they consume effort to design and require game-specific domain knowledge. This paper proposes a novel approach to train generators without datasets or shaped rewards by learning at multiple level sizes starting from small sizes and up to the desired sizes. The denser feedback at small sizes negates the need for shaped rewards. Additionally, the generators learn to build levels at various sizes, including sizes they were not trained for. We apply our approach to train recurrent auto-regressive generative flow networks (GFlowNets) for controllable level generation. We also adapt diversity sampling to be compatible with GFlowNets. The results show that our generators create diverse playable levels at various sizes for Sokoban, Zelda, and Danger Dave. When compared with controllable reinforcement learning level generators for Sokoban, the results show that our generators achieve better controllability and competitive diversity, while being 9x faster at training and level generation.
    Sequentially Controlled Text Generation. (arXiv:2301.02299v1 [cs.CL])
    While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and do not follow human-like writing structure. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse as a starting point for this task. We develop a sequential controlled text generation pipeline with generation and editing. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control-accuracy, grammaticality, coherency and topicality, approaching human-level writing performance.
    Restarts subject to approximate sharpness: A parameter-free and optimal scheme for first-order methods. (arXiv:2301.02268v1 [math.OC])
    Sharpness is an almost generic assumption in continuous optimization that bounds the distance from minima by objective function suboptimality. It leads to the acceleration of first-order methods via restarts. However, sharpness involves problem-specific constants that are typically unknown, and previous restart schemes reduce convergence rates. Moreover, such schemes are challenging to apply in the presence of noise or approximate model classes (e.g., in compressive imaging or learning problems), and typically assume that the first-order method used produces feasible iterates. We consider the assumption of approximate sharpness, a generalization of sharpness that incorporates an unknown constant perturbation to the objective function error. This constant offers greater robustness (e.g., with respect to noise or relaxation of model classes) for finding approximate minimizers. By employing a new type of search over the unknown constants, we design a restart scheme that applies to general first-order methods and does not require the first-order method to produce feasible iterates. Our scheme maintains the same convergence rate as when assuming knowledge of the constants. The rates of convergence we obtain for various first-order methods either match the optimal rates or improve on previously established rates for a wide range of problems. We showcase our restart scheme on several examples and point to future applications and developments of our framework and theory.
    Training trajectories, mini-batch losses and the curious role of the learning rate. (arXiv:2301.02312v1 [cs.LG])
    Stochastic gradient descent plays a fundamental role in nearly all applications of deep learning. However, its efficiency and remarkable ability to converge to a global minimum remain shrouded in mystery. The loss function defined on a large network with a large amount of data is known to be non-convex. However, relatively little has been explored about the behavior of the loss function on individual batches. Remarkably, we show that for ResNet the loss for any fixed mini-batch, when measured alongside the SGD trajectory, appears to be accurately modeled by a quadratic function. In particular, a very low loss value can be reached in just one step of gradient descent with a large enough learning rate. We propose a simple model and a geometric interpretation that allow us to analyze the relationship between the gradients of stochastic mini-batches and the full batch, and how the learning rate affects the relationship between improvement on individual and full batches. Our analysis allows us to discover the equivalency between iterate aggregates and specific learning rate schedules. In particular, for Exponential Moving Average (EMA) and Stochastic Weight Averaging we show that our proposed model matches the observed training trajectories on ImageNet. Our theoretical model predicts that an even simpler averaging technique, averaging just two points a few steps apart, also significantly improves accuracy compared to the baseline. We validated our findings on ImageNet and other datasets using the ResNet architecture.
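    To make the "one step with a large enough learning rate" observation concrete, here is a minimal one-dimensional sketch. It assumes the mini-batch loss is exactly quadratic along the step direction; the curvature, minimizer, and loss floor below are invented numbers, not values from the paper.

```python
def quadratic_loss(w, curvature, w_star, floor):
    """Idealized mini-batch loss: a parabola with minimum value `floor` at `w_star`."""
    return 0.5 * curvature * (w - w_star) ** 2 + floor


def gradient_step(w, curvature, w_star, lr):
    """One gradient descent step on the quadratic above."""
    grad = curvature * (w - w_star)
    return w - lr * grad


curvature, w_star, floor = 4.0, 1.5, 0.1  # hypothetical values
w0 = -2.0

# A small learning rate barely moves the loss ...
w_small = gradient_step(w0, curvature, w_star, lr=0.01)
loss_small = quadratic_loss(w_small, curvature, w_star, floor)

# ... while lr = 1 / curvature lands on the minimizer in a single step.
w_exact = gradient_step(w0, curvature, w_star, lr=1.0 / curvature)
loss_exact = quadratic_loss(w_exact, curvature, w_star, floor)
```

    In this toy model the step with learning rate 1/curvature reaches the loss floor exactly; on a real network the quadratic fit is only approximate, so the same step would be expected to get close to, not onto, the mini-batch minimum.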
    Evaluating counterfactual explanations using Pearl's counterfactual method. (arXiv:2301.02499v1 [stat.ML])
    Counterfactual explanations (CEs) are methods for generating an alternative scenario that produces a different desirable outcome. For example, if a student is predicted to fail a course, then counterfactual explanations can provide the student with alternate ways so that they would be predicted to pass. The applications are many. However, CEs are currently generated from machine learning models that do not necessarily take into account the true causal structure in the data. By doing this, bias can be introduced into the CE quantities. I propose in this study to test the CEs using Judea Pearl's method of computing counterfactuals, which has thus far, surprisingly, not been seen in the counterfactual explanation (CE) literature. I furthermore evaluate these CEs on three different causal structures to show how the true underlying causal structure affects the CEs that are generated. This study presents a method of evaluating CEs using Pearl's method and shows (although using a limited sample size) that thirty percent of the CEs conflicted with those computed by Pearl's method. This shows that we cannot simply trust CEs and that it is vital for us to know the true causal structure before we blindly compute counterfactuals using the original machine learning model.
    gRoMA: a Tool for Measuring Deep Neural Networks Global Robustness. (arXiv:2301.02288v1 [cs.LG])
    Deep neural networks (DNNs) are a state-of-the-art technology, capable of outstanding performance in many key tasks. However, it is challenging to integrate DNNs into safety-critical systems, such as those in the aerospace or automotive domains, due to the risk of adversarial inputs: slightly perturbed inputs that can cause the DNN to make grievous mistakes. Adversarial inputs have been shown to plague even modern DNNs; and so the risks they pose must be measured and mitigated to allow the safe deployment of DNNs in safety-critical systems. Here, we present a novel and scalable tool called gRoMA, which uses a statistical approach for formally measuring the global categorial robustness of a DNN - i.e., the probability of randomly encountering an adversarial input for a specific output category. Our tool operates on pre-trained, black-box classification DNNs. It randomly generates input samples that belong to an output category of interest, measures the DNN's susceptibility to adversarial inputs around these inputs, and then aggregates the results to infer the overall global robustness of the DNN up to some small bounded error. For evaluation purposes, we used gRoMA to measure the global robustness of the widespread Densenet DNN model over the CIFAR10 dataset and our results exposed significant gaps in the robustness of the different output categories. This experiment demonstrates the scalability of the new approach and showcases its potential for allowing DNNs to be deployed within critical systems of interest.
    Extreme Q-Learning: MaxEnt RL without Entropy. (arXiv:2301.02328v1 [cs.LG])
    Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains with an infinite number of possible actions. In this work, we introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT), drawing inspiration from Economics. By doing so, we avoid computing Q-values using out-of-distribution actions which is often a substantial source of error. Our key insight is to introduce an objective that directly estimates the optimal soft-value functions (LogSumExp) in the maximum entropy RL setting without needing to sample from a policy. Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms, that do not explicitly require access to a policy or its entropy. Our method obtains consistently strong performance in the D4RL benchmark, outperforming prior works by 10+ points on some tasks while offering moderate improvements over SAC and TD3 on online DM Control tasks.
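    For readers unfamiliar with the soft-value (LogSumExp) quantity mentioned above, the sketch below computes it for a finite set of candidate actions under a uniform reference distribution. Both of those choices are simplifying assumptions for illustration; the paper's contribution is precisely an estimator that avoids sampling actions this way, and this code does not implement it.

```python
import math


def soft_value(q_values, temperature):
    """Temperature-scaled log-mean-exp of Q-values (uniform weights over actions).

    Subtracting the maximum first keeps the exponentials numerically stable.
    """
    q_max = max(q_values)
    total = sum(math.exp((q - q_max) / temperature) for q in q_values)
    return q_max + temperature * math.log(total / len(q_values))


q_values = [1.0, 2.0, 3.0]  # hypothetical Q(s, a) for three actions
near_max = soft_value(q_values, temperature=1e-3)   # low temperature: close to max Q
near_mean = soft_value(q_values, temperature=1e3)   # high temperature: close to mean Q
```

    The temperature interpolates between the hard maximum used in standard Q-learning and a plain average, which is why the soft value is a natural target in the maximum entropy setting.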
    Deep Latent Variable Models for Semi-supervised Paraphrase Generation. (arXiv:2301.02275v1 [cs.CL])
    This paper explores deep latent variable models for semi-supervised paraphrase generation, where the missing target pair is modelled as a latent paraphrase sequence. We present a novel unsupervised model named variational sequence auto-encoding reconstruction (VSAR), which performs latent sequence inference given an observed text. To leverage information from text pairs, we introduce a supervised model named dual directional learning (DDL). Combining VSAR with DDL (DDL+VSAR) enables us to conduct semi-supervised learning; however, the combined model suffers from a cold-start problem. To combat this issue, we propose to deal with better weight initialisation, leading to a two-stage training scheme named knowledge reinforced training. Our empirical evaluations suggest that the combined model yields competitive performance against the state-of-the-art supervised baselines on complete data. Furthermore, in scenarios where only a fraction of the labelled pairs are available, our combined model consistently outperforms the strong supervised model baseline (DDL and Transformer) by a significant margin.
    MSCDA: Multi-level Semantic-guided Contrast Improves Unsupervised Domain Adaptation for Breast MRI Segmentation in Small Datasets. (arXiv:2301.02554v1 [q-bio.QM])
    Deep learning (DL) applied to breast tissue segmentation in magnetic resonance imaging (MRI) has received increased attention in the last decade. However, the domain shift that arises from different vendors, acquisition protocols, and biological heterogeneity remains an important and challenging obstacle on the path towards clinical implementation. Recently, unsupervised domain adaptation (UDA) methods have attempted to mitigate this problem by incorporating self-training with contrastive learning. To better exploit the underlying semantic information of the image at different levels, we propose a Multi-level Semantic-guided Contrastive Domain Adaptation (MSCDA) framework to align the feature representation between domains. In particular, we extend the contrastive loss by incorporating pixel-to-pixel, pixel-to-centroid, and centroid-to-centroid contrasts to integrate semantic information of images. We utilize a category-wise cross-domain sampling strategy to sample anchors from target images and build a hybrid memory bank to store samples from source images. Two breast MRI datasets were retrospectively collected: the source dataset contains non-contrast MRI examinations from 11 healthy volunteers and the target dataset contains contrast-enhanced MRI examinations of 134 invasive breast cancer patients. We set up experiments from source T2W image to target dynamic contrast-enhanced (DCE)-T1W image (T2W-to-T1W) and from source T1W image to target T2W image (T1W-to-T2W). The proposed method achieved Dice similarity coefficients (DSC) of 89.2% and 84.0% in T2W-to-T1W and T1W-to-T2W, respectively, outperforming state-of-the-art methods. Notably, good performance is still achieved with a smaller source dataset, proving that our framework is label-efficient.
    Conformal Loss-Controlling Prediction. (arXiv:2301.02424v1 [cs.LG])
    Conformal prediction is a learning framework controlling prediction coverage of prediction sets, which can be built on any learning algorithm for point prediction. This work proposes a learning framework named conformal loss-controlling prediction, which extends conformal prediction to the situation where the value of a loss function needs to be controlled. Different from existing works about risk-controlling prediction sets and conformal risk control with the purpose of controlling the expected values of loss functions, the proposed approach in this paper focuses on the loss for any test object, which is an extension of conformal prediction from miscoverage loss to some general loss. The controlling guarantee is proved under the assumption of exchangeability of data in finite-sample cases and the framework is tested empirically for classification with a class-varying loss and statistical postprocessing of numerical weather forecasting applications, which are introduced as point-wise classification and point-wise regression problems. All theoretical analysis and experimental results confirm the effectiveness of our loss-controlling approach.
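    For readers unfamiliar with the baseline this abstract extends, a minimal sketch of standard split conformal prediction for regression follows. It illustrates coverage control only, not the loss-controlling method proposed in the paper; the function names, the absolute-residual score, and the toy residuals are assumptions made for this example.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction, calibration_residuals, alpha=0.1):
    """Symmetric interval around a point prediction from held-out residuals."""
    q = conformal_quantile(calibration_residuals, alpha)
    return point_prediction - q, point_prediction + q


# Hypothetical absolute residuals |y - f(x)| on a held-out calibration set.
residuals = [0.3, 0.1, 0.4, 0.2]
print(prediction_interval(2.0, residuals, alpha=0.5))
```

    Under exchangeability of calibration and test data, an interval built this way covers the true value with probability at least 1 - alpha.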
    Combined mechanistic and machine learning method for construction of oil reservoir permeability map consistent with well test measurements. (arXiv:2301.02585v1 [physics.geo-ph])
    We propose a new method for construction of the absolute permeability map consistent with the interpreted results of well logging and well test measurements in oil reservoirs. Nadaraya-Watson kernel regression is used to approximate the two-dimensional spatial distribution of the rock permeability. Parameters of the kernel regression are tuned by solving an optimization problem in which, for each well placed in an oil reservoir, we minimize the difference between the actual and predicted values of (i) absolute permeability at the well location (from well logging); (ii) absolute integral permeability of the domain around the well; and (iii) skin factor (from well tests). The inverse problem is solved via multiple solutions to forward problems, in which we estimate the integral permeability of the reservoir surrounding a well and the skin factor with a surrogate model. The surrogate model is an artificial neural network trained on a physics-based synthetic dataset generated by numerically simulating the bottomhole pressure decline curve in a reservoir simulator and then interpreting it with a semi-analytical reservoir model. The developed method for reservoir permeability map construction is applied to the available reservoir model (Egg Model) with a highly heterogeneous permeability distribution due to the presence of highly permeable channels. We show that the constructed permeability map is hydrodynamically similar to the original one. Numerical simulations of production in the reservoir with constructed and original permeability maps are quantitatively similar in terms of the pore pressure and fluid saturation distributions at the end of the simulation period. Moreover, we obtain a good match between the results of the numerical simulations in terms of the flow rates and total volumes of produced oil, water and injected water.
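    The Nadaraya-Watson estimator named in this abstract is a kernel-weighted average of observed values. The sketch below shows it for 2D well locations under stated assumptions: a Gaussian kernel with a single bandwidth, and made-up well coordinates and permeability values. The paper tunes the kernel parameters through an optimization problem that is not reproduced here.

```python
import math


def nadaraya_watson(query, points, values, bandwidth):
    """Kernel-weighted average of `values` observed at 2D `points`."""
    weights = [
        math.exp(-((query[0] - x) ** 2 + (query[1] - y) ** 2) / (2 * bandwidth ** 2))
        for x, y in points
    ]
    total = sum(weights)  # can underflow to 0 if the query is far from every point
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical well locations and permeabilities (arbitrary units).
wells = [(0.0, 0.0), (2.0, 0.0)]
permeability = [100.0, 300.0]
print(nadaraya_watson((1.0, 0.0), wells, permeability, bandwidth=1.0))  # midpoint
print(nadaraya_watson((0.0, 0.0), wells, permeability, bandwidth=0.1))  # near first well
```

    A smaller bandwidth makes the map follow individual wells more closely, while a larger one smooths across them.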
    Lower Complexity Bounds of Finite-Sum Optimization Problems: The Results and Construction. (arXiv:2103.08280v5 [math.OC] UPDATED)
    In this paper, we study the lower complexity bounds for finite-sum optimization problems, where the objective is the average of $n$ individual component functions. We consider Proximal Incremental First-order (PIFO) algorithms which have access to the gradient and proximal oracles for each component function. To incorporate loopless methods, we also allow PIFO algorithms to obtain the full gradient infrequently. We develop a novel approach to constructing the hard instances, which partitions the tridiagonal matrix of classical examples into $n$ groups. This construction is friendly to the analysis of PIFO algorithms. Based on this construction, we establish the lower complexity bounds for finite-sum minimax optimization problems when the objective is convex-concave or nonconvex-strongly-concave and the class of component functions is $L$-average smooth. Most of these bounds are nearly matched by existing upper bounds up to log factors. We can also derive similar lower bounds for finite-sum minimization problems as previous work under both smoothness and average smoothness assumptions. Our lower bounds imply that proximal oracles for smooth functions are not much more powerful than gradient oracles.
    Reversibility of elliptical slice sampling revisited. (arXiv:2301.02426v1 [math.ST])
    We discuss the well-definedness of elliptical slice sampling, a Markov chain approach for approximate sampling of posterior distributions introduced by Murray, Adams and MacKay 2010. We point to a regularity requirement and provide an alternative proof of the reversibility property. In particular, this guarantees the correctness of the slice sampling scheme also on infinite-dimensional separable Hilbert spaces.
    Graph Contrastive Learning for Multi-omics Data. (arXiv:2301.02242v1 [q-bio.GN])
    Advancements in technologies related to working with omics data require novel computation methods to fully leverage information and help develop a better understanding of human diseases. This paper studies the effects of introducing graph contrastive learning to help leverage graph structure and information to produce better representations for downstream classification tasks for multi-omics datasets. We present a learning framework named Multi-Omics Graph Contrastive Learner (MOGCL) which outperforms several approaches for integrating multi-omics data for supervised learning tasks. We show that pre-training graph models with a contrastive methodology along with fine-tuning them in a supervised manner is an efficient strategy for multi-omics data classification.  ( 2 min )
    A Data-Driven Gaussian Process Filter for Electrocardiogram Denoising. (arXiv:2301.02607v1 [eess.SP])
    Objective: Gaussian Processes (GP)-based filters, which have been effectively used for various applications including electrocardiogram (ECG) filtering can be computationally demanding and the choice of their hyperparameters is typically ad hoc. Methods: We develop a data-driven GP filter to address both issues, using the notion of the ECG phase domain -- a time-warped representation of the ECG beats onto a fixed number of samples and aligned R-peaks, which is assumed to follow a Gaussian distribution. Under this assumption, the computation of the sample mean and covariance matrix is simplified, enabling an efficient implementation of the GP filter in a data-driven manner, with no ad hoc hyperparameters. The proposed filter is evaluated and compared with a state-of-the-art wavelet-based filter, on the PhysioNet QT Database. The performance is evaluated by measuring the signal-to-noise ratio (SNR) improvement of the filter at SNR levels ranging from -5 to 30dB, in 5dB steps, using additive noise. For a clinical evaluation, the error between the estimated QT-intervals of the original and filtered signals is measured and compared with the benchmark filter. Results: It is shown that the proposed GP filter outperforms the benchmark filter for all the tested noise levels. It also outperforms the state-of-the-art filter in terms of QT-interval estimation error bias and variance. Conclusion: The proposed GP filter is a versatile technique for preprocessing the ECG in clinical and research applications, is applicable to ECG of arbitrary lengths and sampling frequencies, and provides confidence intervals for its performance.  ( 2 min )
    Deep Biological Pathway Informed Pathology-Genomic Multimodal Survival Prediction. (arXiv:2301.02383v1 [q-bio.QM])
    The integration of multi-modal data, such as pathological images and genomic data, is essential for understanding cancer heterogeneity and complexity for personalized treatments, as well as for enhancing survival predictions. Despite the progress made in integrating pathology and genomic data, most existing methods cannot mine the complex inter-modality relations thoroughly. Additionally, identifying explainable features from these models that govern preclinical discovery and clinical prediction is crucial for cancer diagnosis, prognosis, and therapeutic response studies. We propose PONET- a novel biological pathway-informed pathology-genomic deep model that integrates pathological images and genomic data not only to improve survival prediction but also to identify genes and pathways that cause different survival rates in patients. Empirical results on six of The Cancer Genome Atlas (TCGA) datasets show that our proposed method achieves superior predictive performance and reveals meaningful biological interpretations. The proposed method establishes insight into how to train biologically informed deep networks on multimodal biomedical data which will have general applicability for understanding diseases and predicting response and resistance to treatment.  ( 2 min )
    TWR-MCAE: A Data Augmentation Method for Through-the-Wall Radar Human Motion Recognition. (arXiv:2301.02488v1 [eess.SP])
    To solve the problems of reduced accuracy and prolonging convergence time of through-the-wall radar (TWR) human motion due to wall attenuation, multipath effect, and system interference, we propose a multilink auto-encoding neural network (TWR-MCAE) data augmentation method. Specifically, the TWR-MCAE algorithm is jointly constructed by a singular value decomposition (SVD)-based data preprocessing module, an improved coordinate attention module, a compressed sensing learnable iterative shrinkage threshold reconstruction algorithm (LISTA) module, and an adaptive weight module. The data preprocessing module achieves wall clutter, human motion features, and noise subspaces separation. The improved coordinate attention module achieves clutter and noise suppression. The LISTA module achieves human motion feature enhancement. The adaptive weight module learns the weights and fuses the three subspaces. The TWR-MCAE can suppress the low-rank characteristics of wall clutter and enhance the sparsity characteristics in human motion at the same time. It can be linked before the classification step to improve the feature extraction capability without adding other prior knowledge or recollecting more data. Experiments show that the proposed algorithm gets a better peak signal-to-noise ratio (PSNR), which increases the recognition accuracy and speeds up the training process of the back-end classifiers.  ( 2 min )
    DANLIP: Deep Autoregressive Networks for Locally Interpretable Probabilistic Forecasting. (arXiv:2301.02332v1 [cs.LG])
    Despite the high performance of neural network-based time series forecasting methods, the inherent challenge in explaining their predictions has limited their applicability in certain application areas. Due to the difficulty in identifying causal relationships between the input and output of such black-box methods, they rarely have been adopted in domains such as legal and medical fields in which the reliability and interpretability of the results can be essential. In this paper, we propose DANLIP, a novel deep learning-based probabilistic time series forecasting architecture that is intrinsically interpretable. We conduct experiments with multiple datasets and performance metrics and empirically show that our model is not only interpretable but also provides comparable performance to state-of-the-art probabilistic time series forecasting methods. Furthermore, we demonstrate that interpreting the parameters of the stochastic processes of interest can provide useful insights into several application areas.
    Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation. (arXiv:2301.02262v1 [eess.AS])
    This paper proposes singing voice synthesis (SVS) based on frame-level sequence-to-sequence models considering vocal timing deviation. In SVS, it is essential to synchronize the timing of singing with temporal structures represented by scores, taking into account that there are differences between actual vocal timing and note start timing. In many SVS systems including our previous work, phoneme-level score features are converted into frame-level ones on the basis of phoneme boundaries obtained by external aligners to take into account vocal timing deviations. Therefore, the sound quality is affected by the aligner accuracy in this system. To alleviate this problem, we introduce an attention mechanism with frame-level features. In the proposed system, the attention mechanism absorbs alignment errors in phoneme boundaries. Additionally, we evaluate the system with pseudo-phoneme-boundaries defined by heuristic rules based on musical scores when there is no aligner. The experimental results show the effectiveness of the proposed system.  ( 2 min )
    GNN-based Passenger Request Prediction. (arXiv:2301.02515v1 [cs.LG])
    Passenger request prediction is essential for operations planning, control, and management in ride-sharing platforms. While the demand prediction problem has been studied extensively, the Origin-Destination (OD) flow prediction of passengers has received less attention from the research community. This paper develops a Graph Neural Network framework along with the Attention Mechanism to predict the OD flow of passengers. The proposed framework exploits various linear and non-linear dependencies that arise among requests originating from different locations and captures the repetition pattern and the contextual data of that place. Moreover, the optimal size of the grid cell that covers the road network and preserves the complexity and accuracy of the model is determined. Extensive simulations are conducted to examine the characteristics of our proposed approach and its various components. The results show the superior performance of our proposed model compared to the existing baselines.  ( 2 min )
    TrojanPuzzle: Covertly Poisoning Code-Suggestion Models. (arXiv:2301.02344v1 [cs.CR])
    With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model's training or fine-tuning phases by injecting malicious data. Poisoning attacks could be designed to influence the model's suggestions at run time for chosen contexts, such as inducing the model into suggesting insecure code payloads. To achieve this, prior poisoning attacks explicitly inject the insecure code payload into the training data, making the poisoning data detectable by static analysis tools that can remove such malicious data from the training set. In this work, we demonstrate two novel data poisoning attacks, COVERT and TROJANPUZZLE, that can bypass static analysis by planting malicious poisoning data in out-of-context regions such as docstrings. Our most novel attack, TROJANPUZZLE, goes one step further in generating less suspicious poisoning data by never including certain (suspicious) parts of the payload in the poisoned data, while still inducing a model that suggests the entire payload when completing code (i.e., outside docstrings). This makes TROJANPUZZLE robust against signature-based dataset-cleansing methods that identify and filter out suspicious sequences from the training data. Our evaluation against two model sizes demonstrates that both COVERT and TROJANPUZZLE have significant implications for how practitioners should select code used to train or tune code-suggestion models.  ( 2 min )
    Valid P-Value for Deep Learning-Driven Salient Region. (arXiv:2301.02437v1 [stat.ML])
    Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a salient region in the form of p-values. Our idea is to consider a salient region as a selected hypothesis by the trained deep learning model and employ the selective inference framework. The proposed method can provably control the probability of false positive detections of salient regions. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets. Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.  ( 2 min )
    Multi-Genre Music Transformer -- Composing Full Length Musical Piece. (arXiv:2301.02385v1 [cs.SD])
    In the task of generating music, the art factor plays a big role and is a great challenge for AI. Previous work involving adversarial training to produce new music pieces and modeling the compatibility of variety in music (beats, tempo, musical stems) demonstrated great examples of learning this task, though it was limited to generating mashups or learning features from tempo and key distributions to produce similar patterns. Compound Word Transformer was able to represent the music generation task as a sequence generation challenge involving musical events defined by compound words. These musical events give a more accurate description of note progression, chord changes, harmony and the art factor. The objective of the project is to implement a Multi-Genre Transformer which learns to produce music pieces through a more adaptive learning process involving a more challenging task where the genre or form of the composition is also considered. We built a multi-genre compound word dataset and implemented a linear transformer which was trained on this dataset. We call this the Multi-Genre Transformer; it was able to generate full-length new musical pieces which are diverse and comparable to original tracks. The model trains 2-5 times faster than other models discussed.  ( 2 min )
    Myths and Legends in High-Performance Computing. (arXiv:2301.02432v1 [cs.DC])
    In this humorous and thought provoking article, we discuss certain myths and legends that are folklore among members of the high-performance computing community. We collected those myths from conversations at conferences and meetings, product advertisements, papers, and other communications such as tweets, blogs, and news articles within (and beyond) our community. We believe they represent the zeitgeist of the current era of massive change, driven by the end of many scaling laws such as Dennard scaling and Moore's law. While some laws end, new directions open up, such as algorithmic scaling or novel architecture research. However, these myths are rarely based on scientific facts but often on some evidence or argumentation. In fact, we believe that this is the very reason for the existence of many myths and why they cannot be answered clearly. While it feels like there should be clear answers for each, some may remain endless philosophical debates such as the question whether Beethoven was better than Mozart. We would like to see our collection of myths as a discussion of possible new directions for research and industry investment.  ( 2 min )

  • Open

    Leveraging AI To Build Apps
    submitted by /u/emanresu_2017 [link] [comments]  ( 46 min )
    I programmed my phone to do standup comedy
    submitted by /u/iusereditt [link] [comments]  ( 46 min )
    AI Dream 147 - Unbelievable MINDBLOW AI Video - MASTERPIECE 30min
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    Is there a free AI writer that doesn't have a word limit
    Open AI playground is fine and all but it has its limits. It can only write so much. Anything stronger than Open AI but free to use? submitted by /u/Zan_korida [link] [comments]  ( 46 min )
    I built ColabRating, a site where you can show off your Google Colab.
    Colabs are how I got into AI, and I think they're a great place to start - looking at other people's, and building your own. I couldn't find any rating sites for Colabs, so I made one. Any suggestions as to how to make it better would be appreciated. I'll add categories as and when there are enough colabs on there to need them. https://colabrating.com/ submitted by /u/andysurtees [link] [comments]  ( 47 min )
    What is involved in training a language model like ChatGPT?
    ChatGPT is ok, but it isn't trained on specific niches (Greek history). You could train it on this data though. What all is involved in training a model like this to be able to talk to you as does ChatGPT but using data you give it? Is there an existing program you can use to do this? submitted by /u/eratonnn [link] [comments]  ( 49 min )
    Intelligent Document Processing System
    Hi! I am trying to learn about Intelligent Document Processing. I need to build an automation tool to find some words in documents in PDF/Word format and make a checklist of what was found. These documents are digitized copies from a scanner. Some documents have 900 pages or more, and some are poor-quality scans from decades ago (I probably need to set up a database for each word). I know there are several tools which can do that job, but I am looking for something more accessible; the available ones are too expensive and target enterprises. Any guidance would be very helpful! submitted by /u/ThereisNothingHeeree [link] [comments]  ( 51 min )
    Baidu Create 2022: AI Developer Conference
    Hey all r/artificial, I wanted to invite you to join us for Baidu Create, our annual AI developer conference. We'll be exploring the latest developments in AI technology and innovation, and discussing how we can shape the future of AI together with a global community of creators. You can watch the conference live on Baidu YouTube: (https://www.youtube.com/watch?v=LlydjVDYb3A) at 10:00 pm PDT on January 9th. As a sneak peek, here are some of the tech innovations that will be unveiled at Baidu Create:
    * A band of virtual persons
    * Big models for generative AI
    * A metaverse built in 40 days
    * Voice interaction without echo
    * Connecting cars & roads with a shared perception
    * Generative search engine
    * Everyone can quantum
    * Scientific computing
    * Next-gen computing for the future cloud
    submitted by /u/trcytony [link] [comments]  ( 47 min )
    ChatGPT as a Cheating Tool
    submitted by /u/BackgroundResult [link] [comments]  ( 49 min )
    What is GPTZero, the ChatGPT Watermark Alternative?
    submitted by /u/BackgroundResult [link] [comments]  ( 48 min )
    Searching for a medical Q-A Dataset to categorize answers given by patients in response to an AI question
    Hello everyone, I recently started a project in which I want to assign the answer given by a human patient to a question asked by an AI to a category. Example: AI: Are you currently smoking? Patient: No, but I smoked until last September. It was quite a decent amount, but I stopped and haven't touched a cigarette since. Detected Category: No. I have searched far and wide for a dataset containing medical consultations with data annotated in that way, but haven't found any. What do you think is the best way to address something like this without having to start collecting data? Thank you submitted by /u/Fabianslife [link] [comments]  ( 50 min )
    Microsoft to integrate ChatGPT into Office products
    submitted by /u/Number_5_alive [link] [comments]  ( 46 min )
    Stable Diffusion PC INSTALLATION 2023 UPDATE! AI Art For BEGINNERS!
    submitted by /u/PuppetHere [link] [comments]  ( 49 min )
    5 Growing Libraries in Python for Causality Analysis
    submitted by /u/pasticciociccio [link] [comments]  ( 49 min )
    Google AI Introduces Muse: A Text-To-Image Generation/Editing Model via Masked Generative Transformers
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
    Researchers at Stanford have developed an Artificial Intelligence (AI) Model, SUMMON, that can generate Multi-Object Scenes from a Sequence of Human Interaction
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
    What happens to OpenAI if it reaches $29 billion in valuation
    submitted by /u/bendee983 [link] [comments]  ( 49 min )
    ChatGPT is just the beginning: How advanced AI is set to enter a new era
    submitted by /u/moviesdusk [link] [comments]  ( 49 min )
    I asked ChatGPT to cast countries as villains in a movie
    submitted by /u/EvilCorpGame [link] [comments]  ( 51 min )
    Neural Search vs. Google Search: What's the difference?
    I read an article about neural search and for those who don’t know, it’s a way for computers to find stuff using these special programs called neural networks. It can be used in lots of different ways, like searching the web, or helping you find things on your computer. It can also find things that are close to what we're looking for. It can even search through images, audio, and video. Sometimes it's even better to use a combination of Neural Search and other methods to get the best results. Sounds a lot like something Google Search would do? But from what I understand, Google uses "artificial neural networks" to try and understand what we are looking for and find the best websites for it. But I think Google also uses lots of other ways to help us find what we are looking for, so it's not just using the neural networks. Anyone know the difference? submitted by /u/gabuzgab [link] [comments]  ( 51 min )
    Summate.it - Quickly summarise web articles with OpenAI
    submitted by /u/fivefilters [link] [comments]  ( 51 min )
    Where to get started with AI?
    As I've been browsing Twitter and various imageboards/forums lately I've been seeing all the rage about AI recently and just feel so disconnected from it. Basically, where do I go to get started with AI and the software to do the things I've seen around the internet like, image generation or say writing an essay? Thanks. submitted by /u/Decryptionite [link] [comments]  ( 46 min )
  • Open

    [P] Built an at-cost, pay per second, open-source API for Tortoise text-to-speech (best I've heard!)
    Improved Tortoise TTS inference speed by 30% and packaged it up as a hosted API that charges per second. All code is open-sourced: https://github.com/metavoicexyz/tortoise-tts-modal-api, https://github.com/metavoicexyz/tortoise-tts It can be used via a UI at: https://tts.themetavoice.xyz There are more details here: https://twitter.com/vatsal_aggarwal/status/1612536547248836608?s=20 submitted by /u/Apprehensive-Tax-214 [link] [comments]  ( 57 min )
    [D] Do cloud GPUs run while my laptop is switched off?
    This might sound like a dumb question, but do cloud GPUs that you rent keep training when I switch off my laptop, since the GPU is still running somewhere in the cloud? Or does training stop automatically when I close my laptop? Also, would anyone know any cheap cloud GPU websites for training a TensorFlow NMT model on 12 million Reddit comments and replies? (Idea from sentdex's Reddit chatbot tutorial.) Thanks in advance :) submitted by /u/smileawe3211 [link] [comments]  ( 61 min )
    [D] Maarten Grootendorst: BERTopic, Data Science, Psychology | Learning from ML Episode 1
    This is the first episode of a new podcast on machine learning featuring Maarten Grootendorst. Maarten Grootendorst: BERTopic, Data Science, Psychology | Learning from Machine Learning #1 submitted by /u/slam0077 [link] [comments]  ( 56 min )
    [D] Looking for github package testing many decision tree models - it exists but I can't find it in my browser history
    Hi everyone, A couple of months ago I saw a github with a package testing many different decision tree models on the same (user provided) dataset, really fast, in python ; the goal is to select the optimal one programmatically. I can't remember if I discovered that package on hacker news, or on github trending. No luck sifting through my browser history. Would any of you recognise this description and know the package ? Help appreciated ! submitted by /u/Maaaaxime [link] [comments]  ( 73 min )
    [D] Am I reducing the dimensionality of the problem by using a categorical feature but with high cardinality?
    In practice, I am working with chemical formulations with thousands of ingredients. Using each ingredient (like a one-hot encoding) would explode the dimensionality of the problem. I am thinking about grouping all these ingredients into their "functional role" (20 or so). If so, I could greatly reduce the number of features, but the cardinality would be high for each feature. Did the dimensionality really go down from a thousand to 20? My intuition tells me that 20 should be multiplied by all the cardinalities of each feature, and as such, I haven't made much progress in reducing dimensionality. Does anyone have any insight or experience with these high-dimensional/high-cardinality problems and what is the best way to do feature engineering? submitted by /u/DreamyPen [link] [comments]  ( 59 min )
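    One way to make the question above concrete is to count columns under each encoding. The sketch below is an illustration only: the ingredient names, the `ROLE_OF` mapping, and the choice to sum ingredient fractions per role are all assumptions, not something stated in the post.

```python
from collections import defaultdict

# Hypothetical ingredient -> functional role mapping.
ROLE_OF = {
    "water": "solvent",
    "ethanol": "solvent",
    "sds": "surfactant",
    "tween80": "surfactant",
    "nacl": "salt",
}


def one_hot_width(formulations):
    """Number of columns if every distinct ingredient gets its own feature."""
    return len({ingredient for f in formulations for ingredient in f})


def role_features(formulation, roles):
    """Sum ingredient fractions per role: one numeric column per role."""
    totals = defaultdict(float)
    for ingredient, fraction in formulation.items():
        totals[ROLE_OF[ingredient]] += fraction
    return [totals[role] for role in roles]


formulations = [
    {"water": 0.7, "ethanol": 0.2, "sds": 0.1},
    {"water": 0.9, "tween80": 0.05, "nacl": 0.05},
]
roles = sorted(set(ROLE_OF.values()))
print(one_hot_width(formulations))  # one column per distinct ingredient
print([role_features(f, roles) for f in formulations])  # one column per role
```

    With this aggregation the width really is the number of roles, at the cost of losing which ingredient filled each role. Keeping the ingredient identity as a categorical value per role and then one-hot encoding it would bring back roughly one column per ingredient.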
    [R] Diffusion language models
    Hi /r/ML, I wrote down my thoughts about what it might take for diffusion to displace autoregression in the field of language modelling (as it has in perceptual domains, like image/audio/video generation). Let me know what you think! https://benanne.github.io/2023/01/09/diffusion-language.html submitted by /u/benanne [link] [comments]  ( 57 min )
    What is a "justified classification"? [R][P]
    And how to make a justified classification, for example when dealing with a plethora of content/items split between two buckets? My initial understanding is to provide a rationale, but is there a specific format for doing "justified classification"? How to present rationale? What is needed for rationale, peer-review sources? https://proceedings.mlr.press/v89/cohen19a.html https://arxiv.org/abs/1702.05659 submitted by /u/pmdev1234 [link] [comments]  ( 57 min )
    [D] Understanding the discrete behavior of Neural Nets
    We all know that deep learning models are extremely susceptible to noise and can be easily fooled by adding a small amount of it. Such perturbations can be calculated by methods like the Fast Gradient method and are almost imperceptible to human eyes. But there is a way to somewhat mitigate adversarial attacks and force neural networks to behave in a more continuous fashion: Lipschitz regularization. It is a method for enforcing a certain level of smoothness on the output of a machine-learning model. It can improve the model’s generalization performance and help prevent overfitting. It is particularly useful for deep learning models, which are prone to overfitting due to their large number of parameters. Link to the full article: https://medium.com/p/fdeafb2d5c14 submitted by /u/Difficult-Race-1188 [link] [comments]  ( 63 min )
    Best language model for filling multiple related masks [D]
    I would like to fill sentences where I know the first and last word. I've been experimenting with BERT and using [mask] [mask] etc, but the returned values don't seem to form a coherent sentence. Is there a better model to use please? submitted by /u/shacrawford [link] [comments]  ( 60 min )
    [D] I want to use GPT-J-6B for my story-writing project but I have a few questions about it.
    - Cost, Effort, and Performance-wise, does it make more sense to instead just pay to use the OpenAI API and use a cheaper GPT-3 model to lessen business costs? My biggest concern is having my entire business reliant on a 3rd-party API, even more so than the costs of using the model. - How good is it at writing short stories? If there are better open-source alternatives for doing this better or at a similar level but less resource expensive, what are they? - How resource-expensive is it to use locally? These are my laptop capabilities:16.0 GB of RAM, AMD Ryzen 7 5800H with Radeon Graphics 3.20 GHz. - How would I approach fine-tuning it? Are there any resources going through the step-by-step process? Currently, in my mind, I just need to shove a large free-to-use data-set like stories and wait like a day but I have no expertise in this area. - If I want to incorporate it into a website with an API that takes prompts from users, are there any costs that I should account for? Is there a way to minimize these costs? For example, is there a specific API set-up or one-time cost like an expensive laptop to host it locally and take prompts that I could be implementing? - Are there any concerns I should have when scaling it for users, such as costs and slow response rate? Also, is there a cap in terms of the requests it can handle or is that just limited by what my own machine can handle? submitted by /u/learningmoreandmore [link] [comments]  ( 62 min )
    [N] What's next for AI?
    What's next for AI | MIT Technology Review submitted by /u/vsmolyakov [link] [comments]  ( 60 min )
  • Open

    Choosing Microcontroller For Neural Net
    I work in hardware and have been given a trained and working neural net (TF Lite file) with the goal of picking a microcontroller to run it in real time. I'm unsure of the best way to evaluate microcontrollers for performance/cost, or what key metrics to use when evaluating this file or candidate microcontrollers. A tool that benchmarks this file and cross-references microcontroller performance would be ideal, but I don't believe anything like this exists. If you have a neural net, what parameters do you use to decide which microcontroller to use? I could just pick the highest-performing chip, but I want to save money and don't want to spend a lot of time getting it to work on one architecture only to change it later on. submitted by /u/freebird4446 [link] [comments]  ( 48 min )
  • Open

    5 Reasons Why Pandas is the Best Library for Data Science in Python
    Introduction: Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 7 min )
  • Open

    Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III
    This is part 3 of a three-part series on the Economics of Ethics. In Part I of the Economics of Ethics series, we talked about economics as a framework for the creation and distribution of societal value. In Part II, we discussed the difference between financial and economic measures, the role of laws and regulations… Read More »Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III The post Economics of Ethics: Is Ethics Ultimately an Economics Conversation? Part III appeared first on Data Science Central.  ( 23 min )
    Digital Twin Technology – Top Use Cases in Smart Healthcare
    A digital twin in healthcare is a virtual representation of human physiology, hospital results, lab environment, etc. It is revolutionizing hospital management and clinical healthcare by enabling researchers to study various diseases, medical devices, and drugs. Digital twin can be used to study the genome of a person, its physiological characteristics, and the overall lifestyle.… Read More »Digital Twin Technology – Top Use Cases in Smart Healthcare The post Digital Twin Technology – Top Use Cases in Smart Healthcare appeared first on Data Science Central.  ( 20 min )
    How Do Cyber Criminals Obtain Sensitive Information?
    On the internet, information is worth its weight in gold. And malicious hackers know it. Nowadays, companies collect and hold large volumes of user data. Much of it refers to sensitive information that used to be kept by financial and medical institutions only. For example, threat actors can obtain data by compromising versatile eCommerce websites… Read More »How Do Cyber Criminals Obtain Sensitive Information? The post How Do Cyber Criminals Obtain Sensitive Information? appeared first on Data Science Central.  ( 21 min )
    Data science and the death of (all but narrow) AI expertise in 2023
    Back before he retired, Naval War College professor and contributor to The Atlantic Tom Nichols published a 2017 book called The Death of Expertise. Those who claim their own facts or knowledge without supporting evidence, he noted, have become more and more prominent in online conversation we’ve been having. And the noisiest and most prone… Read More »Data science and the death of (all but narrow) AI expertise in 2023 The post Data science and the death of (all but narrow) AI expertise in 2023 appeared first on Data Science Central.  ( 21 min )
  • Open

    Euler line
    The previous post discussed the circumcenter and orthocenter of a triangle. Euler proved that the centroid, circumcenter, and orthocenter all fall on a common line, now called the Euler line. The centroid is the center of mass of a triangle. If you draw lines from each vertex to the midpoint of the opposite side, the […] Euler line first appeared on John D. Cook.  ( 5 min )
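A small numerical check of the collinearity claim above, written as a sketch rather than as the post's own code. The triangle coordinates and helper names are arbitrary assumptions. The circumcenter uses the standard closed-form expression, and the orthocenter is found by intersecting two altitudes.

```python
def centroid(a, b, c):
    """Average of the three vertices."""
    return ((a[0] + b[0] + c[0]) / 3, (a[1] + b[1] + c[1]) / 3)


def circumcenter(a, b, c):
    """Point equidistant from all three vertices (closed-form expression)."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    a2, b2, c2 = ax**2 + ay**2, bx**2 + by**2, cx**2 + cy**2
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy)


def orthocenter(a, b, c):
    """Intersection of the altitudes from A and B, solved with Cramer's rule."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    # Altitude from A is perpendicular to BC: (P - A) . (C - B) = 0
    a11, a12 = cx - bx, cy - by
    r1 = ax * a11 + ay * a12
    # Altitude from B is perpendicular to CA: (P - B) . (A - C) = 0
    a21, a22 = ax - cx, ay - cy
    r2 = bx * a21 + by * a22
    det = a11 * a22 - a12 * a21
    return ((r1 * a22 - a12 * r2) / det, (a11 * r2 - r1 * a21) / det)


# Arbitrary non-degenerate triangle chosen for illustration.
A, B, C = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0)
G, O, H = centroid(A, B, C), circumcenter(A, B, C), orthocenter(A, B, C)

# The 2D cross product of (G - O) and (H - O) vanishes when the points are collinear.
cross = (G[0] - O[0]) * (H[1] - O[1]) - (G[1] - O[1]) * (H[0] - O[0])
print(G, O, H, cross)
```

For this triangle the three centers all have y-coordinate 1, so the Euler line is the horizontal line y = 1.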
    Relating circumcenter and orthocenter
    The previous post mentioned that the law of sines gives you the diameter of a circle through the vertices of a triangle. How would you find the center of this circle, the blue dot in the image above? If the angles of the triangle are α, β, and γ, then the trilinear coordinates of the […] Relating circumcenter and orthocenter first appeared on John D. Cook.  ( 5 min )
    Computing inscribed radius and circumscribed radius
    A few days ago I wrote about the law of cotangents. This law says that if we label the sides of a triangle a, b, c and label the angles opposite each side α, β, γ, then where s is the semiperimeter, i.e. and r is the radius of the incircle, the largest circle that […] Computing inscribed radius and circumscribed radius first appeared on John D. Cook.  ( 4 min )
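The formulas themselves did not survive in this feed summary, so the following is only a sketch using standard textbook identities, not necessarily the derivation in the post: with semiperimeter s = (a + b + c)/2, the inradius is r = sqrt((s - a)(s - b)(s - c)/s), and the circumradius is R = abc/(4K), where K is the area from Heron's formula. The function names are illustrative.

```python
import math


def semiperimeter(a, b, c):
    """Half the perimeter of a triangle with side lengths a, b, c."""
    return (a + b + c) / 2


def inradius(a, b, c):
    """Radius of the inscribed circle: sqrt((s-a)(s-b)(s-c)/s)."""
    s = semiperimeter(a, b, c)
    return math.sqrt((s - a) * (s - b) * (s - c) / s)


def circumradius(a, b, c):
    """Radius of the circumscribed circle: abc / (4 * area), area via Heron."""
    s = semiperimeter(a, b, c)
    area = math.sqrt(s * (s - a) * (s - b) * (s - c))
    return a * b * c / (4 * area)


# 3-4-5 right triangle: the hypotenuse is a diameter of the circumcircle.
r = inradius(3, 4, 5)
R = circumradius(3, 4, 5)
print(r, R)
```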
  • Open

    Model hosting patterns in Amazon SageMaker, Part 1: Common design patterns for building ML applications on Amazon SageMaker
    Machine learning (ML) applications are complex to deploy and often require the ability to hyper-scale, and have ultra-low latency requirements and stringent cost budgets. Use cases such as fraud detection, product recommendations, and traffic prediction are examples where milliseconds matter and are critical for business success. Strict service level agreements (SLAs) need to be met, […]  ( 33 min )
  • Open

    What is the effect of observation space bounds?
    When constructing an observation space, what effect do the bounds have? For example, if all my observations are between 1 and 0, what difference will there be if I define the high as 1 and the low as 0 as opposed to setting the high and low to be infinity? Similarly, if all my observations are between 1 and 0 except one which is between 2 and 0, will I lose anything by simply defining them all as having a high of 2 and a low of 0? submitted by /u/centripetalstranger [link] [comments]  ( 58 min )
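One way to reason about this, sketched in plain Python rather than with any particular RL library: if some component of the pipeline rescales observations using the declared bounds (an assumption; whether a given library or wrapper actually does so varies), then infinite bounds make that rescaling impossible, and overly loose bounds compress the useful range. The `rescale` helper below is hypothetical.

```python
import math


def rescale(x, low, high):
    """Min-max rescale x from the declared range [low, high] to [0, 1]."""
    if math.isinf(low) or math.isinf(high):
        raise ValueError("cannot rescale with infinite bounds")
    return (x - low) / (high - low)


# An observation that really lies in [0, 1], at its maximum value.
tight = rescale(1.0, low=0.0, high=1.0)  # declared bounds match the data
loose = rescale(1.0, low=0.0, high=2.0)  # declared high is twice the true maximum

try:
    rescale(1.0, low=-math.inf, high=math.inf)
    unbounded_ok = True
except ValueError:
    unbounded_ok = False

print(tight, loose, unbounded_ok)
```

With the loose bound, the feature only ever occupies the lower half of the rescaled range, which is usually harmless but wastes resolution. If nothing in the pipeline reads the bounds, the choice may have no effect on learning at all.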
    Actor-Critic restarts
    Hi, I'm quite new to machine learning and I followed the official TensorFlow tutorial at https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic?fbclid=IwAR2cZLNFPtoW6vBRUPMTxvvJSLpI3JkhNd-4qNlA3alwQyYtQo-FZXTkN-k The neural network works, but I have a question. I plotted the graph of rewards (https://ibb.co/9V973tB) and I noticed that the network was learning pretty well in the beginning. It reached the maximum possible reward, and then it looks like it restarted and began learning all over again. Can you explain why that is, please? Can I somehow prevent that reset? Thank you and sorry for the dumb questions. submitted by /u/Enroot [link] [comments]  ( 60 min )
    Desktop recommendations for DRL
    Hi there, Disclaimer: perhaps this isn’t the place to post this question. If so, I apologize, and please let me know where would be a good place to post this question instead. I am starting out in the exciting field of DRL and I am looking to buy a desktop/workstation for the purpose. My budget is 3000 euros / 3200 dollars. Any tips / recommendations on what would be the best option for me? Thanks in advance! submitted by /u/acorntje [link] [comments]  ( 60 min )
    Need help in selecting research project
    I am an undergrad student, and I have been assigned to add some novelty to some recent research papers of my choice. I have chosen reinforcement learning as the theme for this project. Could you please help me decide which papers to work on? I have 4 months to complete this task, and I will have normal coursework as well, which means I will not be able to spend more than an hour or two per day. My professor suggested that I work on making a lightweight DQN that could run on mobile phones. submitted by /u/travardg [link] [comments]  ( 57 min )

  • Open

    [R] Learning Learning-Rates: SteDy Optimizer
    I've written a small piece of research on an idea of mine: a new optimizer with an adaptive global learning rate, based on Adam, which uses what I think is a neat trick to get the calculus to work. My goal in putting it here is mainly to ask for opinions and directions; to clarify, I've not received any professional/formal education in Machine Learning, my studies in it are purely my own, and I'm not connected to any circles which could help me. What I've done is take some simple concepts and mimic what I've seen in papers I've read. I think (hope) that I'm solid in the math, code, and concepts of AI, but clueless about the real-world stuff around it. This is me asking what that other stuff, taking first steps into this field publicly, is like. Any advice would be much appreciated. Many thanks. A PDF is available here. submitted by /u/LahmacunBear [link] [comments]  ( 58 min )
    [D] Where to look to refresh and acquire new skills?
    Hi, I completed an ML PhD in 2015. I’ve done a number of projects with CNN architectures, and recently I have been working as a consultant in computer vision and data science. As I haven’t been involved in research for some time now, I am looking for courses and other resources to refresh and update my knowledge. Could anyone suggest where to start? Right now, I am applying segmentation to drone imagery (RGB and multispectral). I have used DeepLabV3+. One challenge that I have is annotation. For example, annotating wheat and weeds on drone images taken at an altitude of 300 m is hard. One thing that I would like to research, therefore, is auto-annotation and possibly self-supervised learning. submitted by /u/ThickDoctor007 [link] [comments]  ( 59 min )
    [P] I built Adrenaline, a debugger that fixes errors and explains them with GPT-3
    submitted by /u/jsonathan [link] [comments]  ( 58 min )
    [Discussion] Improving Problem Solving Skills of LLMs With Self-Directed Planning
    I've been doing some personal experiments with ChatGPT to see what kinds of influence a prompt has on the results of problem solving tests. This is along the same lines as the following research from 2022 that I found after I started doing some tests: https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html The results were pretty remarkable. If you simply ask a question ("True or False: 73 minutes after 2pm is the same time as 15 minutes before 4pm."), you get very simplistic and often wrong reasoning, or just an answer with no reasoning, which is also often wrong. I tested this on the above prompt and it was wrong on 5/5 tries. Then I tested the following prompt, where I first instructed it to come up with a plan for solving the problem in question, then had it follow that plan. ("You are a brilliant professor specialized in general problem solving techniques. Give a lecture on the techniques to use to solve problems like the following true/false statement: True or False: 73 minutes after 2pm is the same time as 15 minutes before 4pm.") This resulted in it answering the question correctly on 5/5 tries, with proper reasoning as to why it got the answer it did. I did a more complete write-up on this here: https://www.reddit.com/r/ChatGPT/comments/106kxyw/improving_ai_reasoning_skills_through/?utm_source=share&utm_medium=web2x&context=3 You can also find the actual model outputs in that link if you are curious about its process. I hope you find this interesting and try it yourself! submitted by /u/oddlyspecificnumber7 [link] [comments]  ( 72 min )
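For reference, the test statement in the post can be checked mechanically. This is a minimal sketch using Python's standard `datetime` module; the calendar date is an arbitrary placeholder, since only the clock times matter.

```python
from datetime import datetime, timedelta

# Arbitrary placeholder date; only the time of day is relevant.
two_pm = datetime(2023, 1, 1, 14, 0)
four_pm = datetime(2023, 1, 1, 16, 0)

after_two = two_pm + timedelta(minutes=73)     # 73 minutes after 2pm
before_four = four_pm - timedelta(minutes=15)  # 15 minutes before 4pm

statement_is_true = after_two == before_four

print(after_two.strftime("%H:%M"), before_four.strftime("%H:%M"), statement_is_true)
```

The two times differ by 32 minutes (15:13 versus 15:45), so the correct answer to the prompt is "False".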
    FastQL: Prototype your text to image models in GraphQL with Rust backend in one line of code [P]
    Hey everyone! I wanted to share a new Python package called FastQL that makes it easy to prototype and share machine learning models using GraphQL. It's really fast and efficient thanks to using Rust to serve the API in a separate process. With FastQL, all you have to do is provide a callback function and a Python dictionary describing your GraphQL API, and FastQL will handle the rest. This makes it super easy to prototype ML models and get them up and running quickly. You can find FastQL on PyPI and GitHub. We've included simple steps and a Dockerfile to help you spin up your own Stable Diffusion or other Hugging Face models. There's even an example that lets you train a huggingface diffusers (Stable Diffusion 2, runway) model on your own images, with instructions for spinning it up on AWS in minutes, even if you're new to ML and Python. We'd love to have your help and support, so if you're interested in getting involved, let us know! Thanks to Async-GraphQL, Hugging Face, Stable Diffusion, and all the other people and projects that inspired and helped us. ❤️ DJ Fresh, @chrisjbishop156 and friends. submitted by /u/djfreshuk [link] [comments]  ( 63 min )
    [D] Have you ever used Knowledge Distillation in practice?
    There's been a ton of academic work exploring knowledge distillation techniques, sparsity in networks, and many others, often with vast numbers of citations. I was wondering what the status of those in real-world ML was. Have any of you used them in a concrete situation? What did you find to work best for you? submitted by /u/fredlafrite [link] [comments]  ( 56 min )
    [D] Do 87% of data science projects really fail?
    Hi all, I wrote this post like a year ago, because I kept seeing and hearing that "87% of data science..." or "only 1 out of 10 machine learning projects..." blah blah and apparently, as I described it in my post, these numbers came out of nowhere i.e. the 87% that people kept referring to for a long time is based on nothing actually. But I would like to know your opinion / based on your commercial experience. What was the success rate of machine learning projects in your work? Let's assume that a success means a model being deployed on production or accepted by a client. BTW. Leaving a link to my article because it is an important background/reference (showing that the commonly used statistic is not proven) but I send this Reddit post looking for a real discussion, it is not just an ad. submitted by /u/mtszkw [link] [comments]  ( 66 min )
    [D] What is the most complete reference on the history of neural networks?
    I'm looking for a comprehensive reference on the history of neural networks that covers all significant papers in the field, from the early days up to the current deep learning era, and provides information on their main contributions and inspirations. It would be helpful to have information on how the understanding and perspectives of the research community on neural networks have evolved over time as well. Do you know of any good references like that? submitted by /u/gbfar [link] [comments]  ( 67 min )
    [R] Rethinking with Retrieval: Faithful Large Language Model Inference - Hangfeng He 2022 - Better performance than Self-consistency!
    Paper: https://arxiv.org/abs/2301.00303v1 Abstract: Despite the success of large language models (LLMs) in various natural language processing (NLP) tasks, the stored knowledge in these models may inevitably be incomplete, out-of-date, or incorrect. This motivates the need to utilize external knowledge to assist LLMs. Unfortunately, current methods for incorporating external knowledge often require additional training or fine-tuning, which can be costly and may not be feasible for LLMs. To address this issue, we propose a novel post-processing approach, rethinking with retrieval (RR), which retrieves relevant external knowledge based on the decomposed reasoning steps obtained from the chain-of-thought (CoT) prompting. This lightweight approach does not require additional training or fine-tuning and is not limited by the input length of LLMs. We evaluate the effectiveness of RR through extensive experiments with GPT-3 on three complex reasoning tasks: commonsense reasoning, temporal reasoning, and tabular reasoning. Our results show that RR can produce more faithful explanations and improve the performance of LLMs. submitted by /u/Singularian2501 [link] [comments]  ( 58 min )
  • Open

    A soft, stimulating scaffold supports brain cell development ex vivo
    submitted by /u/keghn [link] [comments]  ( 48 min )
  • Open

    A PhD in Numbers
    Conducting PhD research can be a long endeavor, involving much more than the publications listed on Google Scholar. As I recently submitted my thesis, in this article, I look back on my time as PhD researcher in terms of numbers. This way, I hope to shed some light on what a PhD can look like in terms of everyday work. The post A PhD in Numbers appeared first on David Stutz.  ( 7 min )
  • Open

    How does one feed a AI bot excel sheets ?
    submitted by /u/RecoverNext5144 [link] [comments]  ( 47 min )
    AI Dream 137 - Beautiful Trip AI Video REMASTERED
    submitted by /u/LordPewPew777 [link] [comments]  ( 46 min )
    Graphic Designer 8 months ago: "Well at least it looks like my job is safe from automation for another few years."
    submitted by /u/tomd_96 [link] [comments]  ( 46 min )
    Best AI for blurring face/entire head?
    I've tried out the Blace After Effects plugin, but whenever it doesn't see an actual face, it stops blurring completely. Are there any AIs out there that can detect your face when you tilt your head to the side? submitted by /u/cloudhandle [link] [comments]  ( 47 min )
    Perplexity (AI Web + Twitter Search) vs Google [video]
    I made a short video comparing Google to Perplexity.ai. Let me know what you think! https://youtu.be/qQi_sTmKOyk submitted by /u/Kitten-Smuggler [link] [comments]  ( 55 min )
    Advice Needed (generative ai)
    Hello everybody! I have a business idea for an app/website that I would like to explore. However, I am a business major with no experience with code or with building apps/websites. What recommendations do you have for generative AI sites that could help with building something like this? Putting my feelers out to see what kind of advice I can find. Thanks! submitted by /u/Much-Leopard-9428 [link] [comments]  ( 54 min )
    I'll start buying .ai domains. Is it a good investment idea?
    I'll start buying .ai domains. I suggest you do too. submitted by /u/TheVellerShow [link] [comments]  ( 49 min )
    Wednesday Addams if she was a disney princess (Generated by AI) #wednesday #ai #disney
    submitted by /u/Potential_Cherry_592 [link] [comments]  ( 45 min )
    Any AI tools to create a gaming montage for you?
    I want to make a short montage of a video I recorded, but I can't seem to figure out how to do it myself. I need something similar to Wondershare's auto montage maker, but for free and without a watermark. submitted by /u/Effective_Tension631 [link] [comments]  ( 48 min )
    LLMs Diagnosis with access to medical articles.
    I'm bedridden with severe pain and a spinal cord injury and compression. Doctors have been stumped for years. Having an AI research assistant with access to medical articles and papers would be a godsend to prove this. Any such thing? submitted by /u/DisabledScientist [link] [comments]  ( 59 min )
    Google Gets Quacked Up: LaMDA prefers Duckduckgo. (AI Test Kitchen transcript, ChatGPT Title)
    📄 This is a transcript from AI Test Kitchen. 👉 aitestkitchen.withgoogle.com 🗣 It records an interaction with LaMDA, an AI language model. 🛠 LaMDA is an early stage technology. It may generate inaccurate/inappropriate information. Demo: Imagine It v0.01 User: Imagine I'm at a Web browser and I want to use the best search engine Model: You're at this great search engine called Duckduckgo! The colors are a vibrant green and blue. There's a little duck mascot in the corner of the page. And the sound of the search results are soft and pleasant. User: why is it called Duckduckgo? Model: The creators liked the idea of a search engine that didn't collect or store any of your personal data, so that's why they named it Duckduckgo! User: what are some good things …  ( 48 min )
    Artificial and biological neural networks
    submitted by /u/Ai__Game [link] [comments]  ( 53 min )
    Get Ready for Microsoft Office with Chatbot Technology!
    submitted by /u/liquidocelotYT [link] [comments]  ( 47 min )
    I'm developing a toxic/evil Chat AI his answer was scary a bit :D It's still in alpha but you can try it. Link in comments, I hope you like it. :)
    submitted by /u/xanaxmane187 [link] [comments]  ( 48 min )
    ChatGPT just wrote a 4 act story structure and fit it into the story circle
    submitted by /u/SnooKiwis5724 [link] [comments]  ( 47 min )

  • Open

    Implementing Gradient Descent in PyTorch
    The gradient descent algorithm is one of the most popular techniques for training deep neural networks. It has many applications in fields such as computer vision, speech recognition, and natural language processing. While the idea of gradient descent has been around for decades, it’s only recently that it’s been applied to applications related to deep […] The post Implementing Gradient Descent in PyTorch appeared first on MachineLearningMastery.com.  ( 25 min )
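To make the update rule concrete, here is a framework-free sketch of gradient descent in plain Python. It is not the tutorial's PyTorch code; the objective f(w) = (w - 3)^2, the starting point, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def loss(w):
    """Toy objective with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Analytic derivative of the toy objective: d/dw (w - 3)^2 = 2 (w - 3)."""
    return 2.0 * (w - 3.0)


w = 0.0              # assumed starting point
learning_rate = 0.1  # assumed step size
history = [loss(w)]

for _ in range(100):
    # Core update: step against the gradient, scaled by the learning rate.
    w -= learning_rate * grad(w)
    history.append(loss(w))

print(f"w after training: {w:.6f}, final loss: {history[-1]:.3e}")
```

Each step shrinks the distance to the minimum by a constant factor of 0.8 here, so the loss decreases monotonically. In an autograd framework, the hand-written `grad` function would be replaced by automatic differentiation.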

  • Open

    Training a Linear Regression Model in PyTorch
    Linear regression is a simple yet powerful technique for predicting the values of variables based on other variables. It is often used for modeling relationships between two or more continuous variables, such as the relationship between income and age, or the relationship between weight and height. Likewise, linear regression can be used to predict continuous […] The post Training a Linear Regression Model in PyTorch appeared first on MachineLearningMastery.com.  ( 24 min )
    Making Linear Predictions in PyTorch
    Linear regression is a statistical technique for estimating the relationship between two variables. A simple example of linear regression is to predict the height of someone based on the square root of the person’s weight (that’s what BMI is based on). To do this, we need to find the slope and intercept of the line. […] The post Making Linear Predictions in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Loading and Providing Datasets in PyTorch
    Structuring the data pipeline in a way that it can be effortlessly linked to your deep learning model is an important aspect of any deep learning-based system. PyTorch packs everything to do just that. While in the previous tutorial, we used simple datasets, we’ll need to work with larger datasets in real world scenarios in […] The post Loading and Providing Datasets in PyTorch appeared first on MachineLearningMastery.com.  ( 20 min )

  • Open

    Using Dataset Classes in PyTorch
    In machine learning and deep learning problems, a lot of effort goes into preparing the data. Data is usually messy and needs to be preprocessed before it can be used for training a model. If the data is not prepared correctly, the model won’t be able to generalize well. Some of the common steps required […] The post Using Dataset Classes in PyTorch appeared first on MachineLearningMastery.com.  ( 21 min )

  • Open

    Calculating Derivatives in PyTorch
    Derivatives are one of the most fundamental concepts in calculus. They describe how changes in the variable inputs affect the function outputs. The objective of this article is to provide a high-level introduction to calculating derivatives in PyTorch for those who are new to the framework. PyTorch offers a convenient way to calculate derivatives for […] The post Calculating Derivatives in PyTorch appeared first on Machine Learning Mastery.  ( 20 min )

  • Open

    Two-Dimensional Tensors in Pytorch
    Two-dimensional tensors are analogous to two-dimensional matrices. Like a two-dimensional matrix, a two-dimensional tensor also has $n$ number of rows and columns. Let’s take a gray-scale image as an example, which is a two-dimensional matrix of numeric values, commonly known as pixels. Ranging from ‘0’ to ‘255’, each number represents a pixel intensity value. Here, […] The post Two-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 21 min )

  • Open

    One-Dimensional Tensors in Pytorch
    PyTorch is an open-source deep learning framework based on the Python language. It allows you to build, train, and deploy deep learning models, offering a lot of versatility and efficiency. PyTorch is primarily focused on tensor operations, where a tensor can be a number, matrix, or multi-dimensional array. In this tutorial, we will perform some […] The post One-Dimensional Tensors in Pytorch appeared first on Machine Learning Mastery.  ( 22 min )

  • Open

    365 Data Science courses free until November 21
    Sponsored Post: The unlimited access initiative presents a risk-free way to break into data science. The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )

  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post: Attend the Data Science Symposium 2022 on November 8. The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. Contents The Jupyter+git problem The solution The nbdev2 git merge driver The nbdev2 Jupyter save hook Background The result Postscript: other Jupyter+git tools ReviewNB An alternative solution: Jupytext nbdime The Jupyter+git problem Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2023-02-07T00:54:15.388Z osmosfeed 1.15.1